Video
Research papers, repositories, and articles about video
Kling-Omni Technical Report
Kling-Omni is a unified system for generating and editing high-end video from text, images, and video context. Treat it as a reference design for next-gen multimodal world simulators.
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
LoomVideo packs video generation and editing into a single 5B model that talks to a multimodal language backbone. A clever "scale-and-add" trick lets it edit videos without doubling sequence length, so you get big speedups at similar quality. If you’re exploring small but strong video models, this is a new anchor point.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM turns heavy video diffusion models into fast, interactive "video world" simulators. It provides a full pipeline from data to few-step generators that run close to real time. If you care about agents in simulated worlds, this is an end-to-end recipe you can actually clone and run.
Action100M: A Large-scale Video Action Dataset
Action100M is a dataset built fully automatically from over a million how-to videos, yielding around 100 million labeled action snippets. It uses V-JEPA features plus a GPT-based pipeline to label segments, and its scale supports clean scaling curves for action recognition models.
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
WorldCanvas lets you script rich video scenes using text, object trajectories, and reference images instead of frame-by-frame editing. Use this as a template for building controllable world simulators and advanced video tools.
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Hugging Face highlights WorldCanvas for its controllable video generation via text, paths, and reference images. Builders of simulation-style UIs should copy the interface ideas here.
Recurrent Video Masked Autoencoders
Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go-to backbone for long-horizon video understanding.
OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
OmniMem adds an external memory system to long-video generators so they can re-use past details instead of re-encoding full histories. It adapts which frames to keep, letting models generate longer, more coherent clips under fixed compute. If you’re chasing hour-scale video worlds, this is a template for managing context.
harry0703/MoneyPrinterTurbo
One-click short-video generator powered by large language models and media models. Non-technical creators can go from idea to polished clips extremely fast. If you care about AI-native content workflows, this is a live example of how far automation has moved.
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
VideoAR builds a large visual autoregressive model that predicts videos frame by frame across multiple scales. It narrows the quality gap with diffusion models while needing far fewer steps, which makes long video generation cheaper to run.
browser-use/video-use
Lets coding agents edit videos programmatically. Bridges dev tooling and media pipelines. If you're eyeing agentic video editing or auto-content workflows, this is a strong starting point. ([github.com](https://github.com/trending?since=daily))
EasyV2V: A High-quality Instruction-based Video Editing Framework
EasyV2V upgrades text-controlled video editing by generating training pairs from existing expert models and images. If you're building video tools, this paper is a recipe for better data and architectures.
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Avatar Forcing combines a diffusion-based generator with a preference-optimization trick to drive talking-head avatars in real time. It reacts to a user’s speech and body motion with low latency, producing more expressive, conversational faces without needing labeled interaction data.
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
VIVA uses a vision-language model to encode instructions and a reward-optimized diffusion model to edit videos. Great blueprint for anyone mixing video generation with RL-style feedback.
AIDC-AI/Pixelle-Video
End-to-end pipeline for fully automated short-form video creation with AI. Takes scripts or prompts to generate clips, edits, and captions. If you run content operations, this shows where AI video automation is headed in practice. ([github.com](https://github.com/trending/python?since=daily))
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Hugging Face surfaces H2R-Grounder as a robotics paper that uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It’s notable because it points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Seedance 1.5 pro jointly generates video and sound from one model rather than bolting audio on later. Content teams can use this to explore tightly synced audio-visual experiences.
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Introduces a one-dimensional causal image tokenizer built on mean flows, so tokens are produced in a fixed causal order. Aims at sharper, more stable video and streaming image generation.
alexfazio/viral-clips-crew
CrewAI-powered video editing workflow aimed at making viral clips. Content creators can copy the stack rather than wiring agents and editors from scratch.
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.