Video

Research papers, repositories, and articles about video

Kling-Omni Technical Report

Kling-Omni is a unified system for generating and editing high-end video from text, images, and video context. Treat it as a reference design for next-gen multimodal world simulators.

Kling Team, Jialu Chen

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

LoomVideo packs video generation and editing into a single 5B model that talks to a multimodal language backbone. A clever "scale-and-add" trick lets it edit videos without doubling sequence length, so you get big speedups at similar quality. If you’re exploring small but strong video models, this is a new anchor point.

Jianzong Wu, Hao Lian

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

minWM turns heavy video diffusion models into fast, interactive "video world" simulators. It provides a full pipeline from data to few-step generators that run close to real time. If you care about agents in simulated worlds, this is an end-to-end recipe you can actually clone and run.

Min Zhao, Hongzhou Zhu

Action100M: A Large-scale Video Action Dataset

Action100M is a dataset built fully automatically from over a million how-to videos, yielding around 100 million labeled action snippets. It uses V-JEPA features plus a GPT-based pipeline to label segments, and it unlocks clean scaling curves for action recognition models.

Delong Chen, Tejaswi Kasarla

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

WorldCanvas lets you script rich video scenes using text, object trajectories, and reference images instead of frame-by-frame editing. Use this as a template for building controllable world simulators and advanced video tools.

Hanlin Wang, Hao Ouyang

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.

Jianxiong Gao, Zhaoxi Chen

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hugging Face highlights WorldCanvas for its controllable video generation via text, paths, and reference images. Builders of simulation-style UIs should copy the interface ideas here.

Hanlin Wang, Hao Ouyang

Recurrent Video Masked Autoencoders

Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go-to backbone for long-horizon video understanding.

Daniel Zoran, Nikhil Parthasarathy

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

OmniMem adds an external memory system to long-video generators so they can reuse past details instead of re-encoding full histories. It adapts which frames to keep, letting models generate longer, more coherent clips under fixed compute. If you’re chasing hour-scale video worlds, this is a template for managing context.

Lin Zhao, Yushu Wu

harry0703/MoneyPrinterTurbo

One-click short-video generator powered by large language models and media models. Non-technical creators can go from idea to polished clips extremely fast. If you care about AI-native content workflows, this is a live example of how far automation has moved.

74,090

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

VideoAR builds a large visual autoregressive model that predicts videos frame by frame across multiple scales. It narrows the quality gap with diffusion models while needing far fewer steps, which makes long video generation cheaper to run.

Longbin Ji, Xiaoxiong Liu

browser-use/video-use

Lets coding agents edit videos programmatically. Bridges dev tooling and media pipelines. If you're eyeing agentic video editing or auto-content workflows, this is a strong starting point. ([github.com](https://github.com/trending?since=daily))

11,017

EasyV2V: A High-quality Instruction-based Video Editing Framework

EasyV2V upgrades text-controlled video editing by cleverly generating training pairs from existing experts and images. If you're building video tools, this paper is a recipe for better data and architectures.

Jinjie Mai, Chaoyang Wang

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing combines a diffusion-based generator with a preference-optimization trick to drive talking-head avatars in real time. It reacts to a user’s speech and body motion with low latency, producing more expressive, conversational faces without needing labeled interaction data.

Taekyung Ki, Sangwon Jang

VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

VIVA uses a vision-language model to encode instructions and a reward-optimized diffusion model to edit videos. Great blueprint for anyone mixing video generation with RL-style feedback.

Xiaoyan Cong, Haotian Yang

AIDC-AI/Pixelle-Video

End-to-end pipeline for fully automated short-form video creation with AI. Takes scripts or prompts to generate clips, edits, and captions. If you run content operations, this shows where AI video automation is headed in practice. ([github.com](https://github.com/trending/python?since=daily))

10,029

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Hugging Face surfaces H2R-Grounder as a robotics paper that uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It’s notable because it points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Seedance 1.5 pro jointly generates video and sound from one model rather than bolting audio on later. Content teams can use this to explore tightly synced audio-visual experiences.

Heyi Chen, Siyan Chen

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Introduces a one-dimensional causal image tokenizer built on mean flows, so tokens follow a fixed causal order. Aims at sharper, more stable video and streaming image generation.

Yitong Chen, Zuxuan Wu

alexfazio/viral-clips-crew

CrewAI-powered video editing workflow aimed at making viral clips. Content creators can copy the stack rather than wiring agents and editors from scratch.

685

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.

Sangwoon Kwak, Weeyoung Kwon