Video

Research papers, repositories, and articles about video

Showing 5 of 5 items

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.

Jianxiong Gao, Zhaoxi Chen

Recurrent Video Masked Autoencoders

Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go‑to backbone for long-horizon video understanding.

Daniel Zoran, Nikhil Parthasarathy

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

HF surfaces H2R-Grounder as a robotics paper that uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It’s notable because it points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.

Sangwoon Kwak, Weeyoung Kwon