Video
Research papers, repositories, and articles about video
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.
Recurrent Video Masked Autoencoders
Extends masked autoencoding to video with a recurrent architecture that can process long clips efficiently. Instead of treating frames independently or relying on heavy 3D convolutions, the model reuses temporal state to reconstruct masked patches over time, improving efficiency and temporal coherence. Strong authorship from the Zisserman/Carreira lineage suggests this could become a go-to backbone for long-horizon video understanding.
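To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of one recurrent masked-autoencoder step over video; the module names, shapes, and the GRU-based temporal state are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch (not the paper's code): a recurrent masked-autoencoder step.
# All module names, dimensions, and the GRU-based temporal state are assumptions.
import torch
import torch.nn as nn

class RecurrentVideoMAE(nn.Module):
    def __init__(self, patch_dim=768, hidden_dim=768, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(patch_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Recurrent temporal state carried across frames instead of 3D attention/convs.
        self.temporal = nn.GRUCell(patch_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim + patch_dim, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))

    def forward(self, frames):            # frames: (B, T, N_patches, patch_dim)
        B, T, N, D = frames.shape
        state = frames.new_zeros(B, self.temporal.hidden_size)  # summary of past frames
        losses = []
        for t in range(T):
            x = frames[:, t]                              # (B, N, D)
            keep = int(N * (1 - self.mask_ratio))
            idx = torch.rand(B, N, device=x.device).argsort(dim=1)
            vis_idx, mask_idx = idx[:, :keep], idx[:, keep:]
            vis = torch.gather(x, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))
            enc = self.encoder(vis)                       # encode visible patches only
            state = self.temporal(enc.mean(dim=1), state) # update recurrent temporal state
            # Predict masked patches from the temporal state plus a learned mask token.
            q = self.mask_token.expand(B, mask_idx.shape[1], D)
            pred = self.decoder(torch.cat(
                [q, state.unsqueeze(1).expand(-1, mask_idx.shape[1], -1)], dim=-1))
            tgt = torch.gather(x, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
            losses.append((pred - tgt).pow(2).mean())
        return torch.stack(losses).mean()
```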
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. It's notable because it points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))
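The summary above implies a simple data flow: erase the human from the footage, keep the scene, convert hand motion into a robot trajectory, and let a diffusion model trained only on robot videos synthesize the result. A hypothetical sketch of that flow, using placeholder stubs rather than H2R-Grounder's actual components or API:

```python
# Hypothetical pipeline sketch of the paired-data-free idea described above.
# Every function below is a placeholder stub, not the paper's implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class TranslationInputs:
    frames: np.ndarray        # (T, H, W, 3) human interaction video
    hand_poses: np.ndarray    # (T, 21, 3) per-frame hand keypoints (assumed pose cue)

def inpaint_human(frames: np.ndarray) -> np.ndarray:
    """Remove the human from each frame, keeping the scene and objects (stub)."""
    return frames  # placeholder: a real system would run a video inpainting model

def retarget_to_robot(hand_poses: np.ndarray) -> np.ndarray:
    """Map hand keypoints to an end-effector trajectory (stub)."""
    return hand_poses.mean(axis=1)  # (T, 3) crude proxy for a gripper path

def robot_video_diffusion(background: np.ndarray, trajectory: np.ndarray) -> np.ndarray:
    """Stand-in for a video diffusion model, fine-tuned on robot videos only,
    conditioned on the inpainted background and the retargeted trajectory."""
    return background  # placeholder output

def translate(inputs: TranslationInputs) -> np.ndarray:
    background = inpaint_human(inputs.frames)          # erase the human embodiment
    trajectory = retarget_to_robot(inputs.hand_poses)  # pose cue -> robot motion
    return robot_video_diffusion(background, trajectory)
```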
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
MoRel is a 4D Gaussian Splatting framework designed for long, motion-heavy videos, where naive 4DGS breaks down due to memory blowup and temporal flicker. It introduces anchor relay–based bidirectional blending and feature-variance–guided densification to maintain temporal coherence and handle occlusions over long time spans, and comes with a new long-range motion dataset for evaluation.
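For intuition, here is a tiny sketch (an assumption-laden simplification, not MoRel's code) of blending a Gaussian's attributes between two temporal anchors, which is the basic shape of relay-style bidirectional blending:

```python
# Minimal sketch: interpolate Gaussian centers between the two nearest temporal
# anchors. Variable names and the linear blend are illustrative assumptions.
import numpy as np

def blend_between_anchors(t, anchor_times, anchor_means):
    """anchor_times: sorted (K,) timestamps; anchor_means: (K, N, 3) Gaussian centers."""
    j = np.searchsorted(anchor_times, t)
    j = np.clip(j, 1, len(anchor_times) - 1)
    t0, t1 = anchor_times[j - 1], anchor_times[j]
    w = (t - t0) / max(t1 - t0, 1e-8)          # 0 at the left anchor, 1 at the right
    # Bidirectional blend: contributions from both the preceding and following anchor.
    return (1.0 - w) * anchor_means[j - 1] + w * anchor_means[j]

# Example: 4 anchors relaying a single Gaussian along a line over time.
times = np.array([0.0, 1.0, 2.0, 3.0])
means = np.stack([np.full((1, 3), k, dtype=float) for k in range(4)])  # (4, 1, 3)
print(blend_between_anchors(1.5, times, means))  # midway between anchors 1 and 2
```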