H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditioned video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))
Hai Ci, Xiaokang Liu
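
The summary describes a three-stage flow: erase the human actor via inpainting, extract pose cues from the human motion, and condition a fine-tuned video diffusion model on both. The sketch below is a minimal, hypothetical skeleton of that flow under those assumptions; every function name (`inpaint_human`, `extract_pose_cues`, `robot_video_diffusion`) and the stubbed bodies are illustrative placeholders, not the paper's actual code or the Wan 2.2 API.

```python
# Hypothetical sketch of an H2R-Grounder-style inference pipeline.
# All names and stub implementations here are assumptions for illustration.
import numpy as np

Video = np.ndarray  # (T, H, W, 3) uint8 frames


def inpaint_human(video: Video) -> Video:
    """Remove the human from each frame, leaving the scene and objects.

    Stub: a real pipeline would use segmentation + video inpainting.
    """
    return video.copy()


def extract_pose_cues(video: Video) -> Video:
    """Render per-frame pose cues (e.g., hand/end-effector trajectories).

    Stub: a real pipeline would map detected hand poses to robot-arm cues.
    """
    return np.zeros_like(video)


def robot_video_diffusion(scene: Video, cues: Video) -> Video:
    """Stand-in for a fine-tuned Wan 2.2 video diffusion model that
    synthesizes a temporally coherent robot-manipulation video conditioned
    on the inpainted scene and the pose cues."""
    return scene  # placeholder: returns the conditioning video unchanged


def human_to_robot(video: Video) -> Video:
    # 1) bridge the embodiment gap: erase the human actor
    scene = inpaint_human(video)
    # 2) keep the interaction: extract motion/pose cues from the human
    cues = extract_pose_cues(video)
    # 3) synthesize the robot video conditioned on scene + cues
    return robot_video_diffusion(scene, cues)


if __name__ == "__main__":
    demo = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
    print(human_to_robot(demo).shape)  # (16, 64, 64, 3)
```

Note that this structure reflects why no paired data is needed: only the final synthesis stage is learned (from unpaired robot videos), while the inpainted scene and pose cues are derived directly from the human video at inference time.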