Robotics

Research papers, repositories, and articles about robotics

Evaluating Gemini Robotics Policies in a Veo World Simulator

Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. Validated against 1,600+ physical trials, the Veo-based simulation predicts real-world policy rankings and failure modes, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin

Evaluating Gemini Robotics Policies in a Veo World Simulator

Uses a fine-tuned Veo video model as a generative world simulator for robot policy evaluation, covering in-distribution tasks, OOD generalization axes, and physical/semantic safety tests. The key takeaway is that high-fidelity video models can stand in for many expensive real-world trials while still predicting policy rankings and vulnerabilities reliably. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin
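
To make the evaluation setup concrete, here is a minimal sketch of rolling a manipulation policy out inside a generative video world model and ranking policies by simulated success rate. All names (the world-model step, the policy callable, `judge_success`) are illustrative placeholders, not the Gemini Robotics or Veo API.

```python
"""Minimal sketch of evaluating robot policies inside a generative video world model.

All names here (the world-model step, the policy callable, the success judge) are
illustrative placeholders, not the Gemini Robotics / Veo API.
"""
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

import numpy as np

Frame = np.ndarray   # H x W x 3 RGB frame
Action = np.ndarray  # e.g. end-effector delta pose + gripper command


@dataclass
class Rollout:
    frames: List[Frame]
    actions: List[Action]
    success: bool


def rollout_in_video_world(
    policy: Callable[[Frame, str], Action],                    # (frame, instruction) -> action
    world_model_step: Callable[[List[Frame], Action], Frame],  # video model predicts next frame
    judge_success: Callable[[List[Frame], str], bool],         # e.g. a VLM-based judge
    initial_frame: Frame,
    instruction: str,
    horizon: int = 50,
) -> Rollout:
    """Roll a manipulation policy forward entirely inside the video model."""
    frames, actions = [initial_frame], []
    for _ in range(horizon):
        action = policy(frames[-1], instruction)
        # The generative video model stands in for real hardware: it predicts
        # the next observation conditioned on the actions taken so far.
        frames.append(world_model_step(frames, action))
        actions.append(action)
    return Rollout(frames, actions, judge_success(frames, instruction))


def rank_policies(
    policies: Dict[str, Callable[[Frame, str], Action]],
    world_model_step,
    judge_success,
    scenarios: List[Tuple[Frame, str]],  # (initial frame, instruction) pairs,
                                         # including OOD and safety-critical ones
) -> Dict[str, float]:
    """Average simulated success rate per policy, used to predict real-world rankings."""
    scores = {}
    for name, policy in policies.items():
        results = [
            rollout_in_video_world(policy, world_model_step, judge_success, frame, instr).success
            for frame, instr in scenarios
        ]
        scores[name] = float(np.mean(results))
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Keeping the world-model step and the success judge as injected callables keeps the loop agnostic to which video model or evaluator backs them, which is what lets the same harness cover nominal, OOD, and safety-critical scenarios.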

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Hugging Face surfaces H2R-Grounder as a robotics paper that uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It is notable because it points toward scaling robot learning from the vast pool of human videos on the internet without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu
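
A rough sketch of the inference flow described above, under the assumption that it decomposes into human segmentation, scene inpainting, pose retargeting, and conditional video diffusion. Every helper name below is a hypothetical placeholder, not the authors' released code.

```python
"""Sketch of an H2R-style human-to-robot video translation pipeline.

Stage names (segment_human, inpaint_video, render_robot_pose_cues, etc.) are
illustrative stand-ins for the paper's inpainting + pose-conditioning +
fine-tuned video diffusion stages, not a released API.
"""
import numpy as np


def robotize_human_video(human_video: np.ndarray,  # (T, H, W, 3) RGB clip
                         segment_human,            # clip -> (T, H, W) bool human mask
                         inpaint_video,            # remove the human, keep scene/objects
                         extract_hand_poses,       # human hand trajectories from the clip
                         retarget_to_robot,        # map hand poses to robot end-effector poses
                         render_robot_pose_cues,   # rasterize robot poses as conditioning frames
                         diffusion_model) -> np.ndarray:
    """Translate a human manipulation clip into a robot manipulation clip."""
    # 1. Remove the human embodiment but keep the manipulated objects and scene.
    human_mask = segment_human(human_video)
    scene_video = inpaint_video(human_video, human_mask)

    # 2. Bridge the embodiment gap: turn human hand motion into robot pose cues.
    hand_poses = extract_hand_poses(human_video)
    robot_poses = retarget_to_robot(hand_poses)
    pose_cues = render_robot_pose_cues(robot_poses, scene_video.shape)

    # 3. A conditional video diffusion model (a fine-tuned video model in the paper)
    #    synthesizes a temporally coherent robot video grounded in the original scene.
    return diffusion_model.generate(scene=scene_video, condition=pose_cues)
```

Because each stage is passed in as a callable, the sketch makes no claim about which segmenter, inpainter, or diffusion backbone the authors actually use; it only illustrates how unpaired robot videos plus pose conditioning can replace paired human-robot supervision.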

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.

Kehong Gong, Zhengyu Wen
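
As a rough illustration of how a single decoder can drive arbitrary skeletons, here is a hedged PyTorch sketch in which the target rig's rest pose and kinematic tree are encoded as query tokens for a shared motion decoder. The module layout and dimensions are assumptions made for illustration, not the paper's architecture.

```python
"""Sketch of category-agnostic motion capture: drive an arbitrary rig from video features.

Module names and shapes (VideoEncoder as an LSTM, a Transformer motion decoder,
a 6D-rotation head) are illustrative stand-ins for the paper's
reference-guided, factorized pipeline.
"""
import torch
import torch.nn as nn


class CategoryAgnosticMoCap(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.video_encoder = nn.LSTM(input_size=512, hidden_size=feat_dim, batch_first=True)
        self.joint_embed = nn.Linear(3 + 1, feat_dim)   # rest-pose offset + parent index
        self.motion_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.rot_head = nn.Linear(feat_dim, 6)          # 6D rotation per joint per frame

    def forward(self, video_feats, rest_offsets, parent_idx):
        """
        video_feats:  (B, T, 512)  per-frame visual features of the subject in the video
        rest_offsets: (B, J, 3)    joint offsets of the *target* rig's rest pose
        parent_idx:   (B, J, 1)    parent joint indices (the rig's kinematic tree)
        returns:      (B, T, J, 6) rotations that directly drive that rig
        """
        B, T, _ = video_feats.shape
        ctx, _ = self.video_encoder(video_feats)                       # (B, T, D)
        # Encode the target skeleton as tokens, so the same decoder serves
        # any skeleton topology (human, animal, arbitrary rigged asset).
        joints = self.joint_embed(torch.cat([rest_offsets, parent_idx.float()], dim=-1))
        queries = joints.unsqueeze(1).expand(B, T, -1, -1).reshape(B * T, -1, ctx.shape[-1])
        memory = ctx.reshape(B * T, 1, -1)
        decoded = self.motion_decoder(queries, memory)                 # (B*T, J, D)
        return self.rot_head(decoded).reshape(B, T, -1, 6)
```

The point of the sketch is only the interface: the skeleton itself is an input, so the output motion is expressed directly in the joints of whatever rigged asset is provided.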

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

X-Humanoid presents a scalable way to "robotize" human videos, turning ordinary human motion into humanoid-robot video at scale. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it can translate complex third-person human motions into physically plausible humanoid animations, unlocking web-scale data for embodied AI.

Pei Yang, Hai Ci
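
A hedged sketch of the overall recipe: retarget human motion onto a humanoid rig, render aligned pairs in a game engine, and fine-tune a video-to-video model on the synthetic pairs. All helpers (`retarget_to_humanoid`, `render_in_engine`, the clip attributes) and the training loop are illustrative placeholders, not the paper's released tooling.

```python
"""Sketch of building synthetic human->humanoid paired data for video translation.

All helpers and data attributes are hypothetical placeholders; the paper builds
its paired dataset with Unreal Engine rendering and adapts a video generative model.
"""

def build_paired_dataset(human_clips, retarget_to_humanoid, render_in_engine):
    """Turn human motion clips into (human video, humanoid video) training pairs."""
    pairs = []
    for clip in human_clips:
        # Retarget the human skeleton motion onto the humanoid robot rig.
        humanoid_motion = retarget_to_humanoid(clip.motion)
        # Render both embodiments with the same scene and camera so the pair is aligned.
        human_video = render_in_engine(clip.motion, embodiment="human",
                                       scene=clip.scene, camera=clip.camera)
        humanoid_video = render_in_engine(humanoid_motion, embodiment="humanoid",
                                          scene=clip.scene, camera=clip.camera)
        pairs.append((human_video, humanoid_video))
    return pairs


def train_robotizer(pairs, video_model, optimizer, loss_fn):
    """Fine-tune a video generative model to map human videos to humanoid videos."""
    for human_video, humanoid_video in pairs:
        pred = video_model(human_video)          # video-to-video translation
        loss = loss_fn(pred, humanoid_video)     # supervised by the synthetic pair
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The synthetic pairs are what make supervised video-to-video training possible at scale; once trained, the translator can be applied to ordinary, unpaired human videos from the web.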