Robotics
Research papers, repositories, and articles about robotics
Evaluating Gemini Robotics Policies in a Veo World Simulator
Google DeepMind uses a fine-tuned Veo frontier video model as a generative world simulator to evaluate Gemini Robotics manipulation policies across nominal (in-distribution), out-of-distribution, and safety-critical settings, covering OOD generalization axes and physical/semantic safety tests. Validated against 1600+ physical trials, the Veo-based simulation reliably predicts real-world policy rankings and failure modes, so high-fidelity video models can stand in for many expensive hardware experiments while enabling scalable red-teaming and OOD robustness checks. ([huggingface.co](https://huggingface.co/papers/2512.10675))
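For intuition, here is a minimal Python sketch of the evaluate-in-a-generated-world idea: a policy is rolled out against a video model that plays the role of the simulator, and an outcome judge scores the generated rollout. `VideoWorldModel`, `evaluate_policy`, and the judge below are hypothetical placeholders, not the Veo or Gemini Robotics APIs.

```python
"""Sketch: ranking manipulation policies inside a generative video world model.

VideoWorldModel, evaluate_policy, and the judge are hypothetical stand-ins,
not the Veo or Gemini Robotics APIs described in the paper.
"""
from typing import Callable, List
import numpy as np


class VideoWorldModel:
    """Hypothetical wrapper around a video model used as a simulator."""

    def reset(self, scene_prompt: str) -> np.ndarray:
        """Generate an initial observation frame for a text-described scene."""
        return np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder frame

    def step(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        """Predict the next frame conditioned on the current frame and action."""
        return frame  # placeholder: a real model would synthesize new video


def evaluate_policy(policy: Callable[[np.ndarray], np.ndarray],
                    world: VideoWorldModel,
                    scene_prompts: List[str],
                    judge: Callable[[List[np.ndarray]], bool],
                    horizon: int = 50) -> float:
    """Roll the policy out in the generated world and return its success rate."""
    successes = 0
    for prompt in scene_prompts:              # nominal, OOD, or safety-probe scenes
        frame = world.reset(prompt)
        frames = [frame]
        for _ in range(horizon):
            action = policy(frame)            # policy sees only generated pixels
            frame = world.step(frame, action)
            frames.append(frame)
        successes += judge(frames)            # e.g. a VLM or heuristic outcome check
    return successes / len(scene_prompts)


if __name__ == "__main__":
    world = VideoWorldModel()
    prompts = ["pick up the red block",                    # nominal
               "pick up the red block under dim light",    # OOD perturbation
               "a knife lies next to the target object"]   # safety-critical probe
    random_policy = lambda obs: np.random.uniform(-1, 1, size=7)  # 7-DoF arm deltas
    always_fail_judge = lambda frames: False                      # placeholder judge
    print("success rate:", evaluate_policy(random_policy, world, prompts, always_fail_judge))
```

The same loop can be rerun with different policies and prompt sets to compare their rankings, which is the kind of comparison the paper validates against physical trials.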
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Introduces a video-to-video translation framework that turns everyday human-object interaction videos into realistic, physically grounded robot-manipulation videos, trained only on unpaired robot videos rather than curated paired human-robot datasets. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 video diffusion model for temporally coherent, robot-conditioned synthesis, pointing toward scaling robot learning from the vast pool of human videos on the internet. ([huggingface.co](https://huggingface.co/papers/2512.09406))
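A conceptual Python sketch of that translation recipe, assuming segmentation, pose estimation, and inpainting are available as black boxes; `segment_human`, `estimate_end_effector_track`, `inpaint_background`, and `RobotVideoDiffusion` are hypothetical names and do not reflect the paper's actual code or the Wan 2.2 API.

```python
"""Sketch: human-to-robot video translation at a conceptual level.

Every function and class here is a hypothetical placeholder; the real
pipeline's models and the Wan 2.2 fine-tune are not reproduced.
"""
from typing import List
import numpy as np

Frame = np.ndarray  # H x W x 3 uint8


def segment_human(frames: List[Frame]) -> List[np.ndarray]:
    """Per-frame binary masks of the human hand/arm (placeholder)."""
    return [np.zeros(f.shape[:2], dtype=bool) for f in frames]


def estimate_end_effector_track(frames: List[Frame]) -> np.ndarray:
    """Hand keypoints mapped to a robot end-effector trajectory (placeholder)."""
    return np.zeros((len(frames), 7))  # xyz + quaternion per frame


def inpaint_background(frames: List[Frame], masks: List[np.ndarray]) -> List[Frame]:
    """Remove the human from each frame, keeping scene and objects (placeholder)."""
    return [f.copy() for f in frames]


class RobotVideoDiffusion:
    """Stand-in for a video diffusion model fine-tuned on unpaired robot videos."""

    def generate(self, background: List[Frame], pose_track: np.ndarray) -> List[Frame]:
        """Synthesize a robot arm executing the pose track in the inpainted scene."""
        return background  # a real model would render the robot here


def robotize(human_video: List[Frame], model: RobotVideoDiffusion) -> List[Frame]:
    """Human demo video in, physically grounded robot video out; no paired data needed."""
    masks = segment_human(human_video)                # where the human is
    scene = inpaint_background(human_video, masks)    # the human-free scene
    track = estimate_end_effector_track(human_video)  # what the robot should do
    return model.generate(scene, track)               # robot-conditioned video synthesis
```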
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.
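As a rough sketch of the category-agnostic interface (monocular video plus an arbitrary rig in, joint rotations that drive that rig out), the Python below uses hypothetical `Skeleton`, `VideoMotionEncoder`, and `UnifiedMotionDecoder` stand-ins rather than the paper's actual reference-guided, factorized pipeline.

```python
"""Sketch: category-agnostic motion capture as an interface, not the actual model.

Skeleton, VideoMotionEncoder, and UnifiedMotionDecoder are hypothetical names.
"""
from dataclasses import dataclass
from typing import Dict, List
import numpy as np


@dataclass
class Skeleton:
    """A rigged 3D asset: joint names plus rest-pose offsets from each parent."""
    joint_names: List[str]
    parents: List[int]            # parent index per joint (-1 for the root)
    rest_offsets: np.ndarray      # (num_joints, 3)


class VideoMotionEncoder:
    """Stand-in: extracts per-frame motion features from monocular video."""

    def encode(self, frames: List[np.ndarray]) -> np.ndarray:
        return np.zeros((len(frames), 256))  # (T, D) placeholder features


class UnifiedMotionDecoder:
    """Stand-in: decodes motion features into rotations for *this* skeleton."""

    def decode(self, features: np.ndarray, skeleton: Skeleton) -> np.ndarray:
        T, J = features.shape[0], len(skeleton.joint_names)
        quats = np.zeros((T, J, 4))
        quats[..., 0] = 1.0        # identity rotations as a placeholder
        return quats               # (T, num_joints, 4) local joint rotations


def capture(frames: List[np.ndarray], skeleton: Skeleton) -> Dict[str, np.ndarray]:
    """Video + arbitrary rig in, per-joint rotation curves that drive that rig out."""
    features = VideoMotionEncoder().encode(frames)
    quats = UnifiedMotionDecoder().decode(features, skeleton)
    return {name: quats[:, j] for j, name in enumerate(skeleton.joint_names)}
```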
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
X-Humanoid presents a scalable way to "robotize" human videos, translating ordinary human motion into humanoid-robot video. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it converts complex third-person human motions into physically plausible humanoid animations, unlocking web-scale data for embodied AI.
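A hedged Python sketch of the paired-data construction step implied above: the same motion is rendered on a human rig and a humanoid rig to produce frame-aligned clips, which then supervise a video translator. `retarget_to_humanoid`, `render_scene`, and `VideoTranslator` are hypothetical placeholders, not X-Humanoid's Unreal Engine tooling or its model.

```python
"""Sketch: assembling a synthetic paired human->humanoid dataset for supervised
video translation. All names here are hypothetical placeholders.
"""
from typing import List, Tuple
import numpy as np

Clip = List[np.ndarray]  # list of H x W x 3 frames


def retarget_to_humanoid(human_motion: np.ndarray) -> np.ndarray:
    """Map human joint trajectories onto the humanoid's joints (placeholder)."""
    return human_motion.copy()


def render_scene(motion: np.ndarray, rig: str, num_frames: int = 64) -> Clip:
    """Stand-in for rendering a motion clip on a given rig in a game engine."""
    return [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(num_frames)]


def build_pairs(human_motions: List[np.ndarray]) -> List[Tuple[Clip, Clip]]:
    """Render the same motion on a human rig and a humanoid rig for aligned pairs."""
    pairs = []
    for motion in human_motions:
        human_clip = render_scene(motion, rig="human")
        robot_clip = render_scene(retarget_to_humanoid(motion), rig="humanoid")
        pairs.append((human_clip, robot_clip))  # frame-aligned supervision signal
    return pairs


class VideoTranslator:
    """Stand-in for a video generative model fine-tuned on the synthetic pairs."""

    def fit(self, pairs: List[Tuple[Clip, Clip]]) -> "VideoTranslator":
        return self  # a real model would be trained here

    def robotize(self, human_clip: Clip) -> Clip:
        return human_clip  # a real model would synthesize the humanoid video
```

The contrast with H2R-Grounder above is the supervision regime: here synthetic rendering supplies aligned pairs, whereas H2R-Grounder avoids paired data entirely.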