Robotics
Research papers, repositories, and articles about robotics
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Lets coding agents run real robots in a closed loop and continuously improve policies with minimal human babysitting. Robotics groups should treat this as a design template for autonomous labs.
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.
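The reset, run, check, improve cycle described above can be sketched as a plain loop. This is a toy illustration only, not ENPIRE's code: every function here (`reset_scene`, `run_policy`, `check_result`, `propose_revision`) is a hypothetical stand-in, the "robot" is a single number, and the "coding agent" is a random tweak that is kept only when it measurably reduces error.

```python
import random

def reset_scene(rng):
    """Hypothetical scene reset: returns where the object ends up."""
    return rng.uniform(0.9, 1.1)

def run_policy(params, scene):
    """Hypothetical rollout: this toy policy ignores the scene and reaches a fixed distance."""
    return params["reach"]

def check_result(outcome, scene, tol=0.15):
    """Hypothetical checker: returns (success flag, absolute miss distance)."""
    error = abs(outcome - scene)
    return error <= tol, error

def evaluate(params, trials=20, seed=0):
    """Reset, run, and check `trials` times; return (success rate, mean error)."""
    rng = random.Random(seed)  # same seed, so every candidate sees the same scenes
    successes, total_error = 0, 0.0
    for _ in range(trials):
        scene = reset_scene(rng)
        ok, error = check_result(run_policy(params, scene), scene)
        successes += ok
        total_error += error
    return successes / trials, total_error / trials

def propose_revision(params, rng):
    """Stand-in for the coding agent: a random tweak to the one parameter."""
    return {"reach": params["reach"] + rng.uniform(-0.2, 0.2)}

def improvement_loop(params, rounds=30):
    """Keep a proposed revision only if it lowers the mean error."""
    rng = random.Random(1)
    best, (best_rate, best_error) = params, evaluate(params)
    for _ in range(rounds):
        candidate = propose_revision(best, rng)
        rate, error = evaluate(candidate)
        if error < best_error:
            best, best_rate, best_error = candidate, rate, error
    return best, best_rate, best_error
```

Calling `improvement_loop({"reach": 0.3})` returns the best parameters found along with their success rate and mean error. In a real lab the evaluation step is the expensive part, so comparing candidates on a continuous error signal rather than a bare success flag helps the loop make progress when early policies never succeed.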
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA trains a single system to handle many robot tasks, environments, and bodies, instead of maintaining separate models. It shows strong generalization in manipulation, navigation, and trajectory prediction. If you’re running fleets of robots, this points to one shared brain rather than dozens of bespoke ones.
HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Finds that filtered first-person human videos can beat costly robot demonstrations for pretraining. If you’re collecting robot data manually, you should test this cheaper pipeline.
HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Shows that carefully processed first-person human videos can beat expensive teleoperated robot data for pretraining embodied models. If robot data collection is killing your budget, start experimenting with egocentric video pipelines.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper proposes a single model that predicts future world state, plans in language, and outputs robot actions. It uses an autoregressive backbone tied to a "world expert" module for physical dynamics. Think of it as a step toward robots that learn from video and instructions without separate planning stacks.
DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects
Dexterous robot hands learn to move articulated objects by reasoning about contact, not just motion paths. Try this if you’re hitting brittleness in contact-heavy manipulation tasks.
DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects
Introduces a contact-driven control scheme so robot hands move articulated objects by managing physical contact, not replayed trajectories. Useful if you’re training hands for doors, drawers, and other jointed objects without rich tactile sensors.
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring
Hide-and-Seek detects when vision-language-action robots are about to fail, using only trajectory-level labels. It learns which individual actions signal trouble without step-by-step annotation. If you run embodied agents, this is a practical way to catch bad executions before they break hardware.
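One generic way to learn step-level failure signals from trajectory-level labels is multiple-instance learning with max pooling: score every step, let the highest-scoring step stand in for the whole trajectory, and train against the trajectory label. The sketch below illustrates that idea on made-up data. It is not the Hide-and-Seek method; the scalar per-step feature, the synthetic trajectories, and all function names are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_trajectory(rng, fails, length=20):
    """Synthetic data (made up): one scalar feature per step, e.g. action jerk."""
    steps = [rng.gauss(0.0, 0.3) for _ in range(length)]
    bad_step = None
    if fails:
        bad_step = rng.randrange(length)
        steps[bad_step] += 3.0  # one anomalous step hidden in the trajectory
    return steps, bad_step

def step_scores(steps, w, b):
    """Per-step failure score from a one-feature logistic model."""
    return [sigmoid(w * x + b) for x in steps]

def most_suspicious_step(steps, w, b):
    """Max pooling: return (index, score) of the highest-scoring step."""
    scores = step_scores(steps, w, b)
    t = max(range(len(scores)), key=scores.__getitem__)
    return t, scores[t]

def train(data, epochs=50, lr=0.1):
    """SGD on binary cross-entropy between the pooled score and the trajectory label.

    Assumption: w starts positive, i.e. larger feature values are treated as
    more suspicious from the outset.
    """
    w, b = 1.0, 0.0
    for _ in range(epochs):
        for steps, label in data:
            t, score = most_suspicious_step(steps, w, b)
            grad = score - label  # gradient of the loss w.r.t. the pooled logit
            w -= lr * grad * steps[t]
            b -= lr * grad
    return w, b
```

After training on `(steps, label)` pairs where `label` is 1.0 for a failed trajectory and 0.0 otherwise, `most_suspicious_step` can be run on the steps seen so far, with an alarm raised when the returned score crosses a threshold chosen on held-out data.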
Evaluating Gemini Robotics Policies in a Veo World Simulator
Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. Validated against 1600+ physical trials, Veo-based simulation predicts real-world policy rankings and failure modes, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))
Evaluating Gemini Robotics Policies in a Veo World Simulator
Uses a fine-tuned Veo video model as a generative world simulator for robot policy evaluation, covering in-distribution tasks, OOD generalization axes, and physical/semantic safety tests. The key takeaway is that high-fidelity video models can stand in for many expensive real-world trials while still predicting policy rankings and vulnerabilities reliably. ([huggingface.co](https://huggingface.co/papers/2512.10675))
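A common way to check whether a simulator "predicts policy rankings" is a rank correlation between simulated and real success rates. The sketch below computes Spearman's rho in plain Python. It is a generic check, not the paper's evaluation code, and any numbers passed to it in the usage note are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(sim_scores, real_scores):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(sim_scores), average_ranks(real_scores))
```

Pass one simulated score and one real-world score per policy, in the same order, for example `spearman(sim_success_rates, real_success_rates)`. A value near 1 means the simulator orders policies the way hardware does, even if the absolute success rates differ.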
Hallucination in World Models is Predictable and Preventable
Builds a big benchmark with ground-truth simulators to show where visual world models drift from reality. Identifies three failure modes and three simple signals that reliably flag them. If you deploy action-driven world models, you can use these signals as runtime tripwires. ([huggingface.co](https://huggingface.co/papers/2606.27326))
Playful Agentic Robot Learning
Robots practice through playful exploration, then reuse those skills for real tasks. If you script every task by hand, this points to a cheaper, more scalable path.
Playful Agentic Robot Learning
Robots learn skills by self-directed play, then reuse those skills to solve real tasks with little extra training. If you run embodied agents, steal this curriculum idea to cut hand-designed task scripts.
RobotValues: Evaluating Household Robots When Human Values Conflict
RobotValues throws household robots into 10,000 situations where human values clash, like privacy versus safety. Vision-language models often default to their own value preferences and fail 80% of the time when told to prioritize a different value. Use this benchmark if you’re serious about value-sensitive robot behavior, not just task success.
commaai/openpilot
Open-source driving stack that turns supported cars into level-2+ driver-assist systems. Strong example of long-running applied robotics with heavy ML. Study how they ship safety-critical updates in the open.
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
AffordanceVLA bakes an understanding of "what you can do with this object" directly into a robot’s vision-language-action model. It predicts affordances first, then uses them to guide actions. This is a concrete recipe if you want robots that manipulate objects reliably instead of just describing them.
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
Builds a spatial “tool-using” agent that lets vision-language models maintain a live 3D picture of the world over time. Use this if your agents constantly forget what’s where in multi-view or video settings.
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
Gives agents a toolbox for understanding changing 3D scenes across views and time. Use this if your vision agents lose track of objects once the camera moves.
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Depth Any Panoramas builds a single model for depth on 360° indoor and outdoor scenes. Robotics and AR teams can reuse this instead of training per-dataset depth nets.
Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
Posterior Behavioral Cloning shows how the way you pretrain policies can make downstream reinforcement learning far cheaper. Robotics teams can adopt this to cut expensive environment time.
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
H2R-Grounder uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))
DIMOS: Disentangling Instance-level Moving Object Segmentation
DIMOS fuses event camera streams with RGB frames and explicitly separates motion from appearance features in both modalities. That dual disentangling plus cross-modal alignment sharply improves moving-object segmentation, especially for small or fast targets in low light.
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))
Future Optical Flow Prediction Improves Robot Control & Video Generation
FOFPred trains a language-conditioned vision model to forecast dense motion fields in video. That single model then boosts both robot control and text-guided video generation in downstream tasks.
SceneDiff: A Benchmark and Method for Multiview Object Change Detection
SceneDiff provides a new benchmark and a strong baseline for detecting object changes across views and time. Useful for robots that must notice what actually moved, not just that the viewpoint shifted.
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.
Visual Place Recognition in Forests with Depth-Aware Distillation
The authors improve visual place recognition in forests by distilling depth cues into a DINOv2 descriptor. Their lightweight model keeps the original embedding space but becomes more robust to lighting and seasonal changes on a recent forest benchmark.
pollen-robotics/reachy_mini
SDK for the Reachy Mini robot, giving Python hooks into perception and control. It’s a practical playground for connecting language models to real robot hardware. ([github.com](https://github.com/trending))
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
X-Humanoid presents a scalable way to "robotize" human videos, turning ordinary human motion into humanoid-robot video. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it translates complex third-person human motions into physically plausible humanoid animations, opening up web-scale data for embodied AI.