Robotics

Research papers, repositories, and articles about robotics

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Lets coding agents run real robots in a closed loop and continuously improve policies with minimal human babysitting. Robotics groups should treat this as a design template for autonomous labs.

Wenli Xiao, Jia Xie

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.

Wenli Xiao, Jia Xie

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA trains a single system to handle many robot tasks, environments, and bodies, instead of maintaining separate models. It shows strong generalization in manipulation, navigation, and trajectory prediction. If you’re running fleets of robots, this points to one shared brain rather than dozens of bespoke ones.

Qiuyue Wang, Mingsheng Li

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Finds that filtered first-person human videos can beat costly robot demonstrations for pretraining. If you’re collecting robot data manually, you should test this cheaper pipeline.

Juncheng Ma, Jianxin Bi

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Shows that carefully processed first-person human videos can beat expensive teleoperated robot data for pretraining embodied models. If robot data collection is killing your budget, start experimenting with egocentric video pipelines.

Juncheng Ma, Jianxin Bi

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

This paper proposes a single model that predicts future world state, plans in language, and outputs robot actions. It uses an autoregressive backbone tied to a "world expert" module for physical dynamics. Think of it as a step toward robots that learn from video and instructions without separate planning stacks.

Yi Yang, Zhihong Liu

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Dexterous robot hands learn to move articulated objects by reasoning about contact, not just motion paths. Try this if you’re hitting brittleness in contact-heavy manipulation tasks.

Tianshan Zhang, Yijia Duan

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Introduces a contact-driven control scheme so robot hands move articulated objects by managing physical contact, not replayed trajectories. Useful if you’re training hands for doors, drawers, and other jointed objects without rich tactile sensors.

Tianshan Zhang, Yijia Duan

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Hide-and-Seek detects when vision-language-action robots are about to fail, using only trajectory-level labels. It learns which individual actions signal trouble without step-by-step annotation. If you run embodied agents, this is a practical way to catch bad executions before they break hardware.

Seongheon Park, Wendi Li

Evaluating Gemini Robotics Policies in a Veo World Simulator

Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. Validating against 1600+ physical trials, they show that Veo-based simulation can predict real-world policy rankings and failure modes, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin

Evaluating Gemini Robotics Policies in a Veo World Simulator

Uses a fine-tuned Veo video model as a generative world simulator for robot policy evaluation, covering in-distribution tasks, OOD generalization axes, and physical/semantic safety tests. The key takeaway is that high-fidelity video models can stand in for many expensive real-world trials while still predicting policy rankings and vulnerabilities reliably. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin

Hallucination in World Models is Predictable and Preventable

Builds a big benchmark with ground-truth simulators to show where visual world models drift from reality. Identifies three failure modes and three simple signals that reliably flag them. If you deploy action-driven world models, you can use these signals as runtime tripwires. ([huggingface.co](https://huggingface.co/papers/2606.27326))

Nicklas Hansen, Xiaolong Wang

Playful Agentic Robot Learning

Robots practice through playful exploration, then reuse those skills for real tasks. If you script every task by hand, this points to a cheaper, more scalable path.

Junyi Zhang, Jiaxin Ge

Playful Agentic Robot Learning

Robots learn skills by self-directed play, then reuse those skills to solve real tasks with little extra training. If you run embodied agents, steal this curriculum idea to cut hand-designed task scripts.

Junyi Zhang, Jiaxin Ge

RobotValues: Evaluating Household Robots When Human Values Conflict

RobotValues throws household robots into 10,000 situations where human values clash, like privacy versus safety. Vision-language models often default to their own value preferences and fail 80% of the time when told to prioritize a different value. Use this benchmark if you’re serious about value-sensitive robot behavior, not just task success.

Jongwook Han, Hyeongjin Kim

commaai/openpilot

Open-source driving stack that turns supported cars into level-2+ driver-assist systems. Strong example of long-running applied robotics with heavy ML. Study how they ship safety-critical updates in the open.

62,369 stars

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AffordanceVLA bakes an understanding of "what you can do with this object" directly into a robot’s vision-language-action model. It predicts affordances first, then uses them to guide actions. This is a concrete recipe if you want robots that manipulate objects reliably instead of just describing them.

Qize Yu, Jiadi You

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Builds a spatial “tool-using” agent that lets vision-language models maintain a live 3D picture of the world over time. Use this if your agents constantly forget what’s where in multi-view or video settings.

Yalun Dai, Hao Li

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Gives agents a toolbox for understanding changing 3D scenes across views and time. Use this if your vision agents lose track of objects once the camera moves.

Yalun Dai, Hao Li

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Depth Any Panoramas builds a single model for depth on 360° indoor and outdoor scenes. Robotics and AR teams can reuse this instead of training per-dataset depth nets.

Xin Lin, Meixi Song

Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning

Posterior Behavioral Cloning shows how the way you pretrain policies can make downstream reinforcement learning far cheaper. Robotics teams can adopt this to cut expensive environment time.

Andrew Wagenmaker, Perry Dong

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

H2R-Grounder uses unpaired training to translate human interaction videos into realistic robot manipulation videos via video diffusion. It’s notable because it points to scaling robot learning from the vast pool of human internet videos without curating large paired robot datasets. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

DIMOS: Disentangling Instance-level Moving Object Segmentation

DIMOS fuses event camera streams with RGB frames and explicitly separates motion from appearance features in both modalities. That dual disentangling plus cross-modal alignment sharply improves moving-object segmentation, especially for small or fast targets in low light.

Hongxiang Huang, Hongwei Ren

H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Introduces a video-to-video translation framework that turns everyday human-object interaction videos into robot-manipulation videos with realistic, physically grounded motion, trained only on unpaired robot videos. The method uses inpainting and pose cues to bridge the embodiment gap and fine-tunes a Wan 2.2 diffusion model for temporally coherent, robot-conditional video synthesis. ([huggingface.co](https://huggingface.co/papers/2512.09406))

Hai Ci, Xiaokang Liu

Future Optical Flow Prediction Improves Robot Control & Video Generation

FOFPred trains a language-conditioned vision model to forecast dense motion fields in video. That single model then boosts both robot control and text-guided video generation in downstream tasks.

Kanchana Ranasinghe, Honglu Zhou

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

SceneDiff gives a new benchmark and a strong baseline for detecting object changes across views and time. Useful for robots that must notice what actually moved, not just viewpoint shifts.

Yuqun Wu, Chih-hao Lin

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

MoCapAnything defines Category-Agnostic Motion Capture: given a monocular video and any rigged 3D asset, reconstruct motions that directly drive that specific skeleton. Using a reference-guided, factorized pipeline with a unified motion decoder and a curated Truebones Zoo dataset, it delivers high-quality animations and cross-species retargeting, making video-driven motion capture much more flexible for arbitrary 3D assets.

Kehong Gong, Zhengyu Wen

Visual Place Recognition in Forests with Depth-Aware Distillation

The authors improve visual place recognition in forests by distilling depth cues into a DINOv2 descriptor. Their lightweight model keeps the original embedding space but becomes more robust to lighting and seasonal changes on a recent forest benchmark.

Walter Nedov, Saimunur Rahman

pollen-robotics/reachy_mini

SDK for the Reachy Mini robot, giving Python hooks into perception and control. It’s a practical playground for connecting language models to a real-world robot. ([github.com](https://github.com/trending))

486 stars

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

X-Humanoid presents a scalable way to "robotize" human videos, turning ordinary human motion into humanoid-robot video at scale. By adapting a powerful video generative model and building a large synthetic paired dataset in Unreal Engine, it can translate complex third-person human motions into physically plausible humanoid animations, unlocking web-scale data for embodied AI.

Pei Yang, Hai Ci