arXiv Paper

Evaluating Gemini Robotics Policies in a Veo World Simulator

Gemini Robotics Team, Coline Devin, Yilun Du +18

December 11, 2025

Summary

Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies in nominal, out-of-distribution (OOD), and safety-critical settings. Validated against 1600+ physical trials, the Veo-based simulation predicts real-world policy rankings and failure modes, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Topics

robotics video-models evaluation safety

Related Content

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.

SynthID Detector: Identify content made with Google's AI tools

Google announces SynthID Detector, a web portal where you upload images, audio, video, or text generated with Google AI tools; it automatically checks for imperceptible SynthID watermarks and highlights which parts of the content are likely watermarked. For developers and media teams, it is a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

This report compares seven frontier language and vision models across many safety tests, from basic benchmarks to adversarial red-teaming. It finds GPT-5.2 clearly safest overall while others trade off safety across languages, modalities, and threat models.