Safety

Research papers, repositories, and articles about safety


OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Zijian Wu, Lingkai Kong

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.

Songyang Gao, Yuzhe Gu

SynthID Detector: Identify content made with Google's AI tools

Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))

Google AI Blog

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.

Asa Cooper Stickland, Jan Michelfeit

Evaluating Gemini Robotics Policies in a Veo World Simulator

Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. They show Veo-based simulation can predict real-world policy rankings and failure modes via 1600+ physical trials, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin

Reverse Thinking Enhances Missing Information Detection in Large Language Models

Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))

Yuxin Liu, Chaojie Gu

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.

Songqiao Hu, Zeyi Liu

BEAVER: An Efficient Deterministic LLM Verifier

BEAVER is a deterministic verifier for large language models that computes tight, provably sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness-, privacy-, and security-critical applications.

Tarun Suresh, Nalin Wadhwa
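
To make the idea in the BEAVER summary concrete, here is a minimal sketch of deterministic bounding by prefix exploration. It is not BEAVER's actual algorithm or data structures, which the summary does not describe. It assumes a hypothetical toy next-token distribution (`toy_next_token_probs`) standing in for an LLM, and a hypothetical prefix-closed constraint (`no_double_b`), meaning that once a prefix violates it, every extension does too. Under those assumptions, the mass of finished, compliant sequences is a sound lower bound, and adding the mass of still-unexplored compliant prefixes gives a sound upper bound.

```python
import heapq
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"
Prefix = Tuple[str, ...]


def toy_next_token_probs(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM's next-token distribution."""
    if len(prefix) >= 6:  # force termination so exploration is finite
        return {EOS: 1.0}
    if prefix and prefix[-1] == "b":
        return {"a": 0.5, "b": 0.2, EOS: 0.3}
    return {"a": 0.4, "b": 0.4, EOS: 0.2}


def no_double_b(prefix: Prefix) -> bool:
    """Hypothetical prefix-closed constraint: no two consecutive 'b' tokens."""
    return all(not (x == "b" and y == "b") for x, y in zip(prefix, prefix[1:]))


def bound_satisfaction_probability(
    next_probs: Callable[[Prefix], Dict[str, float]],
    is_safe_prefix: Callable[[Prefix], bool],
    max_expansions: int,
) -> Tuple[float, float]:
    """Return (lower, upper) bounds on P(finished sequence satisfies constraint)."""
    lower = 0.0
    # Max-heap (via negated mass) of compliant prefixes not yet expanded.
    frontier: List[Tuple[float, Prefix]] = [(-1.0, ())]
    for _ in range(max_expansions):
        if not frontier:
            break  # everything resolved: the bounds coincide
        neg_mass, prefix = heapq.heappop(frontier)
        mass = -neg_mass
        for token, p in next_probs(prefix).items():
            child_mass = mass * p
            if child_mass == 0.0:
                continue
            if token == EOS:
                # The prefix was already compliant, so this finished
                # sequence counts toward the lower bound.
                lower += child_mass
                continue
            child = prefix + (token,)
            if is_safe_prefix(child):
                heapq.heappush(frontier, (-child_mass, child))
            # Otherwise the child violates the constraint; its mass is
            # dropped and can never count as satisfying.
    upper = lower + sum(-m for m, _ in frontier)
    return lower, upper


for budget in (0, 5, 50, 1000):
    lo, hi = bound_satisfaction_probability(toy_next_token_probs, no_double_b, budget)
    print(f"expansions={budget:4d}  lower={lo:.4f}  upper={hi:.4f}")
```

Expanding the highest-mass prefix first is one simple design choice for shrinking the gap quickly; with a larger expansion budget the two bounds tighten, and in this toy setup they meet once every compliant prefix has been resolved. Unlike a sampling estimate, the interval holds with certainty at every budget.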

Exploring model welfare

Anthropic’s post on model welfare argues that as AI systems become more capable and agentic, we may eventually need to consider their potential consciousness, preferences, and suffering, and it announces a research program to explore these questions. For developers, it’s an early warning that future alignment and deployment practices (such as training setups, evaluation methods, or deprecation policies) might incorporate welfare constraints in addition to traditional safety metrics. ([anthropic.com](https://www.anthropic.com/research/exploring-model-welfare))

Anthropic