Safety
Research papers, repositories, and articles about safety
Showing 9 of 9 items
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.
SynthID Detector: Identify content made with Google's AI tools
Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-&-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.
Evaluating Gemini Robotics Policies in a Veo World Simulator
Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. They show Veo-based simulation can predict real-world policy rankings and failure modes via 1600+ physical trials, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))
Reverse Thinking Enhances Missing Information Detection in Large Language Models
Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.
BEAVER: An Efficient Deterministic LLM Verifier
BEAVER is a deterministic verifier for large language models that computes tight, provably-sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness, privacy, and security-critical applications.
Exploring model welfare
Anthropic’s model welfare post argues that as AI systems become more capable and agentic, we may eventually need to consider their potential consciousness, preferences, and suffering, and launches a research program to explore these questions. For developers, it’s an early warning that future alignment and deployment practices—like training setups, evaluation methods, or deprecation policies—might incorporate welfare constraints in addition to traditional safety metrics. ([anthropic.com](https://www.anthropic.com/research/exploring-model-welfare))