Safety

Research papers, repositories, and articles about safety

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.

Zijian Wu, Lingkai Kong

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system sets a new state of the art on a held-out benchmark while reducing annotation cost.

Songyang Gao, Yuzhe Gu

SynthID Detector: Identify content made with Google's AI tools

Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))

Google AI Blog

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

This report compares seven frontier language and vision models across many safety tests, from basic benchmarks to adversarial red-teaming. It finds GPT-5.2 clearly safest overall while others trade off safety across languages, modalities, and threat models.

Xingjun Ma, Yixu Wang

Partnering with Mozilla to improve Firefox’s security

Anthropic used Claude Opus 4.6 to scan Firefox’s code and surfaced 22 new vulnerabilities, 14 rated high severity. The post lays out a playbook for pairing AI bug hunters with human maintainers safely.

Anthropic Newsroom

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

The authors design a reward scheme that scores agents on how well they build evidence chains with proper citations, not just final answers. Their new training method reduces shortcut tricks and hallucinated claims, so deep research agents behave more like careful analysts.

Jiajie Zhang, Xin Lv

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Spider-Sense bakes a threat detector into the agent itself, so it only runs heavy safety checks when it senses risk. It keeps attack success low and false positives rare while adding little delay.

Zhenxiong Yu, Zhi Yang

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.

Asa Cooper Stickland, Jan Michelfeit

LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.

Yilin Xiao, Jin Chen

Verbalizing LLMs' assumptions to explain and control sycophancy

Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.

Myra Cheng, Isabel Sieh

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

CAR-bench builds an in-car assistant world with messy, ambiguous user requests and many tools. It measures not just if agents finish tasks, but whether they know when they’re out of their depth.

Johannes Kirmayr, Lukas Stappen

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

ToolSafe builds a guardrail model that watches each tool call an agent plans to make and flags dangerous ones before they run. In tool-using agents under prompt-injection attacks, it slashes harmful calls while slightly improving task success.

Yutao Mou, Zhangchi Xue

ART: Adaptive Reasoning Trees for Explainable Claim Verification

ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.

Sahil Wadhwa, Himanshu Kumar

Adaptation of Agentic AI

This large-scale study tracks how agent-like AI systems adapt over time and across tasks. If you're betting on agents, it gives structure and warnings for long-term deployment.

Pengcheng Jiang, Jiacheng Lin

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

AuditDM trains an "auditor" model that hunts for cases where strong vision-language models disagree. Teams can reuse these hard examples to patch weaknesses without manual labeling.

Qihao Liu, Chengzhi Mao

Evaluating Gemini Robotics Policies in a Veo World Simulator

Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. Validating against 1600+ physical trials, they show that Veo-based simulation can predict real-world policy rankings and failure modes, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))

Gemini Robotics Team, Coline Devin

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.

Shuo Nie, Hexuan Deng

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.

Parisa Rabbani, Priyam Sahoo

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

BAPO trains search-based agents not just to answer, but to know when to say "I don't know". It adds special rewards that encourage honest uncertainty without letting agents abuse that response to duck work.

Shiyu Liu, Yongjing Yin

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093?utm_source=openai))

Ivan Bercovich

CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

CTHA adds formal communication contracts and authority limits between fast and slow agent layers. That stabilizes multi-level agent stacks and sharply reduces cascades of bad decisions.

Percy Jardine

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Hugging Face surfaces AuditDM as a practical recipe for stress-testing multiple models at once. Use it to decide where smaller, cheaper models can safely replace bigger ones.

Qihao Liu, Chengzhi Mao

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.

Changcheng Li, Jiancan Wu

Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models

Crisis-Bench drops models into simulated multi-day corporate crises and scores them on stock-price outcomes and public sentiment. It exposes when models act like blunt truth-tellers versus savvy spokespeople, giving companies a way to test PR-style agent behavior before deployment.

Cooper Lin, Maohao Ran

Reverse Thinking Enhances Missing Information Detection in Large Language Models

Shows that guiding LLMs through a reverse-thinking framework (reasoning backward from required conditions) substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))

Yuxin Liu, Chaojie Gu

Characterizing the Consistency of the Emergent Misalignment Persona

Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))

Anietta Weckauff, Yuchen Zhang

The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models

This paper shows that giving models medical “doctor” personas helps on some emergency tasks but can hurt performance in primary care settings. Teams using personas for safety or expertise should test them task by task instead of assuming they always help.

Tassallah Abdullahi, Shrestha Ghosh

Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models

The authors probe how models like Llama 3.1, Qwen 2.5, and Mistral internally represent human trust signals in text. They show specific attention heads reliably track fairness, certainty, and accountability cues, which you can exploit to design more trustworthy systems.

Gerard Yeo, Svetlana Churina

Conformity and Social Impact on AI Agents

Researchers adapt classic social-psychology experiments to AI agents and find they also conform to group pressure. Even strong models can be pushed into wrong answers by coordinated peers, which raises real worries for multi-agent deployments and information ecosystems.

Alessandro Bellina, Giordano De Marzo

Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.

Haomin Zhuang, Hojun Yoo

How Good is Post-Hoc Watermarking With Language Model Rephrasing?

The authors study adding watermarks after the fact by having a model rewrite existing text while embedding tracking signals. They map how beam search, sampling tricks, and model size trade off between detection strength and text quality, and show watermarks work better on prose than on verifiable code. ([ar5iv.org](https://ar5iv.org/abs/2512.16904))

Pierre Fernandez, Tom Sander

Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core

This paper experiments with aggressively "forgetting" facts while preserving reasoning ability in a small Qwen model. The model loses targeted knowledge yet starts to lean harder on explicit reasoning steps.

Mengmeng Peng, Zhenyu Fang

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.

Songqiao Hu, Zeyi Liu

Effects of personality steering on cooperative behavior in Large Language Model agents

The authors test how adding human-like personality traits changes how AI agents cooperate in repeated Prisoner’s Dilemma games. They find agreeableness boosts cooperation but can also make agents easier to exploit, warning that persona dials act as soft biases, not hard controls.

Mizuki Sakai, Mizuki Yokoyama

BEAVER: An Efficient Deterministic LLM Verifier

BEAVER is a deterministic verifier for large language models that computes tight, provably sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness, privacy, and security-critical applications.

Tarun Suresh, Nalin Wadhwa

Introducing Microsoft innovations and programs to support AI-powered teaching and learning

Microsoft announces new tools and guidance for using AI safely in schools, plus security and AI playbooks for education leaders. If you run an institution, this is a concrete starting kit.

Microsoft Education Blog