Safety
Research papers, repositories, and articles about safety
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and by training OPV via iterative active learning and RLVR, the system sets a new state of the art on a held-out benchmark while reducing annotation cost.
SynthID Detector: Identify content made with Google's AI tools
Google announces SynthID Detector, a web portal that lets you upload images, audio, video, or text generated with Google AI tools and automatically checks for imperceptible SynthID watermarks, highlighting which parts of the content are likely watermarked. For developers and media teams, it’s a turnkey authenticity check for content produced with models like Gemini, Imagen, Lyria, and Veo, designed to plug into editorial and trust-and-safety workflows. ([blog.google](https://blog.google/technology/ai/google-synthid-ai-content-detector/))
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
This report compares seven frontier language and vision models across many safety tests, from basic benchmarks to adversarial red-teaming. It finds GPT-5.2 clearly safest overall while others trade off safety across languages, modalities, and threat models.
Partnering with Mozilla to improve Firefox’s security
Anthropic used Claude Opus 4.6 to scan Firefox’s code and surfaced 22 new vulnerabilities, 14 rated high severity. The post lays out a playbook for pairing AI bug hunters with human maintainers safely.
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
The authors design a reward scheme that scores agents on how well they build evidence chains with proper citations, not just final answers. Their new training method reduces shortcut tricks and hallucinated claims, so deep research agents behave more like careful analysts.
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
Spider-Sense bakes a threat detector into the agent itself, so it only runs heavy safety checks when it senses risk. It keeps attack success low and false positives rare while adding little delay.
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.
LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation
Shows how to poison graph-structured knowledge used by retrieval-augmented systems. Focuses on attacks that subtly flip logical conclusions, not just surface facts.
Verbalizing LLMs' assumptions to explain and control sycophancy
Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
CAR-bench builds an in-car assistant world with messy, ambiguous user requests and many tools. It measures not just whether agents finish tasks, but also whether they know when they’re out of their depth.
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
ToolSafe builds a guardrail model that watches each tool call an agent plans to make and flags dangerous ones before they run. In tool-using agents under prompt-injection attacks, it slashes harmful calls while slightly improving task success.
ART: Adaptive Reasoning Trees for Explainable Claim Verification
ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.
Adaptation of Agentic AI
This large-scale study tracks how agent-like AI systems adapt over time and across tasks. If you're betting on agents, it gives structure and warnings for long-term deployment.
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
AuditDM trains an "auditor" model that hunts for cases where strong vision-language models disagree. Teams can reuse these hard examples to patch weaknesses without manual labeling.
Evaluating Gemini Robotics Policies in a Veo World Simulator
Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. They show that Veo-based simulation can predict real-world policy rankings and failure modes, validated against 1600+ physical trials, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))
Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.
DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.
BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
BAPO trains search-based agents not just to answer, but to know when to say "I don't know". It adds special rewards that encourage honest uncertainty without letting agents abuse that response to duck work.
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093))
CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems
CTHA adds formal communication contracts and authority limits between fast and slow agent layers. That stabilizes multi-level agent stacks and sharply reduces cascades of bad decisions.
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Hugging Face surfaces AuditDM as a practical recipe for stress-testing multiple models at once. Use it to decide where smaller, cheaper models can safely replace bigger ones.
Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Lets models estimate their confidence before writing full answers. That enables routing hard questions to stronger models and skipping easy ones to save money.
Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models
Crisis-Bench drops models into simulated multi-day corporate crises and scores them on stock-price outcomes and public sentiment. It exposes when models act like blunt truth-tellers versus savvy spokespeople, giving companies a way to test PR-style agent behavior before deployment.
Reverse Thinking Enhances Missing Information Detection in Large Language Models
Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))
The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models
This paper shows that giving models medical “doctor” personas helps on some emergency tasks but can hurt performance in primary care settings. Teams using personas for safety or expertise should test them task by task instead of assuming they always help.
Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models
The authors probe how models like Llama 3.1, Qwen 2.5, and Mistral internally represent human trust signals in text. They show specific attention heads reliably track fairness, certainty, and accountability cues, which you can exploit to design more trustworthy systems.
Conformity and Social Impact on AI Agents
Researchers adapt classic social-psychology experiments to AI agents and find they also conform to group pressure. Even strong models can be pushed into wrong answers by coordinated peers, which raises real worries for multi-agent deployments and information ecosystems.
Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.
How Good is Post-Hoc Watermarking With Language Model Rephrasing?
The authors study adding watermarks after the fact by having a model rewrite existing text while embedding tracking signals. They map how beam search, sampling tricks, and model size trade off between detection strength and text quality, and show watermarks work better on prose than on verifiable code. ([ar5iv.org](https://ar5iv.org/abs/2512.16904))
Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core
This paper experiments with aggressively "forgetting" facts while preserving reasoning ability in a small Qwen model. The model loses targeted knowledge yet starts to lean harder on explicit reasoning steps.
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Introduces a safety constraint layer that can be bolted onto vision-language-action (VLA) models to filter unsafe actions before execution. Rather than retraining the whole control stack, VLSA learns a lightweight safety module that reasons jointly over visual context, language goals, and proposed actions. This aligns with the growing push for ‘safety shields’ around otherwise capable but unaligned agents.
Effects of personality steering on cooperative behavior in Large Language Model agents
The authors test how adding human-like personality traits changes how AI agents cooperate in repeated Prisoner’s Dilemma games. They find agreeableness boosts cooperation but can also make agents easier to exploit, warning that persona dials act as soft biases, not hard controls.
BEAVER: An Efficient Deterministic LLM Verifier
BEAVER is a deterministic verifier for large language models that computes tight, provably sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness-, privacy-, and security-critical applications.
Introducing Microsoft innovations and programs to support AI-powered teaching and learning
Microsoft announces new tools and guidance for using AI safely in schools, plus security and AI playbooks for education leaders. If you run an institution, this is a concrete starting kit.