Alignment & Safety
Interpretability, constitutional AI, red teaming, and ensuring beneficial AGI. Making sure AI systems remain helpful, honest, and harmless.
Recent Papers
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Taekyung Ki, Sangwon Jang, Jaehyeong Jo +2 more
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma +2 more
Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
Nikhil Prakash, Donghao Ren, Dominik Moritz +1 more
How Good is Post-Hoc Watermarking With Language Model Rephrasing?
Pierre Fernandez, Tom Sander, Hady Elsahar +6 more
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu, Chengzhi Mao, Yaojie Liu +2 more
Adaptation of Agentic AI
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi +15 more
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Asa Cooper Stickland, Jan Michelfeit, Arathi Mani +6 more
Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection
Zihui Zhao, Zechang Li
SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning
Emre Can Acikgoz, Jinoh Oh, Jie Hao +7 more
Recent Milestones
New toolkit scores genAI dataset provenance
On December 31, 2025, Quantum Zeitgeist reported on a new compliance rating framework and open-source Python library for assessing data provenance in generative AI training datasets. The work, led by researchers at Imperial College London, aims to track origin, licensing and ethical safeguards as AI datasets scale exponentially.
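The report does not detail the library's interface, so the following is only a hedged illustration of the idea: a minimal Python sketch of a dataset provenance record and a weighted compliance score covering origin, licensing, and ethical safeguards. The class, field, and function names (ProvenanceRecord, compliance_score) and the weights are hypothetical assumptions for this sketch, not the actual API of the Imperial College toolkit.

```python
# Hypothetical sketch of a dataset-provenance compliance check.
# Names, fields, and weights are illustrative assumptions, not the toolkit's API.
from dataclasses import dataclass


@dataclass
class ProvenanceRecord:
    source_url: str                  # where the data was collected from
    license_known: bool              # is the license documented?
    license_permits_training: bool   # does the license allow model training?
    consent_documented: bool         # were contributors informed / consenting?
    pii_screened: bool               # was the data screened for personal data?


def compliance_score(record: ProvenanceRecord) -> float:
    """Return a 0-1 score weighting origin, licensing, and ethical safeguards."""
    checks = [
        (bool(record.source_url), 0.20),           # origin is traceable
        (record.license_known, 0.20),              # licensing is documented
        (record.license_permits_training, 0.30),   # licensing permits the use
        (record.consent_documented, 0.15),         # safeguard: consent recorded
        (record.pii_screened, 0.15),               # safeguard: PII screening done
    ]
    return sum(weight for passed, weight in checks if passed)


if __name__ == "__main__":
    record = ProvenanceRecord(
        source_url="https://example.org/corpus",
        license_known=True,
        license_permits_training=False,
        consent_documented=True,
        pii_screened=True,
    )
    print(f"Compliance score: {compliance_score(record):.2f}")  # 0.70
```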
OpenAI creates high-stakes Head of Preparedness role
OpenAI has posted a senior "Head of Preparedness" role responsible for evaluating and mitigating severe risks from its frontier AI systems, with reported compensation of around $555,000 plus equity ([businessinsider.com](https://www.businessinsider.com/challenges-of-openai-head-of-preparedness-role-2025-12)). External reporting on December 29, 2025 noted CEO Sam Altman's warning that the job will be stressful but critical, covering threat models and mitigations for cybersecurity, biosecurity, and mental-health risks.
Anthropic RSP Framework Updated
Anthropic updates its Responsible Scaling Policy, adding AI Safety Level 3 (ASL-3) deployment and security safeguards.
EU AI Act Enters into Force
EU AI Act becomes law, establishing a comprehensive, risk-based regulatory framework for AI across the EU.
OpenAI Superalignment Team Dissolution
Superalignment co-leads Ilya Sutskever and Jan Leike depart OpenAI and the team is disbanded, raising concerns about safety prioritization.
US AI Safety Institute Established
US establishes AI Safety Institute under NIST for AI safety standards and testing.
UK AI Safety Institute Launch
UK establishes the AI Safety Institute (AISI), announced alongside the Bletchley Park AI Safety Summit, to conduct frontier AI safety research.