Alignment

Research papers, repositories, and articles about alignment

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

AgentDoG 1.5 is a small family of guardrail models trained on a detailed agent-risk taxonomy with surprisingly few samples. The models sit in front of powerful agents, flag dangerous actions, and run cheaply. If you build tool-using agents, this is a practical safety baseline to copy or test against.

Dongrui Liu, Yu Li

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Argues you can reuse the policy and reference from RL post-training to define a "progress advantage" signal instead of training a separate process reward model. This gives dense step-wise scores for agents while avoiding another fragile model in the loop. If you're drowning in reward-model complexity, this suggests a cheaper alignment path. ([huggingface.co](https://huggingface.co/papers/2606.26080))

Changdae Oh, Wendi Li
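
To make the "reuse the policy and reference" idea from the entry above concrete, here is a minimal sketch of one way a dense step-wise score could be built from two sets of log-probabilities. The blurb does not give the paper's actual definition, so everything here is an assumption: the score is a plain policy-to-reference log-ratio summed over each step's tokens, and `step_log_ratio_scores`, `step_boundaries`, and `beta` are hypothetical names.

```python
from typing import List, Tuple


def step_log_ratio_scores(
    policy_logps: List[float],
    ref_logps: List[float],
    step_boundaries: List[Tuple[int, int]],
    beta: float = 1.0,
) -> List[float]:
    """Score each agent step by the summed token log-ratio between the
    post-trained policy and its reference model.

    Assumption: this log-ratio form is an illustrative stand-in, not the
    paper's confirmed "progress advantage" formula.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    scores = []
    for start, end in step_boundaries:
        # Sum of log pi(token) - log pi_ref(token) over tokens [start, end)
        ratio = sum(p - r for p, r in zip(policy_logps[start:end], ref_logps[start:end]))
        scores.append(beta * ratio)
    return scores


# Hypothetical per-token log-probs for a two-step trajectory
policy = [-1.0, -0.5, -2.0, -0.25]
reference = [-1.5, -0.5, -1.0, -0.75]
print(step_log_ratio_scores(policy, reference, [(0, 2), (2, 4)]))
```

The point of the sketch is only that no third model appears: both inputs already exist after RL post-training.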

Verbalizing LLMs' assumptions to explain and control sycophancy

Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.

Myra Cheng, Isabel Sieh

ART: Adaptive Reasoning Trees for Explainable Claim Verification

ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.

Sahil Wadhwa, Himanshu Kumar

Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

Builds on Direct Preference Optimization (DPO) but tackles its weak learning signal when the preferred and rejected responses share similar flaws. RPO adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing on them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids full RLHF/RLAIF complexity.

Zihui Zhao, Zechang Li
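
For readers who want the baseline that the RPO entry above builds on, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It does not implement RPO's hint-guided reflection step; the function names and the example log-probabilities are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities under the policy being
    trained and under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits)
    return softplus(-logits)


# Equal margins: the loss sits at log(2), the value for an uninformative pair
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# The policy favors the chosen response more than the reference does: lower loss
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss depends only on the gap between the two margins, which is why the contrast within each preference pair matters.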

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.

Parisa Rabbani, Priyam Sahoo

RobotValues: Evaluating Household Robots When Human Values Conflict

RobotValues throws household robots into 10,000 situations where human values clash, like privacy versus safety. Vision-language models often default to their own value preferences and fail 80% of the time when told to prioritize a different value. Use this benchmark if you’re serious about value-sensitive robot behavior, not just task success.

Jongwook Han, Hyeongjin Kim

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.

Peter Holderrieth, Douglas Chen

Characterizing the Consistency of the Emergent Misalignment Persona

Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))

Anietta Weckauff, Yuchen Zhang

The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models

This paper shows that giving models medical “doctor” personas helps on some emergency tasks but can hurt performance in primary care settings. Teams using personas for safety or expertise should test them task by task instead of assuming they always help.

Tassallah Abdullahi, Shrestha Ghosh

Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.

Haomin Zhuang, Hojun Yoo

How Good is Post-Hoc Watermarking With Language Model Rephrasing?

The authors study adding watermarks after the fact by having a model rewrite existing text while embedding tracking signals. They map how beam search, sampling tricks, and model size trade off between detection strength and text quality, and show watermarks work better on prose than on verifiable code. ([ar5iv.org](https://ar5iv.org/abs/2512.16904))

Pierre Fernandez, Tom Sander

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA is an RL framework that jointly optimizes multiple fine-grained rubrics for role-playing agents, such as persona consistency, domain knowledge, and dialogue quality, using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT‑4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. It's especially relevant if you're trying to get smaller models to behave like premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))

Chonghua Liao, Ke Wang

BEAVER: An Efficient Deterministic LLM Verifier

BEAVER is a deterministic verifier for large language models that computes tight, provably sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness-, privacy-, and security-critical applications.

Tarun Suresh, Nalin Wadhwa
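
To illustrate what "sound bounds without sampling" can mean for the BEAVER entry above, here is a toy sketch that enumerates prefixes of a tiny hand-written language model and tracks how much probability mass has provably satisfied or violated a constraint. This is not BEAVER's algorithm or its data structures: the priority queue, the `probability_bounds` function, the toy model, and the "no `b` token" constraint are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, Tuple

EOS = "<eos>"
Prefix = Tuple[str, ...]


def probability_bounds(
    next_token_probs: Callable[[Prefix], Dict[str, float]],
    satisfies: Callable[[Prefix], bool],
    max_expansions: int,
) -> Tuple[float, float]:
    """Return (lower, upper) bounds on P(completed sequence satisfies constraint).

    Mass of finished sequences is assigned exactly; unexplored mass could go
    either way, so it widens the interval instead of being guessed.
    """
    sat_mass = 0.0
    viol_mass = 0.0
    # Max-heap on prefix probability (negated keys), starting from the empty prefix
    frontier = [(-1.0, ())]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_mass, prefix = heapq.heappop(frontier)
        for token, prob in next_token_probs(prefix).items():
            mass = -neg_mass * prob
            if token == EOS:
                if satisfies(prefix):
                    sat_mass += mass
                else:
                    viol_mass += mass
            else:
                heapq.heappush(frontier, (-mass, prefix + (token,)))
    return sat_mass, 1.0 - viol_mass


def toy_model(prefix: Prefix) -> Dict[str, float]:
    # Hypothetical model: same next-token distribution after every prefix
    return {"a": 0.5, "b": 0.25, EOS: 0.25}


def no_b(sequence: Prefix) -> bool:
    return "b" not in sequence


for budget in (1, 10, 100):
    print(budget, probability_bounds(toy_model, no_b, budget))
```

For this toy model the true probability is 0.5, and the interval always contains it while shrinking as the budget grows. A real verifier would also prune prefixes that can no longer satisfy the constraint, which this sketch skips.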