Alignment
Research papers, repositories, and articles about alignment
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
AgentDoG 1.5 is a small family of guardrail models trained on a detailed agent-risk taxonomy with surprisingly few samples. They can sit in front of powerful agents, flag dangerous actions, and run cheaply. If you build tool-using agents, this is a safety baseline worth copying or testing against.
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Argues you can reuse the policy and reference models from RL post-training to define a "progress advantage" signal instead of training a separate process reward model. This gives dense step-wise scores for agents while avoiding another fragile model in the loop. If you're drowning in reward-model complexity, this suggests a cheaper alignment path. ([huggingface.co](https://huggingface.co/papers/2606.26080))
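To make the idea of a step-wise signal from a policy/reference pair concrete, here is a minimal sketch. It assumes the score for each agent step is the scaled log-probability ratio between the two models, summed over that step's tokens. This is a common way to read an implicit reward out of such a pair, but whether the paper defines "progress advantage" exactly this way is an assumption; the function name and `beta` parameter are hypothetical.

```python
from typing import List, Sequence


def step_log_ratio_scores(
    policy_logps: Sequence[Sequence[float]],
    ref_logps: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> List[float]:
    """Score each step as beta * (log pi(step) - log pi_ref(step)).

    policy_logps[i] and ref_logps[i] hold the per-token log-probabilities
    that the post-trained policy and the reference model assign to the
    tokens of step i. Assumption: this log-ratio form is illustrative and
    not necessarily the paper's exact definition.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same steps")
    scores = []
    for pol, ref in zip(policy_logps, ref_logps):
        if len(pol) != len(ref):
            raise ValueError("each step needs matching token counts")
        # Sum of token log-probs is the log-probability of the whole step.
        scores.append(beta * (sum(pol) - sum(ref)))
    return scores


# Two steps: the policy prefers the first more than the reference does,
# and the second less.
example_scores = step_log_ratio_scores(
    policy_logps=[[-1.0, -0.5], [-2.0]],
    ref_logps=[[-1.5, -1.0], [-1.0]],
)
```

A positive score marks a step the post-trained policy favors more than the reference does; no extra reward model is queried.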
Verbalizing LLMs' assumptions to explain and control sycophancy
Gets models to spell out their hidden assumptions before answering. That makes it easier to spot flattery-driven answers and dial them down.
ART: Adaptive Reasoning Trees for Explainable Claim Verification
ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.
Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection
Builds on Direct Preference Optimization but tackles its weak learning signal when both preferred and rejected responses share similar flaws. RPO adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids full RLHF/RLAIF complexity.
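For reference, the standard DPO objective that RPO builds on can be sketched for a single preference pair as below. This is plain DPO only, computed from sequence log-probabilities; the hint-guided reflection step is not modeled, and the function names are illustrative.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. The loss is -log(sigmoid(margin)), where the
    margin compares how much the policy has moved toward the chosen
    response versus the rejected one, relative to the reference.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return softplus(-margin)  # identical to -log(sigmoid(margin))


# The policy treats both responses exactly as the reference does: margin is 0.
uninformative = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# The policy has moved toward the chosen response and away from the rejected one.
contrastive = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

With a zero margin the loss sits at log 2; it falls only as the policy separates the two responses.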
DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.
RobotValues: Evaluating Household Robots When Human Values Conflict
RobotValues throws household robots into 10,000 situations where human values clash, like privacy versus safety. Vision-language models often default to their own value preferences and fail 80% of the time when told to prioritize a different value. Use this benchmark if you’re serious about value-sensitive robot behavior, not just task success.
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))
The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models
This paper shows that giving models medical “doctor” personas helps on some emergency tasks but can hurt performance in primary care settings. Teams using personas for safety or expertise should test them task by task instead of assuming they always help.
Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.
How Good is Post-Hoc Watermarking With Language Model Rephrasing?
The authors study adding watermarks after the fact by having a model rewrite existing text while embedding tracking signals. They map how beam search, sampling tricks, and model size trade off between detection strength and text quality, and show watermarks work better on prose than on verifiable code. ([ar5iv.org](https://ar5iv.org/abs/2512.16904))
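As a rough illustration of what "detection strength" can mean, here is a sketch of a green-list style detector that reports a z-score. It assumes a simplified scheme where each (previous token, token) pair is hashed into a "green" set with probability `gamma`; the paper evaluates its own set of watermarking schemes, and this hash rule and all names here are hypothetical.

```python
import hashlib
import math
from typing import List, Sequence


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-randomly mark a token as green given its predecessor.

    Assumption: a simplified stand-in for a keyed vocabulary partition.
    """
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_from_counts(green: int, n: int, gamma: float = 0.5) -> float:
    """z-score of seeing `green` green tokens out of n if text is unwatermarked."""
    return (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


def watermark_z_score(tokens: Sequence[str], gamma: float = 0.5) -> float:
    """Count green tokens in a sequence and convert the count to a z-score."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    green = sum(is_green(prev, tok, gamma) for prev, tok in pairs)
    return z_from_counts(green, len(pairs), gamma)


def toy_watermarked_text(length: int, vocab: Sequence[str]) -> List[str]:
    """Toy 'rephraser' that prefers a green next token whenever one exists."""
    tokens = [vocab[0]]
    while len(tokens) < length:
        prev = tokens[-1]
        greens = [t for t in vocab if is_green(prev, t)]
        tokens.append(greens[0] if greens else vocab[0])
    return tokens


marked = toy_watermarked_text(50, list("abcdefghij"))
marked_z = watermark_z_score(marked)
```

A large positive z-score means far more green tokens than chance would produce, which is the evidence a detector thresholds on.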
MOA: Multi-Objective Alignment for Role-Playing Agents
MOA is an RL framework that jointly optimizes multiple fine-grained, often competing rubrics for role-playing agents, such as persona consistency, domain knowledge, and dialogue quality, using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT‑4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. It's especially relevant if you're trying to get smaller models to behave like premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))
BEAVER: An Efficient Deterministic LLM Verifier
BEAVER is a deterministic verifier for large language models that computes tight, provably sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for correctness-, privacy-, and security-critical applications.
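The general idea of sound bounds from systematic exploration can be illustrated with a toy best-first search over prefixes. This is a sketch under stated assumptions, not BEAVER's algorithm or data structures: `toy_model` is a hypothetical stand-in for a language model, and the constraint is a simple predicate on finished sequences.

```python
import heapq
from itertools import count
from typing import Callable, Dict, Tuple

EOS = "<eos>"
Prefix = Tuple[str, ...]


def toy_model(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical next-token distribution; forces EOS after three tokens."""
    if len(prefix) >= 3:
        return {EOS: 1.0}
    return {"safe": 0.6, "leak": 0.1, EOS: 0.3}


def bound_constraint_probability(
    model: Callable[[Prefix], Dict[str, float]],
    satisfies: Callable[[Prefix], bool],
    max_expansions: int,
) -> Tuple[float, float]:
    """Return (lower, upper) bounds on P(finished sequence satisfies constraint).

    Finished sequences that satisfy the constraint raise the lower bound;
    finished sequences that violate it lower the upper bound. Mass still
    sitting in unexplored prefixes is what separates the two.
    """
    satisfied_mass = 0.0
    violated_mass = 0.0
    tie = count()  # tie-breaker so the heap never compares prefixes
    frontier = [(-1.0, next(tie), ())]  # max-heap on prefix probability
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_prob, _order, prefix = heapq.heappop(frontier)
        for token, p in model(prefix).items():
            mass = -neg_prob * p
            if token == EOS:
                if satisfies(prefix):
                    satisfied_mass += mass
                else:
                    violated_mass += mass
            else:
                heapq.heappush(frontier, (-mass, next(tie), prefix + (token,)))
    return satisfied_mass, 1.0 - violated_mass


def no_leak(sequence: Prefix) -> bool:
    return "leak" not in sequence


loose = bound_constraint_probability(toy_model, no_leak, max_expansions=1)
tight = bound_constraint_probability(toy_model, no_leak, max_expansions=100)
```

Both bounds hold at any budget because every unit of probability mass is either classified or still in the frontier; a larger budget only narrows the gap. A practical verifier would also prune prefixes that already violate the constraint.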