Alignment

Research papers, repositories, and articles about alignment


Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

RPO builds on Direct Preference Optimization (DPO) but addresses its weak learning signal when the preferred and rejected responses share similar flaws. It adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing on them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids the complexity of full RLHF/RLAIF.

Zihui Zhao, Zechang Li
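
For context on the RPO entry above, the following is a minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair, which is the objective RPO is described as building on. It is not RPO itself: the hint-guided reflection step is not shown, and the function name, argument names, and the default `beta` are illustrative assumptions rather than anything taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    All inputs are sequence-level log-probabilities (floats).
    `beta` is an assumed illustrative value, not one from the RPO paper.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A pair the policy does not separate from the reference has zero margin.
flat_pair = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# A pair where the policy already favors the preferred response.
contrastive_pair = dpo_loss(-5.0, -12.0, -8.0, -9.0)
```

At zero margin the loss equals log 2, and it falls as the policy favors the preferred response more strongly than the reference does. Because the loss depends only on how the two responses differ, a pair whose responses share the same flaw cannot tell the model anything about that flaw, which is the limitation the summary says RPO targets.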

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA is a reinforcement learning (RL) framework that jointly optimizes multiple fine-grained rubrics for role-playing agents (such as persona consistency, domain knowledge, and dialogue quality) using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT‑4o and Claude on PersonaGym and RoleMRC, suggesting that better objective design can take smaller models a long way. ([huggingface.co](https://huggingface.co/papers/2512.09756))

Chonghua Liao, Ke Wang

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA aligns role-playing agents along many competing dimensions at once, using multi-objective RL and thought-augmented rollouts. It is especially relevant when the goal is to get smaller models to match premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))

Chonghua Liao, Ke Wang

BEAVER: An Efficient Deterministic LLM Verifier

BEAVER is a deterministic verifier for large language models that computes tight, provably sound bounds on the probability that a model satisfies a given semantic constraint. Instead of sampling and hoping for the best, it systematically explores the token space with specialized data structures, yielding much sharper risk estimates for applications where correctness, privacy, or security is critical.

Tarun Suresh, Nalin Wadhwa
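
To illustrate the general idea behind the BEAVER entry above (deterministic bounds from systematic exploration rather than sampling), here is a toy sketch. It is not BEAVER's algorithm or its data structures. It assumes a prefix-closed constraint (once a prefix violates it, every extension does too), and `toy_next_token_probs`, `prefix_ok`, and `probability_bounds` are hypothetical names introduced only for this example.

```python
import heapq

EOS = "<eos>"


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for an LLM: a fixed next-token distribution."""
    return {"a": 0.5, "b": 0.3, EOS: 0.2}


def prefix_ok(tokens):
    """Hypothetical prefix-closed constraint: the output never contains 'b'."""
    return "b" not in tokens


def probability_bounds(next_token_probs, prefix_ok, max_expansions):
    """Return (lower, upper) bounds on P(completed output satisfies constraint).

    lower: probability mass of finished sequences known to satisfy it.
    upper: lower plus the mass of unexplored prefixes that might still satisfy it.
    """
    lower = 0.0
    # Max-heap on prefix probability (stored negated); the counter breaks ties.
    frontier = [(-1.0, 0, ())]
    counter = 1
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_p, _, prefix = heapq.heappop(frontier)
        for token, p in next_token_probs(prefix).items():
            mass = -neg_p * p
            if token == EOS:
                # Every prefix on the frontier already satisfies the constraint.
                lower += mass
            elif prefix_ok(prefix + (token,)):
                heapq.heappush(frontier, (-mass, counter, prefix + (token,)))
                counter += 1
            # Violating prefixes are dropped: their mass can never satisfy it.
    upper = lower + sum(-neg_p for neg_p, _, _ in frontier)
    return lower, upper


lower, upper = probability_bounds(toy_next_token_probs, prefix_ok, max_expansions=10)
```

For this toy model the exact probability is 0.4 (the geometric sum of 0.5^n × 0.2 over n ≥ 0), and the returned interval brackets that value, shrinking as `max_expansions` grows. Expanding the highest-probability prefix first is one simple choice for tightening the interval quickly; the entry above does not say which exploration strategy BEAVER uses.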