RLHF
Research papers, repositories, and articles about RLHF
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
This paper is a systematic exploration of reinforcement learning for text-to-3D generation, dissecting reward design, RL algorithms, data scaling, and hierarchical optimization. The authors introduce a new benchmark (MME-3DR), propose Hi-GRPO for global-to-local 3D refinement, and build AR3D-R1—the first RL-tuned text-to-3D model that improves both global shape quality and fine-grained texture alignment.
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
This work changes reinforcement learning for LLMs to reward correct but uncommon solution strategies, not just the first one that works. That raises pass@k without tanking single-answer performance, which matters if you sample multiple candidates.
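For readers unfamiliar with pass@k, the metric can be illustrated with the standard unbiased estimator used in code-generation evaluation. This is a minimal sketch of the metric only, not the paper's uniqueness-aware reward; the function name `pass_at_k` and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    pass@k is the probability that at least one of k solutions drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets with no correct solution;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 samples, 3 correct, budget of 5 attempts.
    print(pass_at_k(n=10, c=3, k=5))
```

With these hypothetical counts the estimate is about 0.917, well above the single-sample rate of 0.3, which is why diversity among correct strategies matters when several candidates are sampled.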
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
A "reasoner" model and a "discriminator" model train together so the discriminator flags wrong steps in math solutions, not just wrong final answers. This joint training gives dense step-level rewards and boosts math benchmark scores for existing open models like DeepSeek-R1 distills without huge extra compute. ([ar5iv.org](https://ar5iv.org/abs/2512.16917))
Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection
Builds on Direct Preference Optimization but tackles its weak learning signal when both preferred and rejected responses share similar flaws. RPO adds a hint-guided reflection step that encourages the model to produce more contrastive, informative preference pairs before optimizing them. The result is a more stable and data-efficient on-policy alignment pipeline that still avoids full RLHF/RLAIF complexity.
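As background, the standard Direct Preference Optimization loss for a single preference pair can be sketched as follows. This shows only the baseline DPO objective that RPO builds on, not RPO's hint-guided reflection step; the function name `dpo_loss`, the argument names, and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss depends on the pair only through the margin, so how informative a pair is comes down to how differently the policy and reference score the two responses.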
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
This paper shows why some reinforcement learning with verifiable rewards (RLVR) methods push models toward bloated answers. The authors fix the loss so that reasoning can improve without implicitly optimizing for longer outputs.
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Uses a language model’s own feedback as the reinforcement learning signal for training retrieval rerankers in RAG pipelines, with the aim of selecting more useful documents for question answering.
Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
This paper dissects why reinforcement learning with verifiable rewards (RLVR) can improve math reasoning even when rewards look noisy or misleading. It shows how clipping and reward noise reduce the model’s output randomness (policy entropy) in useful ways and offers principles for designing better reasoning-focused training runs. ([ar5iv.org](https://ar5iv.org/abs/2512.16912))
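The two quantities named in the title, clipping and entropy, can be made concrete with a small sketch. It shows the generic PPO-style clipped surrogate for a single token and the Shannon entropy of a categorical distribution; it does not reproduce the paper's analysis, and the function names and the `eps` default are illustrative assumptions.

```python
import math
from typing import Sequence


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective for one token (to be maximized).

    ratio is pi_new(token) / pi_old(token); advantage is its estimated
    advantage. Taking the minimum removes any gain from moving the ratio
    outside [1 - eps, 1 + eps] in the direction the advantage favors.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # A ratio above 1 + eps with positive advantage is capped at (1 + eps) * A.
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
    # A peaked distribution has lower entropy than a uniform one.
    print(entropy([0.25, 0.25, 0.25, 0.25]), entropy([0.9, 0.05, 0.03, 0.02]))
```

Lower entropy means the policy concentrates probability on fewer tokens, which is the sense of "reduced randomness" in the summary above.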
Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
InternGeometry is a geometry-solving LLM agent that reaches medalist-level performance on IMO geometry problems by tightly integrating with a symbolic engine. It proposes auxiliary constructions and propositions, verifies them symbolically, reflects on the feedback, and is trained with a complexity-boosting RL curriculum—achieving 44/50 problems solved using a tiny fraction of the data required by AlphaGeometry 2.