RL
Research papers, repositories, and articles about RL
Showing 24 of 24 items
Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
Shows that in tool-use RL, models often "forget" how to call tools because specific control tokens spike in probability, which breaks the output format even though the underlying skill remains intact. Interleaving supervised updates with RL and adding richer hints stabilizes training across formats and tasks. If your agent RL runs keep collapsing, this paper is a playbook. ([huggingface.co](https://huggingface.co/papers/2606.26027))
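One simple way to picture the interleaving is a fixed schedule in which every k-th update is supervised and the rest are RL. This is only a sketch under that assumption: `build_update_schedule` and `sft_every` are hypothetical names, and the paper's actual schedule may be adaptive or differ in other ways.

```python
from typing import List


def build_update_schedule(total_steps: int, sft_every: int) -> List[str]:
    """Return one update type per training step.

    Hypothetical schedule: every `sft_every`-th step is a supervised
    ("sft") update on correctly formatted tool-call demonstrations,
    and every other step is an RL ("rl") update.
    """
    if total_steps < 0 or sft_every < 1:
        raise ValueError("total_steps must be >= 0 and sft_every >= 1")
    schedule = []
    for step in range(1, total_steps + 1):
        # Steps are 1-indexed so the first supervised update lands on step k.
        schedule.append("sft" if step % sft_every == 0 else "rl")
    return schedule
```

A training loop would then dispatch on each entry, for example running a cross-entropy step on demonstrations for "sft" and a policy-gradient step for "rl".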
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
The authors design a reward scheme that scores agents on how well they build evidence chains with proper citations, not just final answers. Their new training method reduces shortcut tricks and hallucinated claims, so deep research agents behave more like careful analysts.
Discretizing Reward Models
Shows that continuous reward models often assign very different scores to equally good answers, which encourages reward hacking and bad policies. Clustering rewards into a few discrete levels using Monte Carlo dropout reduces this oversensitivity and leads to better RL outcomes. If you're training policies on reward models, this is a strong argument to discretize. ([huggingface.co](https://huggingface.co/papers/2606.21795))
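To make the idea concrete, here is a minimal sketch of one way to collapse noisy continuous scores into a few levels. It assumes each response has already been scored several times with dropout left on, and it starts a new level only when the gap between neighboring mean scores exceeds the average sampling noise. The function name, the `noise_scale` parameter, and the gap-based grouping rule are illustrative assumptions, not the paper's procedure.

```python
from statistics import mean, stdev
from typing import List


def discretize_rewards(samples: List[List[float]], noise_scale: float = 1.0) -> List[int]:
    """Map each response to a discrete reward level (0 = lowest).

    samples[i] holds at least two stochastic reward scores for response i,
    e.g. from repeated forward passes with dropout enabled.
    """
    if not samples:
        return []
    means = [mean(s) for s in samples]
    # Average per-response standard deviation, used as a noise estimate.
    noise = mean(stdev(s) for s in samples)
    threshold = noise_scale * noise
    # Walk responses in order of mean score and open a new level only
    # when the jump from the previous mean is larger than the noise.
    order = sorted(range(len(samples)), key=lambda i: means[i])
    levels = [0] * len(samples)
    level = 0
    for prev, cur in zip(order, order[1:]):
        if means[cur] - means[prev] > threshold:
            level += 1
        levels[cur] = level
    return levels
```

Because grouping is based on neighboring gaps, a long run of closely spaced scores ends up in a single level. A real implementation would likely use a proper clustering step instead.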
Reinforcement World Model Learning for LLM-based Agents
The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Argues you can reuse the policy and reference from RL post-training to define a "progress advantage" signal instead of training a separate process reward model. This gives dense step-wise scores for agents while avoiding another fragile model in the loop. If you're drowning in reward-model complexity, this suggests a cheaper alignment path. ([huggingface.co](https://huggingface.co/papers/2606.26080))
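The blurb does not give the formula for the progress advantage. One common way to get a dense signal from a policy/reference pair is the log-probability ratio between the two models, summed over the tokens of each step. The sketch below assumes that form; `step_progress_scores` and `beta` are hypothetical names, and the paper's definition may differ.

```python
from typing import List


def step_progress_scores(
    policy_logps: List[List[float]],
    ref_logps: List[List[float]],
    beta: float = 1.0,
) -> List[float]:
    """Return one score per agent step.

    policy_logps[s][t] and ref_logps[s][t] are the log-probabilities that
    the post-trained policy and the reference model assign to token t of
    step s. Assumed score: beta * sum_t (log pi - log pi_ref).
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("both models must score the same number of steps")
    scores = []
    for p_step, r_step in zip(policy_logps, ref_logps):
        if len(p_step) != len(r_step):
            raise ValueError("token counts differ within a step")
        scores.append(beta * sum(p - r for p, r in zip(p_step, r_step)))
    return scores
```

A positive score means the post-trained policy prefers that step more strongly than the reference does. No extra reward model is involved, only two forward passes over the same trajectory.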
Rethinking Agentic Reinforcement Learning In Large Language Models
Synthesizes the fast-growing literature on reinforcement learning for agent-style language models, from environment design to safety and compute limits. Argues the key shift is treating models as long-lived decision-makers, not one-shot text generators. If you’re planning big training runs for agents, use this as a design checklist, not just a citation. ([databubble.co](https://databubble.co/news/rethinking-agentic-reinforcement-learning-in-large-language-models?utm_source=openai))
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
Surveys how teams use reinforcement learning plus GUI interaction to push beyond simple desktop macros into always-on "digital inhabitants". Breaks the space into offline, online, and hybrid strategies, and highlights trends like world-model training and process-level rewards. If you’re automating real GUI workflows, treat this as a roadmap, not just a survey. ([arxiv.org](https://arxiv.org/abs/2604.27955))
Reinforcement World Model Learning for LLM-based Agents
RWML trains agents to imagine next states and then align them with the states actually observed, instead of just predicting the next token. That shift gives stronger gains on text-based environments than a reward on the final score alone.
Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.
BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
BAPO trains search-based agents not just to answer, but to know when to say "I don't know". It adds special rewards that encourage honest uncertainty without letting agents abuse that response to duck work.
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
MatchTIR stops treating every step in a tool-using trajectory equally. It uses bipartite matching to align predicted tool traces with gold traces, then assigns rewards per step, making small models competitive with larger ones on long tool-use tasks.
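The matching step can be illustrated with a small, self-contained sketch. It assumes a tool call is a (name, arguments) pair, uses a made-up similarity score (1.0 for an exact match, 0.5 for the right tool with different arguments, 0.0 otherwise), and finds the best one-to-one assignment by exhaustive search. The similarity function and all names here are assumptions for illustration, not MatchTIR's actual scoring.

```python
from itertools import permutations
from typing import List, Optional, Tuple

ToolCall = Tuple[str, str]  # (tool name, serialized arguments)


def similarity(pred: ToolCall, gold: ToolCall) -> float:
    """Hypothetical similarity between a predicted and a gold tool call."""
    if pred == gold:
        return 1.0
    if pred[0] == gold[0]:
        return 0.5
    return 0.0


def per_step_rewards(preds: List[ToolCall], golds: List[ToolCall]) -> List[float]:
    """Reward each predicted call by its partner in the best one-to-one matching.

    Exhaustive search over assignments, so only suitable for short traces.
    """
    n_pred, n_gold = len(preds), len(golds)
    # Pad with None so surplus predictions can stay unmatched (reward 0).
    slots: List[Optional[int]] = list(range(n_gold)) + [None] * max(0, n_pred - n_gold)
    best_total, best_rewards = -1.0, [0.0] * n_pred
    for assignment in permutations(slots, n_pred):
        rewards = [
            0.0 if g is None else similarity(p, golds[g])
            for p, g in zip(preds, assignment)
        ]
        total = sum(rewards)
        if total > best_total:
            best_total, best_rewards = total, rewards
    return best_rewards
```

For longer traces, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the exhaustive loop.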
MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization
MaxCode treats code optimization as a reinforcement learning search over code edits guided by runtime feedback. It uses natural-language critiques and a reward model to steer generation, beating past systems at speeding up CUDA and C++ kernels.
WildSci: Advancing Scientific Reasoning from In-the-Wild Literature
WildSci builds a large question set from real scientific papers across many fields, then uses reinforcement learning to sharpen models’ scientific reasoning. It moves science QA beyond toy benchmarks and gives labs a more realistic way to stress-test research assistants.
d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
Targets RL for diffusion LLMs by introducing d-TreeRPO, which uses tree-structured rollouts and bottom-up advantage computation with verifiable outcome rewards for fine-grained credit assignment. The method also adds a time-scheduled self-distillation loss to improve probability estimates, yielding large gains on Sudoku, Countdown, GSM8K, and Math500 over existing RL baselines. ([arxiv.org](https://arxiv.org/abs/2512.09675?utm_source=openai))
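A rough sketch of bottom-up advantage computation on a rollout tree, under simplifying assumptions: leaves carry a verifiable outcome reward, an internal node's value is the mean of its children's values, and each child's advantage is its value minus its parent's value. The `Node` class and this particular estimator are illustrative and may not match d-TreeRPO's exact formulation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    reward: Optional[float] = None  # set on leaves only
    children: List["Node"] = field(default_factory=list)


def node_value(node: Node) -> float:
    """Bottom-up value: leaf reward, or the mean of the children's values."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.name!r} has no reward")
        return node.reward
    return sum(node_value(c) for c in node.children) / len(node.children)


def advantages(root: Node) -> Dict[str, float]:
    """Advantage of each non-root node relative to its parent's value.

    Values are recomputed per node for clarity; caching them would avoid
    the repeated traversals on large trees.
    """
    out: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        if parent.children:
            baseline = node_value(parent)
            for child in parent.children:
                out[child.name] = node_value(child) - baseline
                stack.append(child)
    return out
```

Branches that lead to more successful leaves than their siblings receive positive advantages at every level, which is what gives credit assignment finer granularity than a single trajectory-level reward.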
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
The authors release LMEE-Bench to test how agents explore and remember in long-horizon 3D tasks. Their MemoryExplorer method trains a vision-language model with reinforcement learning to actively query and use episodic memory.
SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning
Argues that current task-oriented agents are over-optimized as passive followers and under-use conversation as an action. SpeakRL introduces a reinforcement-learning setup that rewards models for asking clarifying questions when the user’s intent is ambiguous, balancing ‘asking’ vs ‘acting’. On synthetic task-oriented dialogue scenarios, the trained agents substantially improve task completion rates without bloating the number of turns, suggesting that proactive clarification is a powerful, underused control knob.
Thinking with Images via Self-Calling Agent
Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT where a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and yields better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))
Co-Evolving Policy Distillation
Unifies two popular post-training styles and shows why naively merging many expert policies can lose capabilities. Proposes a bidirectional distillation loop where student and experts improve together. If you juggle multiple specialist models, this offers a more stable way to fold them into one. ([huggingface.co](https://huggingface.co/papers/2604.27083))
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
Hugging Face frames Fed-SE as a way to let LLM agents "self-evolve" across different clients and environments without sharing raw trajectories. For people deploying agents in regulated or siloed settings, it’s an interesting recipe for federated RL that reduces gradient conflicts across heterogeneous tasks. ([huggingface.co](https://huggingface.co/papers/2512.08870))
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
Fed-SE is a federated learning framework for LLM agents that must improve across heterogeneous environments under strict privacy constraints. It combines local parameter-efficient fine-tuning on high-return trajectories with global aggregation in a low-rank subspace, reducing negative transfer and boosting average success rates by ~18% over federated baselines. ([huggingface.co](https://huggingface.co/papers/2512.08870))
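The two stages described above can be sketched in a few lines. The sketch assumes that each client keeps only trajectories whose return clears a threshold, and that the server averages small low-rank adapter matrices (for example LoRA-style factors) element-wise instead of full model weights. Function names, the threshold rule, and plain averaging are all assumptions made for illustration; Fed-SE's actual aggregation rule may differ.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def select_high_return(trajectories: List[Tuple[str, float]], min_return: float) -> List[str]:
    """Client side: keep trajectory ids whose return is at least min_return."""
    return [traj_id for traj_id, ret in trajectories if ret >= min_return]


def average_adapters(adapters: List[Matrix]) -> Matrix:
    """Server side: element-wise mean of same-shaped low-rank adapter matrices.

    Only these small matrices leave the clients; raw trajectories stay local.
    """
    if not adapters:
        raise ValueError("need at least one client adapter")
    rows, cols = len(adapters[0]), len(adapters[0][0])
    for m in adapters:
        if len(m) != rows or any(len(row) != cols for row in m):
            raise ValueError("all adapters must have the same shape")
    return [
        [sum(m[i][j] for m in adapters) / len(adapters) for j in range(cols)]
        for i in range(rows)
    ]
```

In a full system the local step between these two functions would be parameter-efficient fine-tuning on the selected trajectories, which is omitted here.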
MOA: Multi-Objective Alignment for Role-Playing Agents
MOA is an RL framework that jointly optimizes multiple fine-grained rubrics for role-playing agents (such as persona consistency, domain knowledge, and dialogue quality) using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT-4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. ([huggingface.co](https://huggingface.co/papers/2512.09756))
MOA: Multi-Objective Alignment for Role-Playing Agents
MOA aligns role-playing agents along many competing dimensions at once, using multi-objective RL and thought-augmented rollouts. It’s especially relevant if you’re trying to get smaller models to behave like premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))
huggingface/OpenEnv
OpenEnv is an interface library for training and evaluating reinforcement-learning style agents across many environments. It targets post-training, giving a cleaner way to plug modern models into classic RL-style tasks.