RL

Research papers, repositories, and articles about reinforcement learning (RL)


d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Introduces d-TreeRPO, an RL method for diffusion LLMs that uses tree-structured rollouts and bottom-up advantage computation with verifiable outcome rewards for fine-grained credit assignment. The method also adds a time-scheduled self-distillation loss to improve probability estimates, and reports large gains over existing RL baselines on Sudoku, Countdown, GSM8K, and Math500. ([arxiv.org](https://arxiv.org/abs/2512.09675?utm_source=openai))

Leyi Pan, Shuchang Tao
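
To make "tree-structured rollouts with bottom-up advantage computation" concrete, here is a minimal sketch. It is not the paper's algorithm: the `Node` class, the mean-of-children value backup, and the "child value minus parent value" advantage are all assumptions chosen for illustration. Only the leaves carry a verifiable outcome reward.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One partial generation in a rollout tree (hypothetical structure)."""
    name: str
    reward: Optional[float] = None  # set only on leaves (verifiable outcome)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0


def backup_values(node: Node) -> float:
    """Bottom-up pass: a leaf's value is its reward; an internal
    node's value is the mean of its children's values (assumed rule)."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.name!r} has no outcome reward")
        node.value = node.reward
    else:
        child_values = [backup_values(child) for child in node.children]
        node.value = sum(child_values) / len(child_values)
    return node.value


def assign_advantages(node: Node) -> None:
    """Top-down pass: each child's advantage is its value minus its
    parent's value, so siblings are compared against a shared baseline."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)


# Toy tree: two branches from the prompt, two completions per branch.
root = Node("prompt", children=[
    Node("a", children=[Node("a1", reward=1.0), Node("a2", reward=0.0)]),
    Node("b", children=[Node("b1", reward=0.0), Node("b2", reward=0.0)]),
])
backup_values(root)
assign_advantages(root)
```

In this toy tree, branch `a` gets a positive advantage because one of its completions succeeds, and within `a` the successful leaf is credited relative to its sibling. That per-step signal is what "fine-grained credit assignment" refers to, as opposed to assigning one reward to the whole sequence.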

SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning

Argues that current task-oriented agents are over-optimized as passive followers and under-use conversation as an action. SpeakRL introduces a reinforcement-learning setup that rewards models for asking clarifying questions when the user’s intent is ambiguous, balancing ‘asking’ vs ‘acting’. On synthetic task-oriented dialogue scenarios, the trained agents substantially improve task completion rates without bloating the number of turns, suggesting that proactive clarification is a powerful, underused control knob.

Emre Can Acikgoz, Jinoh Oh

Thinking with Images via Self-Calling Agent

Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT where a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and yields better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))

Wenxi Yang, Yuzhong Zhao

Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents

Fed-SE is a federated learning framework for LLM agents that must improve across heterogeneous environments under strict privacy constraints. It combines local parameter-efficient fine-tuning on high-return trajectories with global aggregation in a low-rank subspace, reducing negative transfer and boosting average success rates by ~18% over federated baselines. ([huggingface.co](https://huggingface.co/papers/2512.08870))

Xiang Chen, Yuling Shi
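
As a rough illustration of "global aggregation in a low-rank subspace", the sketch below averages LoRA-style adapter factors across clients, weighted by a per-client weight. This is plain federated averaging swapped in for illustration; the summary does not state Fed-SE's actual aggregation rule. The names `lora_A`, `lora_B`, and `aggregate_adapters` are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def weighted_average(mats: List[Matrix], weights: List[float]) -> Matrix:
    """Element-wise weighted mean of equally shaped matrices."""
    total = sum(weights)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(mats, weights)) / total
         for j in range(cols)]
        for i in range(rows)
    ]


def aggregate_adapters(
    client_adapters: List[Dict[str, Matrix]],
    weights: List[float],
) -> Dict[str, Matrix]:
    """Average each named low-rank factor across clients.

    Only the small adapter matrices leave a client; raw trajectories
    stay local (assumed protocol, for illustration only).
    """
    return {
        name: weighted_average([c[name] for c in client_adapters], weights)
        for name in client_adapters[0]
    }


# Two clients with rank-1 adapters for a 2-dimensional layer.
clients = [
    {"lora_A": [[1.0, 0.0]], "lora_B": [[2.0], [0.0]]},
    {"lora_A": [[3.0, 2.0]], "lora_B": [[0.0], [4.0]]},
]
# Hypothetical weights, e.g. the number of high-return trajectories per client.
global_adapter = aggregate_adapters(clients, weights=[1.0, 3.0])
```

Averaging the two factors separately is not the same as averaging their products, which is one reason a real system may need a more careful aggregation step than this sketch.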

Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents

Hugging Face frames Fed-SE as a way to let LLM agents "self-evolve" across different clients and environments without sharing raw trajectories. For people deploying agents in regulated or siloed settings, it’s an interesting recipe for federated RL that reduces gradient conflicts across heterogeneous tasks. ([huggingface.co](https://huggingface.co/papers/2512.08870))

Xiang Chen, Yuling Shi

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA is an RL framework that jointly optimizes multiple fine-grained rubrics for role-playing agents—such as persona consistency, domain knowledge, and dialogue quality—using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT‑4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. ([huggingface.co](https://huggingface.co/papers/2512.09756))

Chonghua Liao, Ke Wang
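
One simple way to picture optimizing several rubrics jointly is to normalize each rubric's scores within a group of rollouts and combine them with weights. The sketch below does that. It is a generic weighted-sum baseline, not MOA's actual objective, and the rubric names and weights are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def normalize(scores: List[float]) -> List[float]:
    """Z-score the scores within one group of rollouts.
    A rubric with no variation contributes no learning signal."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def combined_advantages(
    rubric_scores: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted sum of per-rubric normalized scores, one value per rollout."""
    n_rollouts = len(next(iter(rubric_scores.values())))
    total = [0.0] * n_rollouts
    for name, scores in rubric_scores.items():
        for i, z in enumerate(normalize(scores)):
            total[i] += weights[name] * z
    return total


# Two rollouts scored on two hypothetical rubrics.
scores = {
    "persona_consistency": [1.0, 3.0],
    "dialogue_quality": [2.0, 2.0],
}
advantages = combined_advantages(
    scores, weights={"persona_consistency": 0.5, "dialogue_quality": 0.5}
)
```

Normalizing per rubric keeps a rubric with a large numeric scale from dominating the others; the weights then express how much each dimension should matter.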

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA is highlighted as a way to align role-playing agents along many competing dimensions simultaneously using multi-objective RL and thought-augmented rollouts. It’s especially relevant if you’re trying to get smaller models to behave like premium chatbots in complex, persona-heavy domains. ([huggingface.co](https://huggingface.co/papers/2512.09756))

Chonghua Liao, Ke Wang