d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
Introduces d-TreeRPO, a reinforcement-learning method for diffusion LLMs that uses tree-structured rollouts and bottom-up advantage computation with verifiable outcome rewards for fine-grained credit assignment (a rough sketch of this scheme follows below). A time-scheduled self-distillation loss further improves the policy's probability estimates, yielding large gains over existing RL baselines on Sudoku, Countdown, GSM8K, and Math500. ([arxiv.org](https://arxiv.org/abs/2512.09675))
Leyi Pan, Shuchang Tao
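
As a rough illustration of how tree-structured rollouts can enable fine-grained credit assignment, the minimal sketch below propagates verifiable leaf rewards up a rollout tree and scores each branch against its sibling mean. The `Node` class, the mean-of-children backup, and the sibling-baseline advantage are illustrative assumptions for this sketch, not the paper's exact formulation of d-TreeRPO.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    # One (partially) denoised sequence in the rollout tree.
    children: List["Node"] = field(default_factory=list)
    reward: float = 0.0      # verifiable outcome reward; only meaningful at leaves
    value: float = 0.0       # filled in by the bottom-up pass
    advantage: float = 0.0   # value relative to siblings (group-relative baseline)

def backup_values(node: Node) -> float:
    """Bottom-up pass: a leaf's value is its outcome reward; an internal
    node's value is the mean of its children's values (one possible choice)."""
    if not node.children:
        node.value = node.reward
        return node.value
    node.value = sum(backup_values(c) for c in node.children) / len(node.children)
    return node.value

def assign_advantages(node: Node) -> None:
    """Top-down pass: each child's advantage is its value minus the mean
    value of its sibling group, giving per-branch credit assignment."""
    if not node.children:
        return
    baseline = sum(c.value for c in node.children) / len(node.children)
    for c in node.children:
        c.advantage = c.value - baseline
        assign_advantages(c)

# Toy usage: a prompt branching into two partial completions, each branching
# into two full completions scored 0/1 by a verifier.
root = Node(children=[
    Node(children=[Node(reward=1.0), Node(reward=0.0)]),
    Node(children=[Node(reward=0.0), Node(reward=0.0)]),
])
backup_values(root)
assign_advantages(root)
```

The resulting per-node advantages could then weight a policy-gradient update on the tokens generated along each branch; the actual d-TreeRPO objective and its time-scheduled self-distillation loss are specified in the paper.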