Discretizing Reward Models
Shows that continuous reward models often assign very different scores to equally good answers, an oversensitivity that encourages reward hacking and produces worse policies. Clustering rewards into a few discrete levels using Monte Carlo dropout reduces this oversensitivity and leads to better RL outcomes. If you're training policies against a reward model, this is a strong argument for discretizing its scores. ([huggingface.co](https://huggingface.co/papers/2606.21795))
Vijay Viswanathan, Shiqi Wang
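
To make the idea in the summary concrete, here is a minimal sketch of one way to turn noisy continuous rewards into a few discrete levels. It is not the paper's algorithm: the function name `discretize_rewards`, the threshold `z`, and the adjacent-gap merging rule are all assumptions chosen for illustration. It assumes you already have several reward scores per response, for example from repeated forward passes of the reward model with dropout left active.

```python
from statistics import mean, stdev

def discretize_rewards(samples, z=2.0):
    """Map noisy continuous rewards to integer levels (0 = lowest).

    samples[i] holds two or more reward scores for response i, e.g. from
    repeated stochastic forward passes (dropout active). Hypothetical rule,
    not taken from the paper: two responses that are neighbours in mean
    reward share a level unless their gap exceeds z times the larger of
    their two standard deviations.
    """
    stats = [(mean(s), stdev(s)) for s in samples]
    # Visit responses from lowest to highest mean reward.
    order = sorted(range(len(samples)), key=lambda i: stats[i][0])
    levels = [0] * len(samples)
    level = 0
    for prev, cur in zip(order, order[1:]):
        gap = stats[cur][0] - stats[prev][0]
        noise = z * max(stats[prev][1], stats[cur][1])
        if gap > noise:
            level += 1  # gap is larger than the noise: open a new level
        levels[cur] = level
    return levels

# Two near-identical answers and one clearly worse answer (made-up scores).
samples = [
    [0.80, 0.90, 0.85],
    [0.82, 0.88, 0.86],
    [0.10, 0.20, 0.15],
]
levels = discretize_rewards(samples)
print(levels)  # [1, 1, 0]
```

The first two responses land on the same level because their mean scores differ by less than the sampling noise, so a policy gets no extra credit for chasing that difference. One caveat of this particular rule: merging only adjacent neighbours can chain, so a long run of small gaps may place quite different scores on one level.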