ArXiv Paper

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Peter Holderrieth, Douglas Chen, Luca Eyring +7

February 6, 2026

Summary

Diamond Maps reframe reward alignment as learning a transport map over model outputs instead of tweaking rewards token by token. This gives smoother, more sample-efficient updates and shows strong results across safety-style alignment tasks.

Topics

alignment rl optimization

Related Content

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

ENPIRE wraps real robots in a closed-loop system in which coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

AgentDoG 1.5 is a small family of guardrail models trained on a detailed agent-risk taxonomy with surprisingly few samples. They can sit in front of powerful agents, flag dangerous actions, and run cheaply. If you build tool-using agents, this is emerging as a standard safety baseline to copy or test against.

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

The authors build thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, this setup is a cheap way to generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))