Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Summary
Trains LLM agents to "imagine" future states and score plans, not just react step by step. They use a three-stage pipeline to inject forecasting, format it, then harden it with reinforcement learning. If you build agents that plan, this is a concrete recipe for giving them an internal world model. ([arxiv.org](https://arxiv.org/list/cs.AI/new))
Related Content
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
OPV (Outcome-based Process Verifier) is a verifier model that inspects the rationale steps of long chains-of-thought via summarized outcomes, combining the strengths of outcome-based and process-based verification. Trained with an active learning loop, rejection fine-tuning, and RLVR, OPV reaches strong F1 on OPV-Bench and outperforms much larger models like Qwen3-Max-Preview at detecting reasoning errors.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
This work presents a long-horizon reasoning agent for Olympiad-level math that uses an Outcome-based Process Verifier (OPV) to supervise and clean up very long chains-of-thought. By summarizing and checking reasoning segments rather than only final answers, and training OPV via iterative active learning and RLVR, the system achieves new SOTA on a held-out benchmark while reducing annotation cost.
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
T-pro 2.0 is an open-weight Russian large language model focused on hybrid reasoning: it can answer directly or emit explicit reasoning traces, and it’s optimized for low-latency inference via speculative decoding. Alongside the model, the authors release a Russian instruction corpus, a math benchmark, and an EAGLE-based inference stack, making it a practical foundation for Russian-language reasoning applications.
huggingface/transformers
The standard library for state-of-the-art models in text, vision, audio, and combined formats. If you build with open models, you almost certainly depend on this already.