Reasoning
Research papers, repositories, and articles about reasoning
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.
STEP3-VL-10B Technical Report
STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.
SakanaAI/AI-Scientist-v2
Implements AI Scientist v2, which runs agentic tree search over experiments. Pushes toward semi-automated scientific discovery instead of just paper drafting.
Reasoning Models Generate Societies of Thought
This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
This work changes reinforcement learning for LLMs to reward correct but uncommon solution strategies, not just the first one that works. That raises pass@k without tanking single-answer performance, which matters if you sample multiple candidates.
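For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers is correct. Below is a minimal sketch of the commonly used unbiased estimator, computed from n samples of which c are correct. The function name `pass_at_k` and the example numbers are illustrative assumptions and are not taken from the paper, whose exact evaluation protocol is not described in this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct answer. math.comb returns 0 when
    # k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 10 and c = 2, the estimate is 0.2 at k = 1 and about 0.78 at k = 5, which illustrates why the metric matters when several candidates are sampled.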
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
This paper trains a retriever to select past reasoning traces that actually help solve a new problem, then uses those traces during reinforcement fine-tuning. On hard math benchmarks like AIME, the analogy-aware method beats standard reinforcement setups by several points, showing that reasoning-aware retrieval is a real lever.
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
LongDS-Bench stress-tests data-analysis agents over thousands of turns built from real Kaggle notebooks. Even top models collapse as sessions grow, with huge drops in late-turn accuracy. If you ship analytic agents, benchmark on LongDS-Bench or something like it, not just short chat tasks.
InCoder-32B-Thinking: Industrial Code World Model for Thinking
Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.
ARC Prize 2025: Technical Report
This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
A "reasoner" model and a "discriminator" model train together so the discriminator flags wrong steps in math solutions, not just wrong final answers. This joint training gives dense step-level rewards and boosts math benchmark scores for existing open models like DeepSeek-R1 distills without huge extra compute. ([ar5iv.org](https://ar5iv.org/abs/2512.16917))
TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration
TIDE turns an AI agent from a reactive helper into a proactive bug hunter. It iteratively surfaces hidden problems in docs and code, using reusable "thought templates" to ground each finding in real evidence. If you run agents over large workspaces, this suggests how to search for issues users haven’t even noticed yet.
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding
VideoKR builds a massive dataset of expert-domain videos paired with hard reasoning questions. Models trained here must combine visual understanding with background knowledge, not just pattern-match frames. This is a key testbed if you care about video models that can actually reason about real-world events.
VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding
VideoKR gives you 315k tough reasoning questions over 145k expert videos. It’s built to push models beyond captioning toward real multi-step explanations. Use it to pressure-test any video model that claims "understanding" rather than just pattern matching.
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Builds a massive graph of how AI methods evolve across 1M+ papers, with over 9M typed edges between techniques. Lets agents and humans trace method lineages, score idea novelty, and auto-generate new research directions. If you scout research or design AI research agents, treat this as a new data layer, not just another paper.
SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.
ART: Adaptive Reasoning Trees for Explainable Claim Verification
ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.
Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
HGMem turns the “scratchpad” of a multi-step retrieval system into a hypergraph that connects many related facts at once. This richer memory structure helps language models keep global context straight over long tasks, boosting performance on challenging reasoning and long-document benchmarks.
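To make the idea of a hypergraph memory concrete, here is a minimal sketch of a store in which one hyperedge can link any number of facts at once, unlike an ordinary graph edge that links exactly two. The class `HypergraphMemory`, its methods, and the example facts are hypothetical and chosen for illustration; this is not HGMem's actual design, which the summary does not specify.

```python
from collections import defaultdict


class HypergraphMemory:
    """Toy hypergraph: each hyperedge groups an arbitrary set of fact ids."""

    def __init__(self):
        self.edges = {}                    # edge id -> frozenset of fact ids
        self.incidence = defaultdict(set)  # fact id -> ids of edges containing it

    def add_edge(self, edge_id, facts):
        """Register one hyperedge connecting all the given facts."""
        members = frozenset(facts)
        self.edges[edge_id] = members
        for fact in members:
            self.incidence[fact].add(edge_id)

    def related(self, fact):
        """Return every fact that shares at least one hyperedge with `fact`."""
        found = set()
        for edge_id in self.incidence.get(fact, ()):
            found |= self.edges[edge_id]
        found.discard(fact)
        return found
```

After adding one edge over `["alice", "acme", "2019"]` and another over `["acme", "berlin"]`, calling `related("acme")` returns all three of the other facts in a single lookup.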
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Describes the QwenLong-L1.5 post-training recipe for extending LLM context windows while keeping reasoning quality intact. The work focuses not just on positional encodings but also on memory management strategies and training curricula that keep long-context performance from collapsing. This is highly relevant for anyone trying to turn a baseline LLM into a stable long-context model without re‑training from scratch.
NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.
Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.
TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
TIME teaches dialogue models to drop short "thinking" blocks only when time gaps or context shifts actually demand deeper reasoning. Models keep answers compact while still reasoning hard when conversations get tricky or span days instead of seconds.
Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
Defines four testable criteria for what a "good" internal thought representation should satisfy, separate from task scores. Finds that current models systematically fail these tests. If you probe activations or build latent-thought pipelines, this gives a sharper evaluation target. ([arxiv.org](https://arxiv.org/list/cs.CL/new))
Language-Guided Abstraction for Visual Reasoning
This paper tackles ARC-style abstract reasoning by adding a language branch that refines human-written task descriptions with a large model. The system uses those cleaned descriptions to guide a compact 18M-parameter visual model, outperforming prior ARC methods while keeping the final model lightweight.
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
This paper shows why some reinforcement learning with verifiable rewards (RLVR) methods push models toward bloated answers. The authors fix the loss so you can improve reasoning without secretly optimizing for longer outputs.
Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
The authors find sparse "circuits" inside language models that drive math reasoning and selectively strengthen only those pieces. They report up to 11.4% accuracy gains while touching about 1.6% of model components, keeping other skills like MMLU almost unchanged. ([ar5iv.org](https://ar5iv.org/abs/2512.16914))
d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
Targets RL for diffusion LLMs by introducing d-TreeRPO, which uses tree-structured rollouts and bottom-up advantage computation with verifiable outcome rewards for fine-grained credit assignment. The method also adds a time-scheduled self-distillation loss to improve probability estimates, yielding large gains on Sudoku, Countdown, GSM8K, and Math500 over existing RL baselines. ([arxiv.org](https://arxiv.org/abs/2512.09675?utm_source=openai))
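As a rough illustration of tree-structured rollouts with bottom-up advantage computation, the sketch below backs outcome rewards up from the leaves and scores each child relative to its parent. The specific rules used here (inner value = mean of child values, advantage = child value minus parent value) and the `Node` and `backup` names are assumptions made for the example; the summary does not give d-TreeRPO's exact formulas, and the self-distillation loss is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    reward: Optional[float] = None  # set only on leaves (verifiable outcome)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0


def backup(node: Node) -> float:
    """Fill in values bottom-up and assign each child an advantage."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.name!r} has no reward")
        node.value = node.reward
        return node.value
    # Assumed rule: an inner node's value is the mean of its children's values.
    node.value = sum(backup(child) for child in node.children) / len(node.children)
    for child in node.children:
        # Assumed rule: a child's advantage is how much it beats its parent.
        child.advantage = child.value - node.value
    return node.value
```

In a tree where branch `a` has two correct leaves and branch `b` has one correct and one wrong leaf, the root value is 0.75, branch `a` gets advantage +0.25, and branch `b` gets -0.25, so credit is assigned at the step where the rollouts diverge instead of only at the final answer.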
Thinking with Images via Self-Calling Agent
Introduces sCoT, where a main language agent delegates visual subtasks to self-calling subagents rather than running a fully interleaved multimodal CoT. This makes high-resolution visual reasoning more data- and compute-efficient while still beating strong baselines on HR-Bench and related multimodal benchmarks. ([huggingface.co](https://huggingface.co/papers/2512.08511))
LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning
Lets language models write Answer Set Programs, then uses feedback from a symbolic solver to iteratively fix their code. Shows this combo handles default rules and exceptions better than standard constraint solvers on diverse logic tasks. If you are building reasoning-heavy agents, this is a concrete recipe for bolting on symbolic reliability. ([arxiv.org](https://arxiv.org/abs/2604.27960))
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning
Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.
Reverse Thinking Enhances Missing Information Detection in Large Language Models
Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))
Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
This paper dissects why reinforcement learning with verifiable rewards (RLVR) can improve math reasoning even when rewards look noisy or misleading. It shows how clipping and reward noise reduce the model’s randomness in useful ways and offers principles for designing better reasoning-focused training runs. ([ar5iv.org](https://ar5iv.org/abs/2512.16912))
Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
The Daily Papers summary underlines how reward clipping and entropy tricks interact in RL for reasoning. Read this before you copy any popular reward setups for math models.
Thinking with Images via Self-Calling Agent
Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT where a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and yields better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))
MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy
Shows that explicit chain-of-thought often hurts emotion recognition compared with quick answers from the same model. Introduces a training setup that combines "fast" and "slow" heads so they work together instead of fighting. If you build emotional or social agents, this is a blueprint for more stable behavior. ([arxiv.org](https://arxiv.org/list/cs.AI/new))
MMhops-R1: Multimodal Multi-hop Reasoning
Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences over images and text to reach a final answer, going beyond simple single-hop VQA. As LLMs improve at basic multimodal QA, reasoning gaps increasingly show up in multi-hop setups like these, so a dedicated resource is valuable.
Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.
RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning
Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.
Error-Driven Prompt Optimization for Arithmetic Reasoning
Targets the surprisingly hard problem of getting small on‑prem LLMs to do reliable arithmetic over tabular data in regulated environments. The authors propose an error-driven loop that clusters the model’s wrong answers, derives new prompt rules to address those failure modes, and iteratively refines a code-generation agent. On a finance-style deployment with a 4B-parameter model, this strategy reportedly boosts arithmetic accuracy to around 70% while keeping all computation inside the secure environment.
LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning
The authors use token-level uncertainty to decide when an LLM should think longer in games like tic-tac-toe. Low entropy means short context and reasoning, high entropy triggers more examples and multiple reasoning paths.
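The gating rule described above can be sketched as follows: compute the Shannon entropy of the next-token distribution and pick a larger reasoning budget only when it exceeds a cut-off. The function names, the 1.0-nat threshold, and the budget values are hypothetical placeholders, not settings reported by the authors.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a next-token distribution given as probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_budget(probs, threshold=1.0):
    """Pick a reasoning budget from token-level uncertainty.

    `threshold` (in nats) and the returned counts are illustrative only.
    """
    if shannon_entropy(probs) < threshold:
        # Confident prediction: short context, single reasoning path.
        return {"examples": 0, "paths": 1}
    # Uncertain prediction: add in-context examples and sample several paths.
    return {"examples": 4, "paths": 3}
```

A peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` stays on the cheap path, while a uniform distribution over nine moves (entropy ln 9, roughly 2.2 nats) triggers the larger budget.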
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Defines V-REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to "reason" rather than just match patterns.