Reasoning

Research papers, repositories, and articles about reasoning

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.

Seokju Cho, Ryo Hachiuma

STEP3-VL-10B Technical Report

STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.

Ailin Huang, Chengyuan Yao

SakanaAI/AI-Scientist-v2

Implements AI Scientist v2, which runs agentic tree search over experiments. Pushes toward semi-automated scientific discovery instead of just paper drafting.

4,938

Reasoning Models Generate Societies of Thought

This work argues that strong reasoning models behave like small societies of internal agents with different "personalities" and expertise. Diversity and internal debate, not just longer chains of thought, drive their gains.

Junsol Kim, Shiyang Lai

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

This work changes reinforcement learning for LLMs to reward correct but uncommon solution strategies, not just the first one that works. That raises pass@k without tanking single-answer performance, which matters if you sample multiple candidates.

Zhiyuan Hu, Yucheng Wang
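
For readers unfamiliar with the pass@k metric mentioned in the summary above, here is a minimal sketch of the widely used unbiased estimator: draw n samples, count c correct ones, and estimate the chance that at least one of k randomly chosen samples is correct. This is the standard estimator from code-generation evaluation and is assumed here for illustration; the summary does not say which evaluation protocol the paper uses, and the numbers below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts, for illustration only (not results from the paper).
print(pass_at_k(n=10, c=2, k=1))
print(pass_at_k(n=10, c=2, k=5))
```

With the same number of correct samples, the estimate grows with k, which is why methods that diversify correct solutions matter most when you sample several candidates.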

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

This paper trains a retriever to select past reasoning traces that actually help solve a new problem, then uses those traces during reinforcement fine-tuning. On hard math benchmarks like AIME, their analogy-aware method beats standard reinforcement setups by several points, showing that reasoning-aware retrieval is a real lever.

Zilin Xiao, Qi Ma

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS stress-tests data-analysis agents over thousands of turns built from real Kaggle notebooks. Even top models collapse as sessions grow, with huge drops in late-turn accuracy. If you ship analytic agents, you should be benchmarking on LongDS or something like it, not just short chat tasks.

Kewei Xu, Xiaoben Lu

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Trains a 32B-parameter code model on synthetic “thinking traces” and hardware execution logs. Targets chip design, GPU tuning, and embedded code with explicit reasoning steps.

Jian Yang, Wei Zhang

ARC Prize 2025: Technical Report

This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.

François Chollet, Mike Knoop

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

A "reasoner" model and a "discriminator" model train together so the discriminator flags wrong steps in math solutions, not just wrong final answers. This joint training gives dense step-level rewards and boosts math benchmark scores for existing open models like DeepSeek-R1 distills without huge extra compute. ([ar5iv.org](https://ar5iv.org/abs/2512.16917))

Qihao Liu, Luoxin Ye

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

TIDE turns an AI agent from a reactive helper into a proactive bug hunter. It iteratively surfaces hidden problems in docs and code, using reusable "thought templates" to ground each finding in real evidence. If you run agents over large workspaces, this suggests how to search for issues users haven’t even noticed yet.

Soyeong Jeong, Jinheon Baek

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

VideoKR builds a massive dataset of expert-domain videos paired with hard reasoning questions. Models trained here must combine visual understanding with background knowledge, not just pattern-match frames. This is a key testbed if you care about video models that can actually reason about real-world events.

Lin Fu, Zheyuan Yang

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

VideoKR gives you 315k tough reasoning questions over 145k expert videos. It’s built to push models beyond captioning toward real multi-step explanations. Use it to pressure-test any video model that claims "understanding" rather than just pattern matching.

Lin Fu, Zheyuan Yang

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Builds a massive graph of how AI methods evolve across 1M+ papers, with over 9M typed edges between techniques. Lets agents and humans trace method lineages, score idea novelty, and auto-generate new research directions. If you scout research or design AI research agents, treat this as a new data layer, not just another paper.

Yujun Wu, Dongxu Zhang

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT lets a model "pause" and think over selected spans instead of every token. It aims to keep reasoning strong while cutting compute and revealing what the model is thinking about.

Yunlong Chu, Minglai Shao

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.

Fangzhi Xu, Hang Yan

ART: Adaptive Reasoning Trees for Explainable Claim Verification

ART makes models verify claims by building explicit argument trees instead of spitting out one opaque chain of thought. That structure lets a judge model compare supporting and attacking evidence, making fact-checking more transparent and easier to audit.

Sahil Wadhwa, Himanshu Kumar

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

HGMem turns the “scratchpad” of a multi-step retrieval system into a hypergraph that connects many related facts at once. This richer memory structure helps language models keep global context straight over long tasks, boosting performance on challenging reasoning and long-document benchmarks.

Chulun Zhou, Chunkang Zhang

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Describes the QwenLong-L1.5 post-training recipe for extending LLM context windows while keeping reasoning quality intact. The work focuses not just on positional encodings but also on memory management strategies and training curricula that keep long-context performance from collapsing. This is highly relevant for anyone trying to turn a baseline LLM into a stable long-context model without re‑training from scratch.

Weizhou Shen, Ziyi Yang

NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

Replaces mixture-of-experts with a finer-grained mixture-of-neurons for reasoning tasks. The goal is more interpretable and steerable thinking steps.

Haonan Dong, Kehan Jiang

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

This paper trains small reasoning models with rewards that check whether each intermediate step actually follows from earlier ones. That reduces reward hacks where the model spews long but logically broken chains of thought.

Shuo Nie, Hexuan Deng

TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning

TIME teaches dialogue models to drop short "thinking" blocks only when time gaps or context shifts actually demand deeper reasoning. Models keep answers compact while still reasoning hard when conversations get tricky or span days instead of seconds.

Susmit Das

Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

Defines four testable criteria for what a "good" internal thought representation should satisfy, separate from task scores. Finds that current models systematically fail these tests. If you probe activations or build latent-thought pipelines, this gives a sharper evaluation target. ([arxiv.org](https://arxiv.org/list/cs.CL/new))

Fahd Seddik, Fatemeh Fard

Language-Guided Abstraction for Visual Reasoning

This paper tackles ARC-style abstract reasoning by adding a language branch that refines human-written task descriptions with a large model. The system uses those cleaned descriptions to guide a compact 18M-parameter visual model, outperforming prior ARC methods while keeping the final model lightweight.

Xu-Jing Ye, Yuan-Gen Wang

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

This paper shows why some reinforcement learning with verifiable rewards (RLVR) methods push models toward bloated answers. The authors fix the loss so you can improve reasoning without secretly optimizing for longer outputs.

Fanfan Liu, Youyang Yin

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

The authors find sparse "circuits" inside language models that drive math reasoning and selectively strengthen only those pieces. They report up to 11.4% accuracy gains while touching about 1.6% of model components, keeping other skills like MMLU almost unchanged. ([ar5iv.org](https://ar5iv.org/abs/2512.16914))

Nikhil Prakash, Donghao Ren

d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Targets RL for diffusion LLMs by introducing d-TreeRPO, which uses tree-structured rollouts and bottom-up advantage computation with verifiable outcome rewards for fine-grained credit assignment. The method also adds a time-scheduled self-distillation loss to improve probability estimates, yielding large gains on Sudoku, Countdown, GSM8K, and Math500 over existing RL baselines. ([arxiv.org](https://arxiv.org/abs/2512.09675))

Leyi Pan, Shuchang Tao
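
To make "tree-structured rollouts with bottom-up advantage computation" concrete, here is a generic sketch. It assumes each leaf carries a verifiable outcome reward, each internal node's value is the mean of its children's values, and a child's advantage is its value minus its parent's value. These rules, and the names `Node`, `backup`, and `assign_advantages`, are illustrative assumptions and not the formulas from the d-TreeRPO paper.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Node:
    reward: Optional[float] = None  # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0


def backup(node: Node) -> float:
    """Bottom-up pass: a leaf's value is its reward, an internal node's is the child mean."""
    if not node.children:
        if node.reward is None:
            raise ValueError("leaf node is missing an outcome reward")
        node.value = float(node.reward)
    else:
        node.value = mean(backup(child) for child in node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """Top-down pass: credit each branch by how much it beats its parent's value."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)


# Hypothetical rollout tree: branch `a` always succeeds, branch `b` succeeds half the time.
a = Node(children=[Node(reward=1.0), Node(reward=1.0)])
b = Node(children=[Node(reward=1.0), Node(reward=0.0)])
root = Node(children=[a, b])
backup(root)
assign_advantages(root)
print(root.value, a.advantage, b.advantage)
```

Because sibling branches share a prefix, the difference in their values isolates the effect of the step where they diverge, which is the sense in which tree rollouts give finer-grained credit than a single reward per full sequence.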

Thinking with Images via Self-Calling Agent

Introduces sCoT, where a main language agent delegates visual subtasks to self-calling subagents rather than running a fully interleaved multimodal CoT. This makes high-resolution visual reasoning more data- and compute-efficient while still beating strong baselines on HR-Bench and related multimodal benchmarks. ([huggingface.co](https://huggingface.co/papers/2512.08511))

Wenxi Yang, Yuzhong Zhao

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

Lets language models write Answer Set Programs, then uses feedback from a symbolic solver to iteratively fix their code. Shows this combo handles default rules and exceptions better than standard constraint solvers on diverse logic tasks. If you are building reasoning-heavy agents, this is a concrete recipe for bolting on symbolic reliability. ([arxiv.org](https://arxiv.org/abs/2604.27960))

Adam Ishay, Joohyung Lee

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wraps language models in a loop of self-critique and revision for long-form writing. Focuses on deeper reasoning, not just surface polish.

Wanlong Liu, Bo Zhang

Reverse Thinking Enhances Missing Information Detection in Large Language Models

Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))

Yuxin Liu, Chaojie Gu

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

This paper dissects why reinforcement learning with verifiable rewards (RLVR) can improve math reasoning even when rewards look noisy or misleading. It shows how clipping and reward noise reduce the model’s randomness in useful ways and offers principles for designing better reasoning-focused training runs. ([ar5iv.org](https://ar5iv.org/abs/2512.16912))

Peter Chen, Xiaopeng Li

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

The Daily Papers summary underlines how reward clipping and entropy tricks interact in RL for reasoning. Read this before you copy any popular reward setups for math models.

Peter Chen, Xiaopeng Li

Thinking with Images via Self-Calling Agent

Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT where a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and yields better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))

Wenxi Yang, Yuzhong Zhao

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

Shows that explicit chain-of-thought often hurts emotion recognition compared with quick answers from the same model. Introduces a training setup that combines "fast" and "slow" heads so they work together instead of fighting. If you build emotional or social agents, this is a blueprint for more stable behavior. ([arxiv.org](https://arxiv.org/list/cs.AI/new))

Zhiyuan Han, Beier Zhu

MMhops-R1: Multimodal Multi-hop Reasoning

Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences—over images and text—to reach a final answer, going beyond simple single-hop VQA. As LLMs get better at basic multimodal QA, these kinds of chain-of-thought, multi-hop setups are where reasoning gaps now show up, so having a dedicated resource here is valuable.

Tao Zhang, Ziqi Zhang

Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Looks for the best “control points” inside a model’s thinking steps where interventions actually stick. Goal: steer reasoning without wrecking quality.

Haomin Zhuang, Hojun Yoo

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

Using a NeurIPS data curation challenge, this paper shows that picking hard, aligned examples beats just adding more or more diverse data. For vision–language reasoning, curation quality matters more than dataset size.

Yosub Shin, Michael Buriek

RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Optimizes which branches a graph-of-thoughts system actually runs. Cuts redundant reasoning steps while trying to keep answer quality similar.

Yuhang Liu, Ruijie Wang

Error-Driven Prompt Optimization for Arithmetic Reasoning

Targets the surprisingly hard problem of getting small on‑prem LLMs to do reliable arithmetic over tabular data in regulated environments. The authors propose an error-driven loop that clusters the model’s wrong answers, derives new prompt rules to address those failure modes, and iteratively refines a code-generation agent. On a finance-style deployment with a 4B-parameter model, this strategy reportedly boosts arithmetic accuracy to around 70% while keeping all computation inside the secure environment.

Árpád Pándy, Róbert Lakatos

LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

The authors use token-level uncertainty to decide when an LLM should think longer in games like tic-tac-toe. Low entropy means short context and reasoning, high entropy triggers more examples and multiple reasoning paths.

Tommaso Felice Banfi, Sashenka Gamage
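
The entropy-gated idea in the summary above can be sketched as follows: compute the Shannon entropy of the model's distribution over candidate moves, then spend more in-context examples and reasoning paths only when that entropy is high. The thresholds, the budget values, and the function names are hypothetical choices for illustration and do not come from the paper.

```python
import math
from typing import Dict, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a probability distribution (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def reasoning_budget(probs: Sequence[float], low: float = 0.5, high: float = 1.5) -> Dict[str, int]:
    """Map uncertainty to a budget; `low` and `high` are assumed thresholds in bits."""
    h = shannon_entropy(probs)
    if h < low:
        return {"examples": 0, "paths": 1}  # confident: answer directly
    if h < high:
        return {"examples": 2, "paths": 1}  # unsure: add a few examples
    return {"examples": 4, "paths": 3}      # very unsure: more examples, several paths


# Hypothetical move distributions on a tic-tac-toe board.
print(reasoning_budget([0.97, 0.01, 0.01, 0.01]))  # one move clearly preferred
print(reasoning_budget([1 / 9] * 9))               # uniform over nine squares
```

The appeal of this design is that the extra reasoning cost is only paid on positions where the model is actually uncertain.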

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Defines V‑REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.

Chenrui Fan, Yijun Liang