Reasoning LLMs spend more tokens on failures than successes, study finds

Source: arXiv (via 24 AI citation)

TL;DR

On June 28, 2026, 24 AI summarized a new arXiv paper (2606.26502) by Han‑yu Wang reporting that large reasoning models expend more tokens on tasks they ultimately get wrong than on those they solve, in sharp contrast to human behavior on the same benchmark. The study measures this effect across multiple models on the H‑ARC benchmark and finds large effect sizes (Cohen’s d between 1.47 and 3.13, well above the conventional 0.8 threshold for a large effect).
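
For readers unfamiliar with the statistic, Cohen’s d expresses the gap between two group means in units of their pooled standard deviation. The sketch below shows one common way to compute it for token counts on failed versus solved tasks. The function name, the sign convention (failed minus solved), and the token counts are illustrative assumptions, not data or code from the paper, and the paper may use a different variant of the statistic.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (one common definition).

    A positive value means group_a has the larger mean.
    Each group needs at least two observations.
    """
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical reasoning-token counts per task, invented for illustration.
failed_tokens = [5200, 6100, 4800, 7000, 5600]
solved_tokens = [2100, 2600, 1900, 3000, 2400]

d = cohens_d(failed_tokens, solved_tokens)
print(f"Cohen's d (failed minus solved): {d:.2f}")
```

Under this convention, a positive d means more tokens were spent on failures, which is the direction the summary describes for the models.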

About this summary

This article aggregates reporting from 2 news sources. The TL;DR is AI-generated from original reporting. Race to AGI's analysis provides editorial context on implications for AGI development.

Race to AGI Analysis

This paper pokes at a comforting intuition about “thinking harder”: that a model spending more tokens on reasoning must be more likely to get the answer right. At least on H‑ARC, the opposite seems true for current large reasoning models: they pour more tokens into the problems they end up failing, whereas humans tend to disengage on the hardest items. This suggests that today’s chain‑of‑thought heuristics are closer to brute‑force exploration than to calibrated metacognition. The model doesn’t really know when it’s out of its depth; it just keeps talking. ([24-ai.news](https://24-ai.news/en/?utm_source=openai))

For the race to AGI, this matters because many safety and capability proposals hinge on “let it think more on the hard stuff.” If extra tokens aren’t accompanied by better self‑assessment, we risk building systems that sound more confident exactly when they’re most wrong. The work also hints that we may need explicit mechanisms—like confidence estimators, external verifiers, or learned stopping rules—to get closer to human‑like effort allocation. In a world where models can autonomously act, the ability to recognize when to stop, escalate, or defer to a human is as important as raw accuracy.

The broader takeaway is that AGI won’t emerge simply by cranking up context windows and reasoning budgets. We also need architectures and training objectives that produce something closer to introspective judgment about when more “thinking tokens” are actually helping versus just hallucinating at greater length.