UnderthinkingBench: LLM Reasoning Evaluation
- UnderthinkingBench is a benchmarking framework that quantifies shallow reasoning in LLMs by measuring premature chain-of-thought termination on complex, multi-step tasks.
- The evaluation protocol employs novel metrics like Overthinking-Adjusted Accuracy and unified F₁, revealing a trade-off between computational efficiency and reasoning depth.
- Mitigation strategies such as explicit prompting, dynamic routing, and adaptive inference are explored to advance optimal chain-of-thought modulation in language models.
UnderthinkingBench is a benchmarking framework and evaluation protocol expressly designed to probe and quantify underthinking in LLMs. Underthinking denotes a failure of reasoning depth: models produce insufficiently extended chains of thought (CoT) on tasks that require multi-step computational reasoning. The UnderthinkingBench sub-benchmark, introduced in "OptimalThinkingBench: Evaluating Over and Underthinking in LLMs" (Aggarwal et al., 18 Aug 2025), provides a rigorous standard for evaluating reasoning models on challenging problem instances, in sharp contrast to overthinking, where models waste tokens on unnecessary reasoning for simple tasks. UnderthinkingBench's careful design, domain coverage, and novel metrics now serve as reference points for both academic research and practical model development.
1. Conceptual Overview and Definition
Underthinking, for the purposes of UnderthinkingBench, is the phenomenon where LLMs halt their internal reasoning processes too early or provide overly shallow explanations on complex reasoning problems—especially when a sequence of intermediate steps is necessary for a correct solution. This is distinct from token-level brevity or answer correctness. An underthinking LLM may generate concise but ultimately incorrect responses, missing out on vital computational or logical operations embedded in the task.
In practice, underthinking commonly affects non-thinking LLMs—models that eschew chain-of-thought generation in favor of direct answers. However, the phenomenon is not restricted to non-thinking models, as certain architectural or prompting regimes may also cause premature termination of reasoning in models otherwise capable of deep thinking. The significance is clear: absent sufficient reasoning, LLMs systematically underperform on tasks requiring compositionality, intermediate reflection, or multi-step computation.
2. Benchmark Composition and Task Selection
UnderthinkingBench consists of 11 reasoning-intensive task types across multiple domains, selected to stress-test models' capacity for careful step-by-step inference. The domains include:
- Games (maze navigation, knight swap, puzzle 24, tsumego)
- Algorithms (ab-style pattern recognition, letter counting)
- Graphs (quantum locks: deducing shortest paths with explicit CoT)
- Arithmetic (bitwise operations, fraction simplification)
- Geometry and Logic (advanced geometry, propositional logic)
Each task is generated procedurally using the Reasoning Gym framework, with 50 instances per task for comprehensive coverage (550 examples in total). The tasks are deliberately chosen so that small thinking models (LLMs capable of explicit chain-of-thought) outperform much larger non-thinking LLMs. This construction ensures that underthinking penalties are strongly surfaced.
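A minimal sketch of this style of procedural generation is shown below, assuming the open-source reasoning-gym Python package and its create_dataset API; the task identifiers and seed are illustrative placeholders rather than the benchmark's exact configuration.

```python
# Sketch: procedurally generating task instances in the style of
# UnderthinkingBench, assuming the reasoning-gym package
# (pip install reasoning-gym). Task names are illustrative.
import reasoning_gym

TASKS = ["maze", "knight_swap", "letter_counting", "fraction_simplification"]
INSTANCES_PER_TASK = 50  # the benchmark uses 11 tasks x 50 instances = 550 examples

benchmark = {}
for task in TASKS:
    # create_dataset yields dict-like items with "question", "answer", "metadata"
    dataset = reasoning_gym.create_dataset(task, size=INSTANCES_PER_TASK, seed=42)
    benchmark[task] = [
        {"question": item["question"], "answer": item["answer"]}
        for item in dataset
    ]

print({task: len(items) for task, items in benchmark.items()})
```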
UnderthinkingBench is paired with OverthinkingBench (72 simple, domain-diverse queries) to form the OptimalThinkingBench unified benchmark, capturing the dual aspects of computational efficiency and reasoning depth.
3. Evaluation Metrics: Standard and Novel
Standard accuracy (fraction of correct responses) is applied to UnderthinkingBench, focusing on outright solution correctness. On the unified benchmark, thinking-adjusted metrics are introduced to penalize answer correctness achieved with excessive or insufficient token use.
A formal metric is the Overthinking-Adjusted Accuracy (OAA) on the OverthinkingBench, computed for a given token-budget threshold $T$: a response counts as correct only if it is correct and its thinking-token count $t_i$ stays within the budget,

$$\mathrm{OAA}_T = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\text{correct}_i \,\wedge\, t_i \le T\right].$$

This is aggregated over a grid of thresholds $\mathcal{T}$ as the area under the OAA-versus-threshold curve,

$$\mathrm{AUC}_{\mathrm{OAA}} = \frac{1}{|\mathcal{T}|}\sum_{T \in \mathcal{T}} \mathrm{OAA}_T.$$

The unified F₁ metric on OptimalThinkingBench is the harmonic mean of efficiency and reasoning accuracy,

$$F_1 = \frac{2 \cdot \mathrm{AUC}_{\mathrm{OAA}} \cdot \mathrm{Acc}_{\mathrm{UT}}}{\mathrm{AUC}_{\mathrm{OAA}} + \mathrm{Acc}_{\mathrm{UT}}},$$

where $\mathrm{Acc}_{\mathrm{UT}}$ is the accuracy on UnderthinkingBench.
This suite of metrics forces models to be simultaneously efficient (on OverthinkingBench) and sufficiently deep (on UnderthinkingBench).
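The combined scoring can be made concrete with a short sketch. The snippet below computes OAA at several token budgets, its area-under-curve aggregate, and the unified F₁; the per-example field names and the threshold grid are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of the metric pipeline described above. Field names and the
# threshold grid are illustrative assumptions, not the paper's exact setup.
from statistics import mean

def oaa(results, budget):
    """Overthinking-Adjusted Accuracy: a response counts as correct only if
    it is right AND its thinking-token count stays within the budget."""
    return mean(
        1.0 if r["correct"] and r["thinking_tokens"] <= budget else 0.0
        for r in results
    )

def auc_oaa(results, budgets=(1000, 2000, 4000, 8000, 16000)):
    """Aggregate OAA over a grid of token budgets (discrete AUC)."""
    return mean(oaa(results, b) for b in budgets)

def unified_f1(overthinking_results, underthinking_accuracy):
    """Harmonic mean of efficiency (AUC-OAA on OverthinkingBench) and
    reasoning depth (accuracy on UnderthinkingBench)."""
    eff = auc_oaa(overthinking_results)
    if eff + underthinking_accuracy == 0:
        return 0.0
    return 2 * eff * underthinking_accuracy / (eff + underthinking_accuracy)

# Toy per-example records for OverthinkingBench responses.
over = [
    {"correct": True, "thinking_tokens": 120},
    {"correct": True, "thinking_tokens": 9000},
    {"correct": False, "thinking_tokens": 300},
]
print(unified_f1(over, underthinking_accuracy=0.41))
```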
4. Experimental Results Across Models
Thirty-three models, encompassing both open and proprietary LLMs and both thinking/non-thinking variants, were systematically evaluated. Major findings:
- No existing LLM achieves jointly optimal performance on simple and complex tasks.
- Thinking LLMs (with explicit CoT output) greatly outperform non-thinking LLMs on UnderthinkingBench but tend to over-generate tokens for trivial queries.
- Non-thinking models approach 100% accuracy on simple factual OverthinkingBench items, yet consistently fail on complex UnderthinkingBench queries.
- Only five models exceed a 50% unified F₁ score. The proprietary model o3 leads at 72.7%; GPT-OSS-120B is the strongest open-weight model at 62.5%.
These results demonstrate an intrinsic trade-off: computational efficiency on simple tasks versus reasoning capability on hard tasks. Current LLMs are unable to modulate reasoning depth contextually without explicit mechanisms.
5. Mitigation Strategies and Trade-Offs
Several approaches to mitigating suboptimal thinking, covering both overthinking and underthinking, are discussed and empirically tested:
- Length-based reward shaping: Training/fine-tuning with token-length penalties for excessive output. This constrains overthinking but may induce underthinking on harder tasks.
- Dynamic routing: Use of trained routers to select thinking or non-thinking mode based on predicted task difficulty. Routers increase overall performance (up to 11% absolute F₁ improvement) but do not match the performance of oracle selectors.
- Explicit prompting: “Don’t overthink” and “step-by-step” prompts adjust token usage, but improper application can either worsen underthinking or exacerbate overthinking.
- Adaptive inference: Mechanisms such as entropy-based stopping (Yong et al., 23 May 2025) use confidence signals to decide when to halt reasoning, helping models modulate reasoning depth at inference time.
These mitigation methods yield incremental improvements, each with trade-offs: gains on OverthinkingBench may come at the expense of UnderthinkingBench, and vice versa, illustrating the fundamental difficulty of the optimal reasoning challenge.
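As one concrete illustration of the dynamic-routing idea discussed above, the sketch below trains a lightweight difficulty classifier over prompt text and uses it to choose a thinking or non-thinking mode per query. The toy training data, features, and threshold are assumptions for illustration; they do not reproduce the routers evaluated in the paper.

```python
# Hypothetical difficulty router: decides per query whether to invoke a
# "thinking" (explicit CoT) or "non-thinking" (direct-answer) mode.
# Training data, features, and threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labels: 1 = needs deep reasoning, 0 = simple factual query.
prompts = [
    "What is the capital of France?",
    "Name the author of Hamlet.",
    "Navigate the maze from S to E, listing each move.",
    "Simplify 84/126 and show the intermediate steps.",
]
labels = [0, 0, 1, 1]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(prompts, labels)

def route(query: str, threshold: float = 0.5) -> str:
    """Return which decoding mode to use for this query."""
    p_hard = router.predict_proba([query])[0][1]
    return "thinking" if p_hard >= threshold else "non-thinking"

# Example routing decisions for a simple and a multi-step query.
print(route("What colour is the sky?"))
print(route("Find the shortest unlock sequence for the quantum lock."))
```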
6. Broader Implications and Future Directions
UnderthinkingBench, as part of OptimalThinkingBench, exposes critical deficiencies in contemporary LLMs: none can simultaneously modulate reasoning depth and computational efficiency in context. A plausible implication is that unified benchmarks and routing strategies will be necessary for API-based deployment, cost control, and robust problem-solving in real-world applications.
Research directions proposed include:
- Improved dynamic routing and adaptive inference, potentially leveraging entropy or calibration signals.
- Training objectives that jointly optimize accuracy and token cost.
- Benchmarks and metrics that explicitly capture underthinking and overthinking both for training and evaluation.
- Algorithmic advances in chain-of-thought modulation and self-assessment.
UnderthinkingBench’s design and empirical demonstrations strongly motivate the development of LLMs capable of optimal, context-sensitive reasoning processes.
7. Summary Table: UnderthinkingBench Features and Findings
| Feature | Description | Source Example |
|---|---|---|
| Task Domains | Games, algorithms, graphs, arithmetic, geometry, logic | (Aggarwal et al., 18 Aug 2025) |
| Number of Tasks | 11 challenging reasoning types, 550 instances | (Aggarwal et al., 18 Aug 2025) |
| Evaluation Metric | Accuracy; unified F₁ metric | (Aggarwal et al., 18 Aug 2025) |
| Key Finding | No LLM balances underthinking and overthinking optimally | (Aggarwal et al., 18 Aug 2025) |
| Mitigation Methods | Reward shaping, routing, prompting, adaptive inference | (Aggarwal et al., 18 Aug 2025; Yong et al., 23 May 2025) |
UnderthinkingBench thus constitutes a rigorous, programmatic tool for probing the limits of reasoning depth in LLMs, shaping both the methodology and objectives for future LLM research. Its structure provides a foundation for the development and assessment of models able to efficiently allocate computational resources in accordance with the complexity of their queries.