OptimalThinkingBench Benchmark
- OptimalThinkingBench is a unified evaluation framework that measures how well LLMs balance computational efficiency with accurate reasoning across diverse tasks.
- It employs two sub-benchmarks, OverthinkingBench and UnderthinkingBench, to quantitatively measure excess reasoning and insufficient inference respectively.
- Empirical findings reveal that current LLMs struggle to adapt reasoning depth based on task complexity, underscoring the need for dynamic, adaptive architectures.
OptimalThinkingBench is a unified benchmark designed to evaluate and encourage the development of LLMs that dynamically balance the depth of their reasoning with task difficulty, striving to minimize "overthinking" on simple queries while avoiding "underthinking" on complex reasoning problems. The benchmark explicitly quantifies both the computational inefficiency from superfluous intermediate reasoning and the accuracy loss from insufficient reasoning, providing a principled framework, novel thinking-adjusted metrics, and a comprehensive empirical evaluation over 33 models. This approach directly addresses the need for LLMs that exhibit “optimal thinking”—adapting their cognitive effort in response to input complexity—rather than statically committing to either a chain-of-thought (“thinking”) or direct “non-thinking” mode (Aggarwal et al., 18 Aug 2025).
1. Motivation and Problem Formulation
OptimalThinkingBench arises from a core tension observed in recent LLM architectures and prompting strategies. So-called “thinking models”—those that systematically generate chains of intermediate reasoning tokens (e.g., via step-by-step chain-of-thought or scratchpad style outputs)—yield substantial gains on difficult reasoning tasks but frequently overthink on trivial problems, incurring hundreds of unnecessary tokens per query. Conversely, non-thinking LLMs, which directly produce answers without intermediary reasoning, are much more efficient on simple inputs but underperform on tasks requiring advanced inference.
This bifurcation has led to the proliferation of separate “thinking” and “non-thinking” LLM variants and the practical burden of model selection being placed on the end user. OptimalThinkingBench is introduced as the first rigorous testbed that measures both deficits—overthinking and underthinking—in LLMs, with the explicit aim of fostering models and protocols that allocate computational resources adaptively.
2. Composition and Methodology
OptimalThinkingBench is structured around two complementary sub-benchmarks that, when scored jointly, quantify the optimality of a model’s thinking strategy:
- OverthinkingBench consists of simple user queries (factual, arithmetic, short-form QA, etc.) spanning 72 distinct domains and four answer types (numeric, multiple-choice, short answer, open-ended). Each query is programmatically generated with rigorous filtering by an LLM-as-a-Judge, guaranteeing clarity, uniqueness, and alignment with the “should not need much reasoning” ethos.
- UnderthinkingBench comprises 11 difficult reasoning challenges (sourced from games, graph problems, algorithms, and arithmetic) constructed to elicit clear performance gaps between models that reason with explicit intermediate steps and those that do not. Verification mechanisms are fully programmatic (e.g., running code for algorithmic validation), ensuring that correctness is judged independently of output style.
The dataset generation for both sub-benchmarks employs a mixture of constraint-based sampling, synthetic query templating, and automatic answer validation, resulting in problems tailored to probe each form of thinking inefficiency.
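To make this pipeline concrete, the sketch below shows one plausible shape for a numeric OverthinkingBench-style generator: a templated arithmetic query whose ground truth is correct by construction, passed through a stubbed stand-in for the LLM-as-a-Judge filter. All function names and the filter criterion are illustrative assumptions, not the benchmark's actual code.

```python
import random

def make_arithmetic_query(rng: random.Random) -> dict:
    """Constraint-based sampling: small operands keep the query trivially easy."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return {
        "question": f"What is {a} + {b}?",
        "answer": str(a + b),          # ground truth is valid by construction
        "answer_type": "numeric",
    }

def judge_is_clear_and_unique(query: dict) -> bool:
    """Placeholder for the LLM-as-a-Judge filter; the real filter prompts a model
    to check clarity, uniqueness, and that the query needs little reasoning."""
    return len(query["question"]) < 200  # trivial stand-in criterion

def build_overthinking_split(n: int, seed: int = 0) -> list[dict]:
    """Generate n filtered queries, mimicking the templating-plus-validation loop."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        candidate = make_arithmetic_query(rng)
        if judge_is_clear_and_unique(candidate):
            samples.append(candidate)
    return samples

if __name__ == "__main__":
    print(build_overthinking_split(3))
```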
3. Thinking-Adjusted Metrics
A technical innovation of the benchmark is the Overthinking-Adjusted Accuracy (OAAₜ), defined as:

$$\mathrm{OAA}_t = \frac{1}{N}\sum_{i=1}^{N} a_i \cdot \mathbb{1}\left[T_i \le t\right],$$

where $a_i \in \{0,1\}$ denotes binary scoring of the answer for sample $i$, $T_i$ the number of intermediary reasoning tokens, and $t$ a variable threshold. The primary summary measure is the area under the OAAₜ curve (AUC) as $t$ varies up to a chosen maximum.
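As a minimal sketch of how these quantities could be computed, the snippet below evaluates OAAₜ over an even threshold grid and normalizes a trapezoidal AUC by the maximum threshold; the grid resolution and normalization here are assumptions, and the paper's exact sweep may differ.

```python
import numpy as np

def oaa_at_threshold(correct, thinking_tokens, t):
    """OAA_t: a sample counts only if it is correct AND used at most t thinking tokens."""
    correct = np.asarray(correct, dtype=float)
    tokens = np.asarray(thinking_tokens, dtype=float)
    return float(np.mean(correct * (tokens <= t)))

def oaa_auc(correct, thinking_tokens, t_max=1000, n_points=101):
    """Area under the OAA_t curve for t in [0, t_max], normalized to [0, 1]."""
    thresholds = np.linspace(0, t_max, n_points)
    curve = np.array([oaa_at_threshold(correct, thinking_tokens, t) for t in thresholds])
    dt = thresholds[1] - thresholds[0]
    auc = np.sum((curve[:-1] + curve[1:]) / 2.0) * dt   # trapezoidal rule
    return float(auc / t_max)

# Toy usage: three correct answers, one of which "overthought" with 800 tokens.
print(oaa_auc(correct=[1, 1, 1, 0], thinking_tokens=[5, 30, 800, 10], t_max=1000))
```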
For UnderthinkingBench, standard task accuracy (Acc) measures the fraction of correct answers on difficult reasoning challenges.
The benchmark’s principal score, termed the “optimal-thinking F₁”, jointly aggregates both as:

$$F_1 = \frac{2 \cdot \mathrm{AUC}_{\mathrm{OAA}} \cdot \mathrm{Acc}}{\mathrm{AUC}_{\mathrm{OAA}} + \mathrm{Acc}}.$$
This metric directly codifies the efficiency–performance trade-off, heavily penalizing models that excel in only one regime.
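A tiny helper makes the aggregation explicit; the function name is illustrative, and the inputs are the OverthinkingBench AUC and UnderthinkingBench accuracy expressed as fractions in [0, 1].

```python
def optimal_thinking_f1(auc_oaa: float, underthinking_acc: float) -> float:
    """Harmonic-mean-style combination: a model must score well on BOTH sub-benchmarks."""
    if auc_oaa + underthinking_acc == 0:
        return 0.0
    return 2 * auc_oaa * underthinking_acc / (auc_oaa + underthinking_acc)

# Strong efficiency but weak hard-task accuracy is heavily penalized:
print(optimal_thinking_f1(0.95, 0.20))  # ~0.33
```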
4. Experimental Findings
Extensive experiments with 33 models reveal fundamental limitations:
- Thinking models (chain-of-thought, reasoning-optimized) excel on UnderthinkingBench but frequently generate 100–1000+ thinking tokens on even trivial OverthinkingBench examples, degrading efficiency and increasing inference cost without accuracy gains.
- Non-thinking models produce concise outputs (few or no intermediate tokens), maximizing speed and minimizing cost on simple inputs, but systematically underperform smaller “thinking” models on UnderthinkingBench, failing on multi-step compositional queries.
- No current model displays near-optimal F₁ performance: only five models surpass 50% on the unified metric, with the closed-weight "o3" variant at 72.7% and GPT-OSS-120B at 62.5% among open models.
- The primary cause is the inability to adapt reasoning granularity based on input complexity—choices that optimize for one sub-benchmark predictably harm the other.
5. Approaches to Encouraging Optimal Thinking
Several prototypes are evaluated for their ability to foster optimal thinking:
| Method Type | Effect on OverthinkingBench | Effect on UnderthinkingBench | Net Outcome |
|---|---|---|---|
| Length-based Reward Shaping | Reduces token overuse | Degrades reasoning accuracy | No net F₁ gain |
| Router (difficulty predictor) | Improves efficiency on easy queries | May skip reasoning on hard queries | Gains offset by accuracy decline |
| Explicit Prompt Strategies | Some efficiency gains | Some accuracy decrease on hard tasks | Similar trade-off |
Although all approaches yield some improvement in one direction, performance on the complementary sub-benchmark suffers. This suggests the absence of a “silver bullet”: simple modifications aimed at matching reasoning depth to task difficulty are insufficient under current architectures and training setups.
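As an illustration of the router approach from the table above, the sketch below uses a hypothetical heuristic difficulty score to decide between a thinking and a non-thinking model call; `heuristic_difficulty`, `route`, and the stub model callables are assumptions for exposition, not the evaluated prototype.

```python
from typing import Callable

def heuristic_difficulty(query: str) -> float:
    """Toy difficulty score; a real router would train a classifier on held-out data."""
    hard_markers = ("prove", "graph", "algorithm", "shortest path", "minimum number")
    score = 0.2 + 0.2 * sum(marker in query.lower() for marker in hard_markers)
    return min(score, 1.0)

def route(query: str,
          think_fn: Callable[[str], str],
          direct_fn: Callable[[str], str],
          threshold: float = 0.5) -> str:
    """Send apparently hard queries to the thinking model and the rest to the direct one.
    As the table notes, a mis-calibrated threshold skips reasoning on genuinely hard queries."""
    if heuristic_difficulty(query) >= threshold:
        return think_fn(query)
    return direct_fn(query)

if __name__ == "__main__":
    answer = route("What is 17 + 25?",
                   think_fn=lambda q: f"[thinking model] {q}",
                   direct_fn=lambda q: f"[direct model] {q}")
    print(answer)  # routed to the direct model by the toy heuristic
```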
6. Implications and Future Directions
OptimalThinkingBench illuminates an architectural and training gap: dynamic reasoning allocation. Bridging this gap will require:
- Unified architectures or control mechanisms that predict the necessary degree of reasoning for each query and allocate computation accordingly.
- Multi-objective training frameworks that jointly optimize for low overthinking and high complex-task accuracy.
- Further advances in synthetic query generation, difficulty prediction, and answer validation to ensure robustness as model capabilities evolve.
The benchmark’s metrics and testbed will remain relevant as new models are developed, providing both a diagnostic and a direct optimization target for researchers attempting to realize truly “optimal thinking” in LLMs.
7. Significance for Research and Practice
OptimalThinkingBench establishes a systematic, interpretable standard for evaluating and developing LLMs that are both computationally efficient and robust in reasoning. By formalizing and jointly measuring overthinking and underthinking, it provides actionable guidance for model selection, routing, and new system design, ultimately aiming to unburden end users from the task of choosing "thinking" versus "non-thinking" variants. Its metrics allow quantitative benchmarking as new dynamic reasoning strategies emerge, thus advancing the broader goal of practical, cost-effective, and adaptive AI reasoning (Aggarwal et al., 18 Aug 2025).