OverthinkingBench Evaluation

Updated 19 August 2025
  • OverthinkingBench defines overthinking in LLMs as the generation of extraneous reasoning tokens on simple queries, serving as a measure of computational inefficiency.
  • It employs metrics such as Overthinking-Adjusted Accuracy (OAA) and the area under the OAA curve (AUC) to compare thinking and non-thinking model variants across 72 domains.
  • Empirical findings show that models generating minimal tokens achieve high efficiency on simple tasks, though they may underperform on complex reasoning challenges.

OverthinkingBench is a standardized sub-benchmark within the OptimalThinkingBench framework designed to evaluate and quantify overthinking in LLMs on simple user queries. Overthinking in this context refers to models generating unnecessary reasoning, redundant chain-of-thought (CoT) tokens, or excessive computation when answering queries for which concise, direct answers suffice. The benchmark provides detailed measurement criteria and supports comparative evaluation of both “thinking” and “non-thinking” model variants across 72 domains, and is typically paired with corresponding underthinking evaluations on more challenging tasks (Aggarwal et al., 18 Aug 2025).

1. Conceptual Background of OverthinkingBench

OverthinkingBench formalizes the assessment of LLMs’ tendency to allocate computational resources ineffectively on straightforward questions. Whereas “thinking models” yield lengthy CoTs and auxiliary tokens even for obvious factual, arithmetic, or basic procedural tasks, “non-thinking” models respond concisely and efficiently. The construction of OverthinkingBench queries ensures that correct performance can be achieved without extra reasoning—for instance, answering "What is 2+3?" should require minimal output.

The principal aim of OverthinkingBench is to highlight two phenomena:

  • Token Waste: Measuring how many unnecessary tokens a model produces beyond what is needed for a correct and concise answer.
  • Performance Penalty: Determining whether excessive reasoning actually degrades a model’s accuracy on simple queries.

This framework supports multi-dimensional analysis, complementing underthinking evaluations (accuracy and robustness on complex reasoning tasks) to encourage development of models that optimally allocate computational resources per task.

2. Structure, Data, and Domains

OverthinkingBench comprises thousands of simple queries distributed across 72 distinct domains. Representative categories include arithmetic, basic factual recall, common sense queries, unit conversions, and other tasks where ground-truth answers are unambiguously defined and can be achieved via direct decoding.

The datasets are curated so that non-thinking model variants exhibit high raw accuracy, whereas thinking models, by default, produce extended answer traces—chain-of-thought segments, intermediate steps, and associated “thinking tokens”—that are not required for the correct completion of the task.

For rigorous evaluation, the dataset design precludes any ambiguity in expected output and penalizes generation patterns that are not strictly necessary. This ensures that observed overthinking arises from model architecture and decoding strategies, rather than from inherent data complexity.

3. Overthinking-Adjusted Accuracy and Evaluation Metrics

Core to OverthinkingBench analysis is the Overthinking-Adjusted Accuracy (OAA) metric. For a token threshold $t$, OAA is defined as:

$$\text{OAA}_t = \frac{1}{n} \sum_{i=1}^n \left[ \text{Correctness}_i \cdot \mathbb{1}(\text{ThinkTokens}_i < t) \right]$$

where $\text{Correctness}_i$ indicates whether the $i$-th sample is answered correctly, and $\text{ThinkTokens}_i$ is the count of reasoning (CoT) tokens used by the model. The indicator function enforces a penalty for excessive token generation above the threshold $t$.
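
As a concrete illustration, here is a minimal Python sketch of this computation, assuming per-query correctness flags and reasoning-token counts have already been collected (the function and variable names are illustrative, not from the benchmark's released code):

    def oaa_at_threshold(correct, think_tokens, t):
        """Overthinking-Adjusted Accuracy at token threshold t.

        correct: list of booleans, one per query (answer judged correct or not).
        think_tokens: list of reasoning-token counts, one per query.
        """
        n = len(correct)
        # A query contributes to OAA_t only if it is answered correctly
        # AND its reasoning-token count stays under the threshold.
        return sum(1 for c, k in zip(correct, think_tokens) if c and k < t) / n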

For comprehensive assessment, OverthinkingBench reports the Area Under the OAA Curve (AUC):

$$\text{AUC}_{\text{OAA}} = \int_0^{t_{\text{max}}} \frac{\text{OAA}_t}{t_{\text{max}}} \, \mathrm{d}t \approx \sum_{t=0}^{t_{\text{max}}} \frac{\text{OAA}_t}{t_{\text{max}}}$$

This metric integrates performance over a range of token thresholds, yielding a robust measure of both correctness and efficiency.
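
Continuing the same illustrative sketch, the discrete approximation above corresponds to summing OAA over integer thresholds and normalizing by t_max:

    def auc_oaa(correct, think_tokens, t_max):
        """Approximate AUC_OAA: sum OAA_t over thresholds 0..t_max, normalized by t_max."""
        return sum(oaa_at_threshold(correct, think_tokens, t)
                   for t in range(t_max + 1)) / t_max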

Evaluation involves running each model on all OverthinkingBench queries, collecting response tokens, and computing OAA at multiple threshold values. Results are visualized as AUC curves contrasting non-thinking, overthinking, and ideal models.

4. Empirical Findings and Model Comparison

Extensive testing with 33 model variants, including proprietary and open-weight architectures, reveals that:

  • Thinking models routinely overthink on simple queries. These models generate hundreds of unnecessary chain-of-thought tokens, leading to substantial decreases in AUC—even as their raw accuracy remains high (Aggarwal et al., 18 Aug 2025).
  • Non-thinking models excel on OverthinkingBench, returning concise responses and achieving consistently high AUC scores. However, these same models often underperform on complex, multi-step reasoning tasks (assessed in UnderthinkingBench).
  • No current model achieves optimal thinking. Only a handful (roughly 5 of 33) reach an overall OptimalThinkingBench score above 50%. The proprietary model o3 scores 72.7%, and GPT-OSS-120B leads the open-weight models at 62.5%, but none balances overthinking and underthinking ideally.

A plausible implication is that high reasoning capacity does not confer efficiency on simple questions, necessitating further architectural and training innovations.

5. Methods for Mitigating Overthinking

Efforts to mitigate overthinking include:

  • Length-based reward shaping: modifies RL objectives to penalize longer CoTs (a minimal sketch follows this list).
  • Prompting: introduction of explicit directives such as "Don’t Overthink" in prompts. Experiments show token reductions of up to 29%, with only modest changes in accuracy.
  • Task-adaptive routing: deployment of a router module that selects between thinking and non-thinking response modes depending on input complexity (a routing sketch appears below). The authors report improved results but scores that remain well below a hypothetical oracle router.
  • Comparisons across methods show that while efficiency can be improved, gains on OverthinkingBench often incur a penalty on UnderthinkingBench, indicating a trade-off between efficiency and thoroughness.
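
A minimal sketch of the length-based reward shaping idea from the first item above; the token budget and per-token penalty are illustrative assumptions, not parameters reported in the paper:

    def shaped_reward(base_reward, think_tokens, budget=256, penalty=0.001):
        """Length-shaped RL reward: keep the task reward, subtract a cost for long CoTs.

        base_reward: task reward, e.g. 1.0 for a correct answer, 0.0 otherwise.
        budget: reasoning tokens allowed before any penalty applies (hypothetical value).
        penalty: per-token cost beyond the budget (hypothetical value).
        """
        excess = max(0, think_tokens - budget)
        return base_reward - penalty * excess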

These results suggest that controlling overthinking is non-trivial, as naïve suppression of reasoning tokens may compromise reasoning on genuinely complex queries.
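
The task-adaptive routing idea can be sketched in the same spirit; the complexity estimator and the two generation callables below are hypothetical stand-ins, not the router module evaluated in the paper:

    def route_and_answer(query, estimate_complexity, generate_thinking, generate_direct,
                         threshold=0.5):
        """Dispatch a query to a thinking or non-thinking mode based on estimated complexity.

        estimate_complexity: callable returning a score in [0, 1] (hypothetical).
        generate_thinking / generate_direct: callables producing long-CoT vs. concise answers.
        """
        if estimate_complexity(query) >= threshold:
            return generate_thinking(query)  # allow extended chain-of-thought
        return generate_direct(query)        # answer concisely, without reasoning tokens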

6. Contextual Role in Unified Benchmarks and Future Directions

OverthinkingBench is paired with UnderthinkingBench within the OptimalThinkingBench framework to provide a unified assessment of both extremes: wasted computation on simple queries and insufficient reasoning on challenging tasks. The framework’s F₁-metric between overthinking AUC and underthinking accuracy highlights that model selection and optimization must consider both axes.
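
Assuming the standard harmonic-mean form of F₁ (the framework's exact weighting is not restated here), the combined score can be written as:

$$F_1 = \frac{2 \cdot \text{AUC}_{\text{OAA}} \cdot \text{Acc}_{\text{under}}}{\text{AUC}_{\text{OAA}} + \text{Acc}_{\text{under}}}$$

where $\text{Acc}_{\text{under}}$ denotes accuracy on UnderthinkingBench.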

The current state of research underscores significant room for improvement. Future directions recommended include:

  • Unified model architectures with dynamic reasoning adaptation.
  • Improved routers for efficient mode selection.
  • Advanced prompt engineering and dynamic reward optimization.
  • Better conflict resolution between competing objectives of efficiency and accuracy.

By formalizing overthinking evaluation and pairing it with rigorous efficiency-adjusted metrics, OverthinkingBench guides the research community toward LLMs that are not only highly accurate but also optimally efficient in their reasoning allocation.

7. Relevant Formulas and Visualizations

Key formulas for OverthinkingBench:

  • Overthinking-Adjusted Accuracy:

$$\text{OAA}_t = \frac{1}{n} \sum_{i=1}^n \left[ \text{Correctness}_i \cdot \mathbb{1}(\text{ThinkTokens}_i < t) \right]$$

  • Area Under Curve:

$$\text{AUC}_{\text{OAA}} = \int_0^{t_{\text{max}}} \frac{\text{OAA}_t}{t_{\text{max}}} \, \mathrm{d}t \approx \sum_{t=0}^{t_{\text{max}}} \frac{\text{OAA}_t}{t_{\text{max}}}$$

The accompanying AUC figure contrasts AUC values for non-thinking, overthinking, and ideal model types, illustrating the penalty imposed by excessive reasoning on simple queries.


OverthinkingBench thus serves a crucial role in benchmarking, quantifying, and ultimately mitigating wasted computational effort in LLMs, especially for queries where efficient, minimal computation should be sufficient. Its integration with underthinking assessments sets a comprehensive standard for future model evaluation and development, emphasizing the need for truly optimal thinking in artificial intelligence systems (Aggarwal et al., 18 Aug 2025).
