Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2503.16419v3)

Published 20 Mar 2025 in cs.CL

Abstract: LLMs have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small LLMs, and discuss evaluation methods and benchmarking.

Summary

  • The paper introduces a categorization of efficient reasoning techniques in LLMs to mitigate overthinking and lower inference costs.
  • It evaluates model-based, reasoning output-based, and input prompt-based methods, emphasizing RL reward designs and CoT compression.
  • The survey highlights practical applications and benchmarks that balance computational efficiency with reasoning accuracy in real-world deployments.

This paper, "Stop Overthinking: A Survey on Efficient Reasoning for LLMs" (2503.16419), provides a comprehensive overview of techniques aimed at making the reasoning processes of LLMs more computationally efficient without sacrificing accuracy. It addresses the "overthinking phenomenon" where models like OpenAI o1 and DeepSeek-R1, while capable of complex reasoning using Chain-of-Thought (CoT), generate excessively long and redundant reasoning steps, leading to high inference costs and latency.

The survey categorizes efficient reasoning methods into three main areas:

  1. Model-based Efficient Reasoning: Focuses on modifying the model itself.
    • RL with Length Reward Design: Integrates penalties for reasoning length into the Reinforcement Learning (RL) reward function during training. Different methods propose various formulations for this length reward, typically balancing it against accuracy (e.g., O1-Pruner (2501.12570), Kimi k1.5 (2501.12599), L1 (2503.04697), Demystifying (2502.03373), DAST (2503.04472)); optimization often uses PPO or SimPO. A minimal sketch of such a reward appears after this list.
    • SFT with Variable-Length CoT Data: Fine-tunes models using Supervised Fine-Tuning (SFT) on datasets containing CoT examples of varying lengths, particularly shorter, concise reasoning paths. This involves methods to generate short CoT data (e.g., sampling the shortest correct paths, using LLMs as compressors, interpretation-driven step skipping, token budget constraints) and applying standard or progressive fine-tuning (e.g., Self-Training (2502.20122), C3oT (2412.11664), TokenSkip (2502.12067), CoT-Valve (2502.09601), Learn to Skip (2411.01855)). A data-construction sketch follows the list.
  2. Reasoning Output-based Efficient Reasoning: Modifies the generation process during inference.
    • Compressing Reasoning Steps into Latent Representations: Replaces explicit textual reasoning steps with more compact, non-textual latent representations (hidden states or learned tokens). This can involve training the LLM to use these latent steps (e.g., Coconut (2412.06769), CODI (2502.21074), CCOT (2412.13171), Heima (2501.19201), Token Assorted (2502.03275), Looped Transformers (2502.17416)) or using auxiliary modules while keeping the main LLM frozen (e.g., SoftCoT (2502.12134)). A toy latent-step loop is sketched after this list.
    • Dynamic Reasoning Paradigm during Inference: Adapts the reasoning strategy or resource allocation at inference time based on dynamic criteria. This includes reward-guided methods (Speculative Rejection (2410.20290), RSD (2501.19324)), confidence/certainty-based approaches (DPTS (2502.16235), Certaindex (2412.20993), FastMCTS (2502.11476), Length-filtered Vote (2502.07266)), consistency-based selection (ST-BoN (2503.01422)), and summarization techniques that condense intermediate steps (LightThinker (2502.15589), InftyThink (2503.06692)). An early-stopping sketch follows the list.
  3. Input Prompts-based Efficient Reasoning: Leverages characteristics of the input prompt.
    • Prompt-guided Efficient Reasoning: Uses specific instructions within the prompt to steer the model toward shorter reasoning chains (e.g., TALE-EP (2412.18547), Chain of Draft (2502.18600), CCoT (Renze & Guven, 2024), and the token-complexity analysis (2503.01141)). An illustrative prompt template follows the list.
    • Routing by Question Attributes: Directs input queries to different models or reasoning paths based on estimated difficulty or uncertainty. This can rely on undisclosed criteria in proprietary systems (e.g., Claude 3.7 Sonnet), trained classifiers (RouteLLM (2406.18665), SoT (2503.05179)), or intrinsic uncertainty metrics (Self-Ref (2410.13284), Confident or Seek Stronger (2502.04428)). A dispatch sketch follows the list.
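
To make the length-reward idea concrete, here is a minimal sketch of a reward that combines answer correctness with a normalized length penalty. The penalty shape, reference length, and weight `lam` are illustrative assumptions; each surveyed method (O1-Pruner, Kimi k1.5, L1, DAST) uses its own formulation.

```python
def length_penalized_reward(
    is_correct: bool,
    num_tokens: int,
    ref_tokens: int = 512,  # reference CoT length (assumed hyperparameter)
    lam: float = 0.1,       # weight of the length penalty (assumed)
) -> float:
    """Illustrative reward: accuracy term minus a normalized length penalty.

    This only shows the common structure shared by length-reward methods;
    it is not the exact formula of any single paper.
    """
    accuracy_reward = 1.0 if is_correct else 0.0
    # Penalize only the tokens beyond the reference length, so short
    # correct traces are not penalized at all.
    length_penalty = lam * max(0.0, (num_tokens - ref_tokens) / ref_tokens)
    return accuracy_reward - length_penalty

# A correct 700-token trace scores lower than a correct 300-token trace:
print(length_penalized_reward(True, 700))  # ~0.963
print(length_penalized_reward(True, 300))  # 1.0
```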
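
The "sample the shortest correct path" recipe for building short-CoT SFT data can be sketched as below; `generate_cot` and `check_answer` are hypothetical stand-ins for a sampling call to a reasoning model and an answer verifier.

```python
from typing import Callable, Optional, Tuple

def shortest_correct_cot(
    question: str,
    gold_answer: str,
    generate_cot: Callable[[str], Tuple[str, str]],  # hypothetical: returns (cot, answer)
    check_answer: Callable[[str, str], bool],        # hypothetical answer verifier
    n_samples: int = 8,
) -> Optional[str]:
    """Sample several CoTs and keep the shortest one with a correct answer.

    One common way, per the survey, to produce concise CoT training data;
    real pipelines add deduplication, formatting, and quality filters.
    """
    best = None
    for _ in range(n_samples):
        cot, answer = generate_cot(question)
        if check_answer(answer, gold_answer) and (best is None or len(cot) < len(best)):
            best = cot
    return best  # None if no sampled CoT was correct
```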
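
The latent-reasoning idea can be illustrated with a toy model: instead of decoding intermediate reasoning as text tokens, the last hidden state is fed back as the next input embedding for a few "thought" steps before the answer is decoded. This is a conceptual toy (a GRU stands in for a transformer stack), not the training procedure of Coconut or its successors.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Toy stand-in for an LLM's embedding layer, backbone, and LM head."""
    def __init__(self, vocab: int = 100, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.core = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

model = ToyLM()
prompt_ids = torch.tensor([[1, 5, 9]])     # toy encoded question
out, h = model.core(model.embed(prompt_ids))

# Latent "thought" steps: feed the last hidden state back as the next
# input embedding, so intermediate reasoning never becomes text tokens.
latent = out[:, -1:, :]
for _ in range(4):
    latent, h = model.core(latent, h)

answer_logits = model.head(latent[:, -1])  # decode only the final answer
print(answer_logits.argmax(-1))
```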
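
For the confidence/consistency-based dynamic inference family, a generic sketch is self-consistency voting that stops sampling once one answer clearly dominates, spending fewer samples on easy questions. `sample_answer` is a hypothetical call returning one full sampled answer, and the thresholds are assumptions; this is not a specific published algorithm.

```python
from collections import Counter
from typing import Callable

def early_stop_vote(
    sample_answer: Callable[[], str],  # hypothetical: one sampled final answer
    max_samples: int = 16,
    agree_threshold: float = 0.75,     # assumed stopping threshold
    min_samples: int = 4,
) -> str:
    """Majority voting with early termination on strong agreement."""
    votes: Counter = Counter()
    for i in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        answer, count = votes.most_common(1)[0]
        if i >= min_samples and count / i >= agree_threshold:
            return answer  # early exit: consensus reached
    return votes.most_common(1)[0][0]  # fall back to plain majority
```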
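
Prompt-guided length control needs no training at all: a budget instruction is simply added to the question. The template below is illustrative, in the spirit of TALE-EP's token budgets and Chain of Draft's terse step notes; the exact wording in those papers differs.

```python
def budgeted_prompt(question: str, token_budget: int = 50) -> str:
    # Illustrative wording only; not the exact prompt from any paper.
    return (
        f"{question}\n"
        f"Think step by step, but keep your reasoning within about "
        f"{token_budget} tokens: write only a minimal draft note per step, "
        f"then give the final answer after '####'."
    )

print(budgeted_prompt("What is 17 * 24?"))
```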
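
Routing reduces cost at the system level rather than the model level: easy queries go to a cheap model, hard ones to a long-CoT reasoning model. The sketch below shows only the dispatch structure; `difficulty_score` is a hypothetical placeholder for a trained router (as in RouteLLM) or an intrinsic uncertainty metric.

```python
from typing import Callable

def route_query(
    question: str,
    difficulty_score: Callable[[str], float],  # hypothetical score in [0, 1]
    cheap_model: Callable[[str], str],         # e.g., a small non-reasoning LLM
    reasoning_model: Callable[[str], str],     # e.g., a long-CoT model
    threshold: float = 0.5,                    # assumed routing threshold
) -> str:
    """Dispatch a query by estimated difficulty."""
    if difficulty_score(question) < threshold:
        return cheap_model(question)   # cheap path for easy questions
    return reasoning_model(question)   # expensive path for hard ones
```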

The survey also covers related topics:

  • Efficient Data and Models: Training reasoning models effectively with less, higher-quality data (LIMO (2502.03387), s1 (2501.19393)), and enabling reasoning in small LLMs (SLMs) through distillation techniques (mixed, counterfactual, feedback-driven) and model compression (the survey finds quantization preserves reasoning ability better than pruning). A generic distillation-loss sketch follows this list.
  • Evaluation and Benchmarks: Discusses benchmarks like Sys2Bench (2502.12521) for evaluating diverse reasoning tasks, frameworks for measuring the "overthinking" phenomenon (2502.08235), and research on compute-optimal test-time scaling strategies (2502.06703).
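
Many SLM distillation variants share the classic teacher-student setup; below is a standard knowledge-distillation loss for illustration. Note that several of the surveyed approaches instead fine-tune the small model directly on teacher-generated CoT traces; the mixed, counterfactual, and feedback-driven variants differ in how that training data is constructed, which this sketch does not cover.

```python
import torch.nn.functional as F
from torch import Tensor

def distill_loss(
    student_logits: Tensor,  # (batch, vocab)
    teacher_logits: Tensor,  # (batch, vocab)
    labels: Tensor,          # (batch,) gold next-token ids
    T: float = 2.0,          # softening temperature (assumed)
    alpha: float = 0.5,      # CE/KL mixing weight (assumed)
) -> Tensor:
    """Standard distillation: cross-entropy on gold labels plus KL
    divergence to the teacher's temperature-softened distribution."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl
```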

Finally, the paper touches upon applications in areas like autonomous driving, embodied AI, and healthcare, and discusses broader challenges such as the trade-off between safety and efficiency, and the relative merits of RL versus SFT for achieving efficient reasoning. It concludes by emphasizing the practical importance and economic value of developing efficient reasoning capabilities in LLMs.