- The paper introduces a categorization of efficient reasoning techniques in LLMs to mitigate overthinking and lower inference costs.
- It evaluates model-based, reasoning output-based, and input prompt-based methods, emphasizing RL reward designs and CoT compression.
- The survey highlights practical applications and benchmarks that balance computational efficiency with reasoning accuracy in real-world deployments.
This paper, "Stop Overthinking: A Survey on Efficient Reasoning for LLMs" (2503.16419), provides a comprehensive overview of techniques aimed at making the reasoning processes of LLMs more computationally efficient without sacrificing accuracy. It addresses the "overthinking phenomenon" where models like OpenAI o1 and DeepSeek-R1, while capable of complex reasoning using Chain-of-Thought (CoT), generate excessively long and redundant reasoning steps, leading to high inference costs and latency.
The survey categorizes efficient reasoning methods into three main areas:
- Model-based Efficient Reasoning: Focuses on modifying the model itself, typically through training.
  - RL with Length Reward Design: Integrates penalties for reasoning length into the Reinforcement Learning (RL) reward function during training. Different methods propose various formulations for this length reward, often balancing it against accuracy (e.g., O1-Pruner (2501.12570), Kimi k1.5 (2501.12599), L1 (2503.04697), Demystifying (2502.03373), DAST (2503.04472)). Optimization often uses PPO or SimPO; a toy reward formulation is sketched after this list.
  - SFT with Variable-Length CoT Data: Fine-tunes models using Supervised Fine-Tuning (SFT) on datasets containing CoT examples of varying lengths, particularly shorter, concise reasoning paths. This involves methods for generating short CoT data (e.g., sampling the shortest correct paths, using LLMs as compressors, interpretation-driven step skipping, token budget constraints) and applying standard or progressive fine-tuning (e.g., Self-Training (2502.20122), C3oT (2412.11664), TokenSkip (2502.12067), CoT-Valve (2502.09601), Learn to Skip (2411.01855)); a data-construction sketch also follows the list.
- Reasoning Output-based Efficient Reasoning: Modifies the generation process during inference.
  - Compressing Reasoning Steps into Latent Representation: Replaces explicit textual reasoning steps with more compact, non-textual latent representations (hidden states or learned tokens). This can involve training the LLM to use these latent steps (e.g., Coconut (2412.06769), CODI (2502.21074), CCOT (2412.13171), Heima (2501.19201), Token Assorted (2502.03275), Looped Transformers (2502.17416)) or using auxiliary modules while keeping the main LLM frozen (e.g., SoftCoT (2502.12134)).
  - Dynamic Reasoning Paradigm during Inference: Adapts the reasoning strategy or resource allocation at inference time based on dynamic criteria. This includes reward-guided methods (Speculative Rejection (2410.20290), RSD (2501.19324)), confidence/certainty-based approaches (DPTS (2502.16235), Certaindex (2412.20993), FastMCTS (2502.11476), Length-filtered Vote (2502.07266)), consistency-based selection (ST-BoN (2503.01422)), and summarization techniques that condense intermediate steps (LightThinker (2502.15589), InftyThink (2503.06692)).
- Input Prompts-based Efficient Reasoning: Leverages characteristics of the input prompt.
  - Prompt-guided Efficient Reasoning: Uses explicit instructions within the prompt to guide the model toward shorter reasoning chains (e.g., TALE-EP (2412.18547), Chain of Draft (2502.18600), CCoT (Renze et al., 2024), and work on token complexity (2503.01141)); a budget-prompt sketch appears after this list.
  - Routing by Question Attributes: Directs input queries to different models or reasoning paths based on estimated difficulty or uncertainty. Routing can rely on undisclosed proprietary criteria (Claude 3.7 Sonnet), trained classifiers (RouteLLM (2406.18665), SoT (2503.05179)), or intrinsic uncertainty metrics (Self-Ref (2410.13284), Confident or Seek Stronger (2502.04428)); see the routing sketch after this list.
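To make the length-reward idea under Model-based Efficient Reasoning concrete, here is a minimal sketch of one possible reward shaping. The function, the `alpha` coefficient, and the linear normalization are illustrative assumptions, not the exact formulation of any surveyed method (which often compare against reference lengths or user-specified budgets):

```python
def length_penalized_reward(is_correct: bool, num_tokens: int,
                            max_tokens: int = 4096, alpha: float = 0.2) -> float:
    """Toy reward: correctness minus a penalty that grows with CoT length.

    alpha trades accuracy against brevity; normalizing by max_tokens keeps
    the penalty in [0, alpha]. Surveyed methods (O1-Pruner, L1, DAST, ...)
    use richer formulations, e.g. comparing against a reference length.
    """
    accuracy_reward = 1.0 if is_correct else 0.0
    length_penalty = alpha * min(num_tokens, max_tokens) / max_tokens
    return accuracy_reward - length_penalty


# A correct 600-token solution scores higher than a correct 3000-token one,
# nudging the policy toward concise reasoning during RL (e.g., PPO).
print(length_penalized_reward(True, 600))    # ~0.97
print(length_penalized_reward(True, 3000))   # ~0.85
```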
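Similarly, the generic "sample several solutions and keep the shortest correct one" recipe for building variable-length CoT SFT data can be sketched as follows; `model_generate` and the data layout are assumptions for illustration, not the actual pipelines of C3oT, TokenSkip, or CoT-Valve:

```python
def build_concise_sft_data(problems, model_generate, n_samples: int = 8):
    """Best-of-N sampling that keeps the shortest *correct* CoT per problem.

    model_generate(question) is assumed to return (cot_text, answer);
    each problem dict is assumed to hold "question" and a gold "answer".
    """
    sft_examples = []
    for problem in problems:
        candidates = [model_generate(problem["question"]) for _ in range(n_samples)]
        correct = [(cot, ans) for cot, ans in candidates if ans == problem["answer"]]
        if not correct:
            continue  # no usable trace for this problem; skip it
        cot, ans = min(correct, key=lambda pair: len(pair[0].split()))
        sft_examples.append({
            "prompt": problem["question"],
            "response": f"{cot}\nAnswer: {ans}",
        })
    return sft_examples
```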
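For prompt-guided efficient reasoning, a token-budget instruction can be as simple as the template below; the wording is illustrative rather than the exact prompt used by TALE-EP or Chain of Draft:

```python
# Illustrative budget-constrained prompt template (wording is assumed, not
# taken verbatim from any surveyed method).
BUDGET_PROMPT = (
    "Solve the problem step by step, but keep your reasoning within "
    "{budget} tokens. Use short drafts for each step, then give the final "
    "answer as 'Answer: <value>'.\n\nProblem: {question}"
)

def make_budgeted_prompt(question: str, budget: int = 100) -> str:
    return BUDGET_PROMPT.format(budget=budget, question=question)

print(make_budgeted_prompt(
    "A train travels 120 km in 2 hours. What is its average speed?"))
```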
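Finally, routing by question attributes reduces to a thresholding decision once a difficulty or uncertainty score is available. The threshold, model identifiers, and the trivial word-count scorer below are placeholders; real routers such as RouteLLM train the scoring classifier instead:

```python
def route_query(question: str, score_difficulty, threshold: float = 0.5) -> str:
    """Send easy queries to a cheap model and hard ones to a reasoning model.

    score_difficulty(question) is assumed to return a value in [0, 1], e.g.
    from a trained classifier or an intrinsic uncertainty estimate.
    """
    if score_difficulty(question) < threshold:
        return "small-fast-model"       # no long CoT needed
    return "large-reasoning-model"      # full chain-of-thought reasoning

# Trivial stand-in scorer: treat longer questions as harder.
toy_scorer = lambda q: min(len(q.split()) / 20, 1.0)
print(route_query("What is 2 + 2?", toy_scorer))                       # small-fast-model
print(route_query("Prove that the sum of the first n odd numbers is "
                  "n^2, for all n.", toy_scorer))                       # large-reasoning-model
```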
The survey also covers related topics:
- Efficient Data and Models: Training reasoning models effectively with smaller amounts of high-quality data (LIMO (2502.03387), s1 (2501.19393)), and enabling reasoning in small language models (SLMs) through distillation techniques (mixed, counterfactual, feedback-driven) and model compression, where quantization is found to preserve reasoning ability better than pruning; a quantized-loading sketch follows this list.
- Evaluation and Benchmarks: Discusses benchmarks like Sys2Bench (2502.12521) for evaluating diverse reasoning tasks, frameworks for measuring the "overthinking" phenomenon (2502.08235), and research on compute-optimal test-time scaling strategies (2502.06703).
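As a rough illustration of the compression point above, the snippet below loads a small model in 4-bit with Hugging Face Transformers and bitsandbytes; the model name is a placeholder, and the survey's quantization-versus-pruning comparison does not prescribe this particular setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes; the survey reports quantization as
# preserving reasoning ability better than pruning at comparable budgets.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```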
Finally, the paper touches upon applications in areas like autonomous driving, embodied AI, and healthcare, and discusses broader challenges such as the trade-off between safety and efficiency, and the relative merits of RL versus SFT for achieving efficient reasoning. It concludes by emphasizing the practical importance and economic value of developing efficient reasoning capabilities in LLMs.