
Reasoning Models Can Be Effective Without Thinking (2504.09858v1)

Published 14 Apr 2025 in cs.AI and cs.CL

Abstract: Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on AMC 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.

Summary

  • The paper demonstrates that bypassing the detailed chain-of-thought via NoThinking cuts token usage by a factor of 2-5 while maintaining strong task performance.
  • The evaluation uses budget forcing to enable fair token-matched comparisons, highlighting NoThinking's advantages in low-budget settings and its suitability for parallel scaling.
  • Experimental results reveal that NoThinking achieves competitive accuracy with significantly lower latency, challenging the need for extensive reasoning steps.

This paper investigates whether the explicit, lengthy "Thinking" process commonly generated by state-of-the-art reasoning LLMs is truly necessary for achieving high performance on complex reasoning tasks. Models like DeepSeek-R1 are often trained to produce a detailed chain-of-thought, including reflection and backtracking, before outputting the final solution. While this improves reasoning, it significantly increases token usage and latency.

The authors propose a simple prompting technique called NoThinking, which bypasses the explicit reasoning step. Instead of letting the model generate its detailed thought process, the NoThinking method prefills the model's response with a minimal, fabricated thinking block, forcing it to proceed directly to generating the final solution steps and answer.

Implementation of NoThinking

The core idea is to modify the prompt structure. Standard "Thinking" models are trained to produce responses with a structure like:

<|beginning_of_thinking|>
... detailed reasoning steps ...
<|end_of_thinking|>
... final solution steps ...
Final Answer: ...

The NoThinking approach uses a prompt that effectively skips the detailed reasoning:

<|beginning_of_thinking|>
Okay, I think I have finished thinking.
<|end_of_thinking|>
... model generates final solution steps ...
Final Answer: ...

This is achieved by providing the dummy thinking block as part of the assistant's prefilled response during inference.
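
To make this concrete, below is a minimal Python sketch of the prefill, assuming a raw text-completion interface where the assistant turn can be seeded directly. The <|user|>/<|assistant|> role markers and helper names are illustrative assumptions; a real deployment would apply the model tokenizer's own chat template and append the dummy block to it.

THINK_OPEN = "<|beginning_of_thinking|>"
THINK_CLOSE = "<|end_of_thinking|>"

# The fixed dummy thinking block from the paper's template: it satisfies the
# model's expected output structure while skipping the long reasoning phase.
NOTHINKING_PREFILL = (
    f"{THINK_OPEN}\nOkay, I think I have finished thinking.\n{THINK_CLOSE}\n"
)

def build_nothinking_prompt(question: str) -> str:
    """Prefill the assistant turn with the dummy thinking block so that
    generation starts directly at the final solution steps."""
    # <|user|> / <|assistant|> are placeholder role markers here, not any
    # particular model's real chat template.
    return f"<|user|>{question}<|assistant|>{NOTHINKING_PREFILL}"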

Budget Forcing for Fair Comparison

To compare NoThinking against standard Thinking under controlled conditions, especially regarding computational cost, the paper employs a budget forcing technique. When generating a response (for either Thinking or NoThinking), a maximum token limit (max_tokens) is set. If the model hits this limit before naturally finishing:

  • If generation is cut off inside the thinking block (only applicable to Thinking), <|end_of_thinking|> is appended to close it.
  • The prompt Final Answer: (or an appropriate delimiter for coding tasks, such as a code fence) is then appended to force the model to produce a direct answer.

This allows comparing the methods at similar average token counts.
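
A minimal Python sketch of this truncation recipe, assuming the serving loop hands back the raw text produced under a hard token cap (the helper name is illustrative):

THINK_CLOSE = "<|end_of_thinking|>"

def force_budget(partial_output: str) -> str:
    """Close out a generation that hit max_tokens, per the paper's recipe."""
    # If the cap was hit inside the thinking block (only possible for the
    # Thinking baseline, since NoThinking's prefill already closes it),
    # close the block first.
    if THINK_CLOSE not in partial_output:
        partial_output += "\n" + THINK_CLOSE
    # Then append the answer cue; one short continuation pass from here
    # yields the forced final answer.
    return partial_output + "\nFinal Answer:"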

Experimental Setup

  • Model: Primarily DeepSeek-R1-Distill-Qwen-32B, a model specifically trained for structured reasoning. Qwen-32B-Instruct (the base model without specific reasoning training) serves as a baseline. Smaller R1 models (7B, 14B) were also tested.
  • Benchmarks: A diverse set including:
    • Math: AIME 2024/2025, AMC 2023, OlympiadBench (Math subset)
    • Coding: LiveCodeBench
    • Theorem Proving: MiniF2F, ProofNet
  • Metrics: Pass@k (probability of getting at least one correct answer in k samples), average token usage, and latency (for parallel scaling experiments).
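
For reference, pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021); a minimal sketch, assuming n total samples of which c are correct:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct."""
    # Probability that a random size-k subset of the n samples
    # contains at least one of the c correct ones.
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)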

Key Findings

  1. Effectiveness of NoThinking:
    • Without Budget Control: NoThinking uses 2x-5x fewer tokens than Thinking. On theorem proving tasks (MiniF2F, ProofNet), NoThinking achieves comparable performance to Thinking across all values of k. On other tasks, NoThinking often lags behind Thinking at pass@1 but catches up or even surpasses it as k increases (e.g., at pass@64), indicating it generates a diverse set of potentially correct answers.
    • With Controlled Budget: When controlling for token count using budget forcing, NoThinking consistently outperforms Thinking, particularly in low-budget regimes (e.g., < 3000 tokens). The advantage of NoThinking becomes more pronounced as k increases. For pass@1, NoThinking is better at low budgets, while Thinking can be better at high budgets on some tasks.
    • Pareto Efficiency: Accuracy-vs-token plots show NoThinking achieves a better Pareto frontier (better trade-off) than Thinking, especially for pass@k with k > 1.
  2. Parallel Scaling Performance:
    • NoThinking's strength at higher k values makes it suitable for parallel scaling: generating N responses in parallel and using a selection method (best-of-N).
    • Selection Methods:
      • For tasks with perfect verifiers (MiniF2F and ProofNet, checked with the Lean compiler), any sample that passes the verifier is selected.
      • For tasks without verifiers (math, coding), simple strategies are used, such as confidence-based selection (ranking by the model's self-certainty score with Borda voting) or majority voting; a simplified selection sketch follows this list.
    • Results:
      • With Verifiers: NoThinking with parallel scaling matches the pass@1 accuracy of full sequential Thinking but achieves this with 7x lower latency and 4x fewer total tokens.
      • Without Verifiers: NoThinking combined with parallel scaling and confidence-based selection significantly outperforms Thinking (even Thinking with parallel scaling) when compared at similar latency levels. On OlympiadBench, it achieved higher pass@1 accuracy than full sequential Thinking with 9x lower latency.
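
As a simplified illustration of verifier-free best-of-N selection, the sketch below combines majority voting with a confidence-based tie-break. The paper's actual method ranks samples by self-certainty with Borda voting, so the scoring here is a stand-in:

from collections import Counter

def select_answer(samples: list[tuple[str, float]]) -> str:
    """Pick one final answer from N parallel NoThinking generations.

    samples: (extracted_answer, confidence) pairs, where confidence is a
    stand-in for the paper's self-certainty score (higher is better).
    """
    # Majority vote over the extracted answers...
    votes = Counter(answer for answer, _ in samples)
    top = max(votes.values())
    tied = {a for a, v in votes.items() if v == top}
    # ...breaking ties by total confidence mass.
    confidence = {a: 0.0 for a in tied}
    for answer, score in samples:
        if answer in tied:
            confidence[answer] += score
    return max(confidence, key=confidence.get)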

Practical Implications

  • Efficiency: NoThinking provides a way to achieve strong reasoning performance with significantly reduced computational cost (fewer tokens, lower latency) compared to standard lengthy chain-of-thought prompting, especially useful in resource-constrained settings.
  • Latency: By enabling effective parallel scaling, NoThinking allows for much faster inference times for complex reasoning tasks while maintaining or even improving accuracy compared to sequential Thinking methods.
  • Implementation Simplicity: The NoThinking method is easy to implement via simple prompt modification, requiring no changes to the model architecture or additional training.
  • Reconsidering Reasoning Mechanisms: The results challenge the assumption that explicit, long thinking chains are indispensable for high performance in reasoning models trained for this format. Models might possess reasoning capabilities that can be elicited more directly.

In conclusion, the paper demonstrates that bypassing the explicit "Thinking" phase in reasoning models via the "NoThinking" prompting strategy can be surprisingly effective. It offers substantial gains in efficiency (tokens, latency) and enables powerful parallel inference strategies, providing a competitive baseline for practical, low-budget, or low-latency reasoning applications.
