
Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Published 5 Aug 2025 in cs.CL | (2508.03363v4)

Abstract: Reasoning LLMs (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt with two different answers. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice and majority voting. Moreover, it achieves comparable in-distribution performance to training-based SOTA reasoning method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing the importance of structural thinking diversity and the benefits of consistency check. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

Summary

  • The paper introduces the JointThinking approach, integrating Thinking and Nothinking modes to improve reasoning accuracy by up to 6.21%.
  • It employs a robust dual-mode consistency check and a second-thinking step to mitigate overthinking and refine outputs.
  • Experiments demonstrate enhanced out-of-distribution performance and scalability, underscoring the practical benefits of calibration in LLMs.

Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning LLMs

In this essay, we explore the comprehensive methodology, implementation, and experimental validation of "Thinking with Nothinking Calibration," a paradigm for In-Context Learning (ICL) in Reasoning LLMs (RLLMs). This paradigm leverages the structural differences between two reasoning modes—Thinking and Nothinking—to enhance performance on reasoning tasks.

Introduction

Reasoning LLMs (RLLMs) such as DeepSeek-R1 and Qwen3 have become central to complex problem solving, largely due to their structured inference paradigms. However, these models frequently suffer from "overthinking," where excessive reasoning obscures otherwise simple solutions. The paper introduces a method termed JointThinking that generates answers in two modes in parallel, then uses a consistency check between them to decide whether further reasoning is needed.

Methodology

Thinking and Nothinking Modes

The JointThinking framework outlines two parallel reasoning approaches:

  1. Thinking Mode: This mode facilitates comprehensive reasoning by generating a detailed thought process from the problem statement.
  2. Nothinking Mode: Conversely, this mode seeks a solution directly by bypassing intermediate reasoning steps, reducing the risk of becoming unnecessarily entangled in elaborate reasoning.

The integration of these modes allows models to operate effectively across tasks of varying complexity by selecting the reasoning strategy appropriate to the problem at hand (Figure 1).

Figure 1: Comparison of R1-7B using Thinking and Nothinking on MATH500. Each mode reliably compensates for the other's failures across all difficulty levels.
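Concretely, the two modes can be invoked with different prompt templates. The sketch below assumes a Qwen3-style convention in which prefilling an empty `<think></think>` block suppresses the reasoning trace; the exact control tokens vary by model and are not specified in this summary, so treat both functions as illustrative.

```python
def thinking_prompt(question: str) -> str:
    # Thinking mode: leave the model free to emit its full reasoning trace.
    return f"Question: {question}\nAnswer:"

def nothinking_prompt(question: str) -> str:
    # Nothinking mode: prefill an empty think block so the model skips
    # intermediate reasoning and answers directly (Qwen3-style convention,
    # assumed here for illustration).
    return f"Question: {question}\nAnswer: <think>\n\n</think>"
```

Both prompts are then sampled in parallel from the same model, yielding one Thinking and one Nothinking candidate answer per question.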

Consistency Check and Second Thinking

The method implements a consistency check that compares the final answers from both modes. If they disagree, a second round of Thinking (termed Second Thinking) is triggered: the question is re-posed in a single prompt that includes both conflicting answers, prompting the model to reason more carefully before committing to a final answer (Figure 2).

Figure 2: Thinking with Nothinking Calibration (JointThinking). Given one question, the reasoning LLM generates two outputs in parallel.
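The calibration loop described above can be sketched as follows. The interface is hypothetical (the generators and answer extractor stand in for model calls), and the recheck prompt wording is an assumption, not the paper's actual prompt.

```python
from typing import Callable

def joint_thinking(question: str,
                   gen_thinking: Callable[[str], str],
                   gen_nothinking: Callable[[str], str],
                   extract_answer: Callable[[str], str]) -> str:
    """Sketch of the JointThinking loop: sample both modes in parallel;
    if the final answers agree, accept; otherwise run one round of
    Second Thinking that sees both candidate answers."""
    out_think = gen_thinking(question)        # Thinking-mode generation
    out_nothink = gen_nothinking(question)    # Nothinking-mode generation
    a1 = extract_answer(out_think)
    a2 = extract_answer(out_nothink)
    if a1 == a2:
        # Consistency check passed: no further reasoning needed.
        return a1
    # Second Thinking: a single prompt carrying both conflicting answers
    # (wording here is illustrative only).
    recheck = (f"{question}\n"
               f"Two candidate answers were produced: {a1} and {a2}. "
               f"They disagree; reason carefully and give the final answer.")
    return extract_answer(gen_thinking(recheck))
```

Because the second round fires only on disagreement, the extra cost is paid just for the questions where the two modes conflict, which is the source of the efficiency advantage over always re-thinking.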

Experiments and Results

The proposed paradigm was evaluated against existing methods such as few-shot Chain of Thought (CoT) and majority-voting baselines across diverse mathematical reasoning benchmarks. The results show that JointThinking consistently outperforms these methods, with gains on in-distribution tasks (e.g., GSM8K, MATH500) and notable improvements on out-of-distribution benchmarks.

Comparative Performance

Across several model sizes, JointThinking demonstrated robust effectiveness:

  • R1-7B and R1-14B Models: JointThinking achieved an average accuracy gain of 2.94% over the second-best baseline and 6.21% over single-pass Thinking.
  • Generalization: JointThinking achieved superior results on out-of-distribution generalization tasks, outperforming the SOTA AdaptThink method and demonstrating its breadth of applicability (Figure 3).

    Figure 3: Performance comparison between optional and always second thinking. The optional trigger setting yields better results with less computational overhead.

Calibration and Scalability

The consistency-based calibration substantially lowers error rates, demonstrating the benefit of cross-mode calibration. Scalability was also evident: increasing model size further narrows the gap between actual and ideal second-thinking performance, suggesting gains would continue for even larger models (Figure 4).

Figure 4: Scaling trend of second-thinking performance. The reduction in the performance gap from the ideal situation indicates the strong scalability of JointThinking.

Discussion

JointThinking introduces a compelling approach to mitigating the pitfalls of overthinking by harnessing direct answering when it suffices. The improvement from mixed-mode calibration hints at structural shortcomings in existing RLLMs, suggesting future exploration of approaches in which reasoning depth is optimized contextually.

Conclusion

JointThinking offers a distinctive strategy in RLLMs to tackle diverse reasoning challenges through a balanced approach, effectively combining detailed and direct reasoning modes. The methodology not only enhances current performance paradigms but also paves the way for future advancements in scalable, context-aware language systems, presenting a holistic improvement in computational reasoning capabilities.
