
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Published 11 Nov 2025 in cs.CL, cs.AI, cs.LG, and cs.PF | (2511.08577v1)

Abstract: Improving reasoning capabilities of LLMs, especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.

Summary

  • The paper introduces TaH, a dynamic iteration method that selectively refines hard tokens to mitigate latent overthinking.
  • It employs a neural decider and LoRA modules to focus iterations, leading to an 8.1–11.3% accuracy boost on reasoning benchmarks.
  • The findings highlight parameter-efficient improvements, making LLMs more effective for complex reasoning tasks in resource-constrained settings.

Introduction

The research paper "Think-at-Hard: Selective Latent Iterations to Improve Reasoning LLMs" by Tianyu Fu et al. proposes a method for improving the reasoning capabilities of LLMs under tight parameter budgets. It identifies a latent overthinking phenomenon in recurrent transformers: when a fixed number of extra latent iterations is applied uniformly to every token, easy tokens that were already predicted correctly after the first forward pass are sometimes revised into errors by the additional iterations.
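The failure mode can be made concrete with a toy numeric stand-in (a scalar "hidden state" and a two-class head; the damping update is an arbitrary illustrative dynamic, not the paper's model):

```python
# Toy illustration of latent overthinking in a recurrent transformer.
# Everything here is a hypothetical stand-in for the real LLM components.

def lm_head(h):
    """Map a scalar hidden state to logits over two candidate tokens."""
    return [h, 1.0 - h]

def latent_iteration(h):
    """Feed the last-layer hidden state back for one more refinement pass.
    The 0.5 damping is an arbitrary toy dynamic, not the paper's update."""
    return 0.5 * h

def predict(h):
    logits = lm_head(h)
    return max(range(len(logits)), key=lambda i: logits[i])

h = 0.9                                 # confident state after the first pass
first = predict(h)                      # easy token: already correct (token 0)
second = predict(latent_iteration(h))   # forced extra iteration flips it
print(first, second)                    # 0 1
```

The point of the toy: iterating on a token that was already easy can move its hidden state across the decision boundary, which is exactly the behavior TaH's selective triggering is designed to avoid.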

Methods and Architecture

The proposed method, Think-at-Hard (TaH), is a dynamic latent thinking mechanism that allocates extra iterations only to hard tokens, as identified by a lightweight neural decider. During these latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM's objective from general next-token prediction to focused refinement of hard tokens. A duo-causal attention mechanism extends causal attention from the token-sequence dimension to an additional iteration-depth dimension, enabling cross-iteration information flow while preserving full sequence-level parallelism.
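The duo-causal constraint can be sketched as a boolean mask over (position, iteration-depth) pairs. The flattened index layout below is an assumption for illustration, not the paper's implementation:

```python
def duo_causal_mask(seq_len, depth):
    """Boolean attention mask over (position t, iteration depth d) states.

    State (t, d) may attend to (t2, d2) iff t2 <= t (causal over the
    sequence) and d2 <= d (causal over iteration depth). Later iterations
    can therefore read earlier-iteration states at every visible position,
    while all positions within one iteration remain parallelizable.
    Index layout (depth-major flattening) is a sketch, not the paper's code.
    """
    n = seq_len * depth
    mask = [[False] * n for _ in range(n)]
    for t in range(seq_len):
        for d in range(depth):
            q = d * seq_len + t                  # flattened query index
            for t2 in range(t + 1):
                for d2 in range(d + 1):
                    mask[q][d2 * seq_len + t2] = True
    return mask

m = duo_causal_mask(3, 2)
# The depth-1 state at position 2 sees the depth-0 state at position 0 ...
assert m[1 * 3 + 2][0 * 3 + 0]
# ... but depth-0 states never see depth-1 states (future iterations).
assert not m[0 * 3 + 2][1 * 3 + 0]
```

In a real transformer this mask would be passed to the attention kernel; the sketch only pins down which (position, depth) pairs are mutually visible.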

The iteration process is controlled by the neural decider, which predicts from the first-pass hidden states which tokens are likely to be incorrect and therefore warrant further iteration. This avoids uniform iteration across all tokens, saving computation and preventing the corruption of predictions that are already correct.
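A minimal sketch of the decide-then-refine loop, with the decider modeled as logistic regression on the hidden state (weights, threshold, and the refinement function are hypothetical placeholders; in the paper the extra pass is the LoRA-adapted latent iteration):

```python
import math

def decider(hidden, w, b, threshold=0.5):
    """Lightweight neural decider, sketched as logistic regression:
    estimate the probability that the first-pass prediction is wrong.
    Weights and threshold are illustrative, not the paper's."""
    z = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    p_wrong = 1.0 / (1.0 + math.exp(-z))
    return p_wrong > threshold

def selective_refine(hiddens, w, b, refine):
    """Run the extra latent iteration only on tokens flagged as hard."""
    return [refine(h) if decider(h, w, b) else h for h in hiddens]

# Hypothetical numbers: two token hidden states, one confident, one not.
hiddens = [[-2.0, 1.0], [1.0, 0.5]]
w, b = [1.0, 1.0], 0.0
flags = [decider(h, w, b) for h in hiddens]
print(flags)  # [False, True] -> only the second token gets a latent iteration

# Placeholder refinement standing in for the LoRA-adapted extra pass.
refined = selective_refine(hiddens, w, b, lambda h: [0.9 * x for x in h])
```

With this gating, the easy token's hidden state passes through untouched, matching the paper's report that 94% of tokens are exempted from the second iteration.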

Empirical Results

The paper validates TaH on five challenging reasoning benchmarks, demonstrating improved LLM reasoning performance at the same parameter count. Against baselines that iterate twice on all output tokens, TaH achieves accuracy gains of 8.1-11.3% while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned on the same data, it gains 4.0-5.0%.

Further improvements follow when less than 3% additional parameters are allowed for the LoRA modules and the iteration decider: the gains grow to 8.5-12.6% and 5.3-5.4%, respectively. These results underscore TaH's ability to improve reasoning with minimal additional parameters and compute.

Implications and Future Work

The implications of this research are substantial for both theory and practice. Selective iteration improves the efficiency of smaller, computationally affordable LLMs, opening avenues for edge-computing applications where resources are limited. It also points toward future systems in which iteration depth is optimized dynamically at inference time, adapting to diverse problem domains without retraining.

Practically, TaH could benefit domains that require precise reasoning under stringent computational limits. Future research directions include extending the framework to other neural computation tasks, exploring interactions between the iteration decider and different attention schemes, and applying reinforcement learning to optimize the iteration-decision policy.

Conclusion

The Think-at-Hard methodology marks a significant step toward token-selective latent reasoning in LLMs while preserving parameter efficiency. Through selective iteration and architectural support for cross-iteration attention, TaH surpasses uniform recurrent-transformer baselines by specializing latent iterations for hard tokens. The study points toward more adaptable, efficient LLMs that can handle a broader spectrum of complex reasoning tasks without an expanded computational footprint.
