- The paper demonstrates that F-CoT restructures inputs by separating fact extraction and reasoning, achieving a 2–3× token reduction while maintaining accuracy.
- It employs a two-stage prompting protocol—context extraction followed by focused reasoning—to minimize overthinking and improve inference efficiency.
- Empirical results on mathematical benchmarks confirm drastic token savings, practical cost reductions, and potential for scalable LLM deployment.
Introduction and Motivation
The paper explores inference efficiency in LLMs tackling complex reasoning tasks, primarily in mathematical domains. Traditional chain-of-thought (CoT) prompting improves interpretability and accuracy but incurs high computational and time costs due to lengthy reasoning traces. Existing methods to mitigate inference cost focus on altering model behavior—rewards for brevity, fine-tuning, or explicit instructions to reduce output verbosity. In contrast, this paper advances an input-centric, training-free approach: Focused Chain-of-Thought (F-CoT), which restructures the input representation by explicitly separating relevant information extraction from downstream reasoning.
F-CoT operationalizes principles from cognitive psychology, especially the Active Control of Thought (ACT) framework, emphasizing sequential and resource-efficient problem solving via distinct information extraction and reasoning phases. The hypothesis is that explicitly structured input blocks sharpen attention, minimize overthinking, and reduce extraneous token generation, thus enhancing inference efficiency without diminishing final reasoning correctness.
Figure 1: The F-CoT process: extraction of key facts into an XML context block, followed by focused reasoning that yields significantly shorter traces compared to standard CoT prompting.
Methods: Structured Reasoning Pipeline and Implementation
F-CoT utilizes a two-stage prompting protocol:
- Context Extraction: The LLM receives a prompt instructing it to extract all information pertinent to the question into a compact, structured format (e.g., XML-like blocks with enumerated facts and the formal question). This can be either manually provided by the user or automatically generated—possibly by a distinct, larger LLM.
- Reasoning over Context: The LLM is then prompted to perform stepwise reasoning solely with this compact context, explicitly citing relevant information blocks. The original verbose natural-language question is omitted.
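A minimal sketch of this two-stage protocol is given below. The prompt wording, the XML tag names, and the `llm` completion helper are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of the two-stage F-CoT protocol.
# Assumption: `llm(prompt)` is a placeholder for any text-completion call
# (local model or API client); prompt wording and XML tags are illustrative.

EXTRACTION_TEMPLATE = """Extract every fact needed to answer the question below.
Return a compact XML block of the form:
<context>
  <fact id="1">...</fact>
  <question>...</question>
</context>

Problem:
{problem}
"""

REASONING_TEMPLATE = """Using only the facts in the context block, reason step by step
and give the final answer. Cite facts by their id.

{context}
"""


def llm(prompt: str) -> str:
    """Placeholder for a model call; swap in your own client here."""
    raise NotImplementedError


def focused_cot(problem: str, context: str | None = None) -> str:
    # Stage 1: context extraction (skipped if the user supplies a context block).
    if context is None:
        context = llm(EXTRACTION_TEMPLATE.format(problem=problem))
    # Stage 2: reasoning over the compact context only; the verbose
    # natural-language problem statement is deliberately omitted.
    return llm(REASONING_TEMPLATE.format(context=context))
```

The two stages need not use the same model; a stronger extractor can feed a smaller reasoner, as in the paper's use of GPT-5 mini for context extraction.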
The framework supports both user-defined contexts and automatic extraction through two-step LLM prompting. Empirically, structured formats (XML, enumerated lists) are effective not because of their syntax but because of their focus and compactness. The abstraction is robust: other schemas can also be used for the context, but stronger structure further decreases average output length.
Experimental Evaluation and Results
Experiments focus on arithmetic and mathematical benchmarks (SVAMP, GSM-Hard, MATH-500, AIME) using Qwen3 models (0.6B to 32B) and GPT-5 mini for context extraction. Performance is measured via Pass@5 score and average token length per problem, comparing F-CoT with standard zero-shot CoT and other prompting approaches (Plan-and-Solve, CoRe).
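Both reported quantities are straightforward to compute. The sketch below uses the standard unbiased pass@k estimator and a simple token average; it is a generic illustration rather than the paper's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_tokens(token_counts: list[int]) -> float:
    """Average generated tokens per problem (the efficiency metric)."""
    return sum(token_counts) / len(token_counts)

# Example: 5 samples per problem, 2 correct -> pass@5 = 1.0,
# while the token counts give the cost side of the comparison.
print(pass_at_k(n=5, c=2, k=5))          # 1.0
print(avg_tokens([412, 388, 505, 441]))  # 436.5
```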
Strong empirical claims:
- Across the evaluated benchmarks, F-CoT reduces average generated tokens by roughly 2–3× relative to standard zero-shot CoT while maintaining Pass@5 accuracy.
Further annotation of reasoning traces shows:
Contradictory or Bold Findings:
- Explicit instructions or fine-tuning to promote brevity are not necessary; restructuring the input alone yields substantial efficiency gains.
- Introducing structured context to already condensed mathematical problems (e.g., AIME) gives further token savings but may slightly reduce accuracy, possibly due to test set contamination or loss of implicit contextual cues.
Sensitivity and Ablation:
- Removing explicit prompt instructions, or providing both the original question and the context, slightly increases token count and accuracy, indicating the method is robust to prompt variations while F-CoT's strict context-only input remains the most token-efficient choice.
- Increasing the degree of input structure (numbering, formal blocks) decreases token count further without affecting accuracy.
- Context validity and extraction reliability saturate at mid-scale model sizes; larger models do not further improve structured extraction.
Implications and Future Directions
Practical Implications
F-CoT provides a general, training-free method to accelerate LLM inference for reasoning tasks, agnostic to the underlying architecture or training protocol. It is highly practical for real-world deployment where latency and compute cost are critical, especially in agentic and math-intensive domains. The separation of extraction and reasoning phases not only improves interpretability and debugging but also increases pipeline modularity for scalable LLM service ecosystems: context blocks can be reused, cached, or iteratively updated with minimal rework.
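As a concrete illustration of the modularity argument, a cache of extracted context blocks lets repeated or iteratively refined queries skip the extraction stage. This sketch is an assumption of ours, not part of the paper, and reuses the hypothetical `llm`, `EXTRACTION_TEMPLATE`, and `focused_cot` helpers from the earlier protocol sketch.

```python
import hashlib
from typing import Callable

# Illustrative sketch (not from the paper): cache extracted context blocks,
# keyed by a hash of the raw problem text, so that repeated or iteratively
# refined queries skip the extraction stage entirely.
_context_cache: dict[str, str] = {}

def cached_context(problem: str, extract: Callable[[str], str]) -> str:
    key = hashlib.sha256(problem.encode("utf-8")).hexdigest()
    if key not in _context_cache:
        _context_cache[key] = extract(problem)  # stage 1 runs at most once per problem
    return _context_cache[key]

# Usage (assuming the earlier sketch):
# context = cached_context(problem, lambda p: llm(EXTRACTION_TEMPLATE.format(problem=p)))
# answer = focused_cot(problem, context=context)
```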
Theoretical Implications
The results suggest that LLMs overthink and attend to irrelevant aspects of raw, verbose input due to the entwined nature of extraction and reasoning in natural language. Decoupling these steps via structured input mimics resource-efficient, human-like cognitive architectures. This aligns with theories of attention and working memory, indicating possible directions for architectural or training-time modifications—training LLMs to natively process and reason over structured input formats.
Outlook and Future AI Research
Extending F-CoT to multimodal reasoning tasks, e.g., vision-LLMs, is promising: compact, structured representations of visual facts may similarly reduce inference cost. It can be integrated with advanced test-time scaling strategies (e.g., tree-of-thought, s1 scaling) or as part of dynamic agentic context ("notepad") updating schemes. Future work may also investigate F-CoT's application in training-time interventions or direct architecture modifications to further embed efficient reasoning priors.
Conclusion
F-CoT establishes that input-centric restructuring—explicit information extraction followed by context-bound reasoning—is a powerful alternative to model-centric optimization for efficient LLM reasoning. The method achieves substantial reductions in token usage and inference latency, maintaining or exceeding baseline accuracy. The findings highlight the latent efficiency gains unlocked by structured input representation, providing new avenues for practical and theoretical advancement in scalable AI reasoning systems.