- The paper demonstrates that brief chain-of-thought (8–32 tokens) significantly boosts function-calling accuracy, outperforming no-CoT baselines.
- It reveals that extended reasoning increases function hallucinations and wrong selections, leading to a collapse in performance.
- The introduction of Function-Routing CoT (FR-CoT) achieves peak accuracy with zero hallucinations, eliminating the need for budget tuning.
Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Summary
This work provides a systematic and granular analysis of the relationship between chain-of-thought (CoT) reasoning length and task accuracy in function-calling LLM agents. The central finding is that CoT length exhibits a non-monotonic effect on function-calling accuracy: brief reasoning (8–32 tokens) sharply improves accuracy, but longer CoT traces degrade performance to significantly below direct-answer (no-CoT) baselines. This non-monotonicity is isolated in structured tool-use settings, using the Berkeley Function Calling Leaderboard v3 (BFCL) Multiple split, and validated across several models and architectures. The authors dissect the error mechanisms and introduce Function-Routing CoT (FR-CoT), a structured brief-CoT prompting protocol, which achieves the same peak accuracy as unconstrained brief CoT while producing zero function hallucinations and requiring no budget tuning.
Experimental Framework
Experiments were primarily conducted on Qwen2.5-1.5B-Instruct and cross-validated with Qwen2.5-7B-Instruct and Phi-3-mini-4k-instruct (3.8B). The benchmark comprises 200 BFCL v3 Multiple tasks, each a prompt with 2–4 candidate function schemas requiring structured action selection and argument infilling. Six CoT budgets (0, 32, 64, 128, 256, 512 tokens) were exhaustively evaluated with greedy decoding, measuring accuracy as exact function and argument match to ground truth.
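The budget-sweep protocol above can be sketched as a small evaluation harness. This is an illustrative reconstruction, not the authors' code: `generate_call` stands in for the model under a given CoT token budget (here replaced by a toy stub so the harness runs), and accuracy is exact match on function name plus arguments, as in the paper's metric.

```python
def exact_match(pred: dict, gold: dict) -> bool:
    """Exact function-and-argument match against ground truth."""
    return pred.get("name") == gold["name"] and pred.get("args") == gold["args"]

def sweep_budgets(tasks, generate_call, budgets=(0, 32, 64, 128, 256, 512)):
    """Return {budget: accuracy} over the task set for each CoT token budget."""
    results = {}
    for d in budgets:
        correct = sum(
            exact_match(generate_call(t["prompt"], d), t["gold"]) for t in tasks
        )
        results[d] = correct / len(tasks)
    return results

# Toy stand-in for the model: succeeds only at brief budgets, mimicking the
# reported non-monotonic pattern. Illustrative only, not real model output.
def toy_model(prompt, budget):
    if budget in (16, 32):
        return {"name": "get_weather", "args": {"city": "Paris"}}
    return {"name": "get_time", "args": {}}

tasks = [
    {"prompt": "weather in Paris?",
     "gold": {"name": "get_weather", "args": {"city": "Paris"}}},
]
acc = sweep_budgets(tasks, toy_model)
```

In the real experiments the sweep is run with greedy decoding over all 200 BFCL v3 Multiple tasks for each model.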
The study introduces a three-way error decomposition: (i) hallucinated function (non-candidate function names), (ii) wrong valid function (wrong, but in-candidate, selection), and (iii) wrong arguments, with unparseable JSON as a residual class. Additionally, pre-reasoning entropy (H₀) is computed on the action distribution to probe uncertainty-gated budget allocation.
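The three-way decomposition with its residual class amounts to a short classifier over the model's raw output. The following is a hedged sketch of how such a classifier might look; the function name and dict shapes are assumptions for illustration, not the paper's implementation.

```python
import json

def classify_error(raw_output: str, candidates: set, gold: dict) -> str:
    """Classify a predicted call per the paper's three-way error decomposition,
    with 'unparseable' as the residual class."""
    try:
        pred = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return "unparseable"
    name = pred.get("name")
    if name not in candidates:
        return "hallucinated_function"   # non-candidate function name
    if name != gold["name"]:
        return "wrong_valid_function"    # in-candidate, but wrong selection
    if pred.get("args") != gold["args"]:
        return "wrong_arguments"
    return "correct"
```

For example, an output naming a function absent from the 2–4 candidate schemas is counted as a hallucination regardless of its arguments.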
Core Empirical Results
The experiments unequivocally demonstrate:
- Non-Monotonic Budget Effect: For Qwen2.5-1.5B, brief CoT (32 tokens) achieves 64.0% accuracy (+45% relative to no-CoT at 44.0%), while longer budgets (256 tokens) collapse to 25.0%—significantly below the no-CoT baseline (p<0.001). Fine-grained sweeps show the true optimum at 8–16 tokens (69.0% at d=16).
- Error Mechanism: At d=0, function selection is the primary failure (30.5% wrong valid function). Brief CoT nearly eliminates this (1.5%), acting as a strong routing prior that anchors output. At d=256, the error pattern reverses: function hallucination rises (18.0%), and wrong valid selection resurges (28.0%), confirming that extended reasoning actively misdirects the agent.
- Oracle Budget Requirement: 88.6% of solvable tasks have an optimal budget of ≤32 tokens (mean: 27.6). Longer budgets do not provide incremental gains on the vast majority of tasks.
- Function-Routing CoT (FR-CoT): Introducing a structured routing step (prompt: "Function: [name] / Key args: [...]") locks the model into a valid candidate before argument generation. FR-CoT matches unconstrained d=32 accuracy (64.0% 1.5B, 83.0% 7B) with zero hallucinations, and requires no budget sweep.
- Architecture Dependence: Qwen2.5 models (1.5B/7B) display severe below-baseline collapse at long budgets. Phi-3-mini also peaks at d=32 and degrades monotonically thereafter, but it remains above its no-CoT baseline. This resilience is traced to its high end-of-sequence (EOS) self-termination rate at high budgets, which serves as a natural safeguard against over-generation.
- Entropy-Gated Computation: Pre-reasoning entropy provides only weak, non-significant directional signals regarding when CoT helps; no H₀-based gating policy outperforms always using d=32.
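The FR-CoT bullet above describes a routing step that commits the model to an in-candidate function before argument generation. A minimal sketch of how that protocol could be wired up follows; the template wording, function names, and regex are assumptions for illustration, since the paper specifies only the "Function: [name] / Key args: [...]" format.

```python
import re

# Assumed template instructing the model to emit the structured routing line
# before the final JSON call (illustrative wording, not the paper's exact prompt).
FR_COT_TEMPLATE = (
    "Before answering, reason briefly in exactly this format:\n"
    "Function: [name] / Key args: [...]\n"
    "Then emit the final call as JSON."
)

def build_fr_cot_prompt(user_query: str, schemas: list) -> str:
    """Assemble an FR-CoT-style prompt listing the candidate functions."""
    names = ", ".join(s["name"] for s in schemas)
    return f"Candidates: {names}\n{FR_COT_TEMPLATE}\nUser: {user_query}"

def parse_routing_line(text: str, candidates: set):
    """Extract the routed function name; reject non-candidate (hallucinated)
    names so only valid routes proceed to argument generation."""
    m = re.search(r"Function:\s*([A-Za-z_][\w.]*)", text)
    if m and m.group(1) in candidates:
        return m.group(1)
    return None
```

Rejecting any name outside the candidate set is what makes the zero-hallucination guarantee mechanical rather than probabilistic.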
Theoretical and Practical Implications
The results have direct implications for the design and deployment of tool-using LLM agents:
- Optimal Reasoning Budget is Very Short: The empirical optimum is at or below 32 tokens. Longer CoT yields no gains; in fact, it consistently renders performance worse than direct answers, due to compounding misdirection and function hallucination. This stands in sharp contrast with arithmetic or open-text reasoning domains, where longer CoT traces can increase performance.
- Mechanistic Explanation of Overthinking: Unlike overthinking in math LLMs, where extended reasoning leads to path abandonment, in structured function-calling the dual effect is (1) a re-emergence of function selection errors and (2) the introduction of hallucinated, non-candidate actions. These errors are exacerbated by strict output schema requirements and prompt demotion of the answer format.
- Structured Prompting Is Superior to Output Constraints: FR-CoT outperforms log-prob constrained decoding, particularly on larger models. Output constraints can eliminate hallucinations but do not provide the routing inductive bias necessary for high accuracy, and they introduce distribution shifts due to prefix injection during constrained generation.
- Cross-Model Robustness and Architectural Effects: The brief-CoT optimum generalizes across models, but the severity of collapse with long budgets is architecture dependent. Models with a natural tendency to self-terminate (high EOS rates) are more resilient against extended CoT misdirection.
- Compute-Efficient Recommendations: For practitioners, a fixed brief reasoning budget (8–32 tokens) is nearly optimal—balancing accuracy and FLOPs. FR-CoT is preferable in high-reliability deployments, as it achieves zero hallucination without budget tuning or model modification.
- Test-Time Compute Scaling Paradigms Do Not Universally Transfer: Approaches such as "always allocate more CoT for harder problems" must be revisited for structured action domains; adding reasoning steps is not always beneficial and can be harmful.
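The entropy-gating result discussed above (no H₀-based policy beats a fixed d=32) can be made concrete with a small sketch. The gate below is hypothetical, assuming access to the model's pre-reasoning distribution over candidate functions; the threshold value is arbitrary and only illustrates the policy shape the paper evaluates and rejects.

```python
import math

def pre_reasoning_entropy(action_probs: dict) -> float:
    """H0 = -sum p * log p over the candidate-function distribution."""
    return -sum(p * math.log(p) for p in action_probs.values() if p > 0)

def choose_budget(action_probs: dict, threshold: float = 0.5) -> int:
    """Hypothetical entropy gate: skip CoT when the model is already confident,
    else allocate the brief 32-token budget. The paper finds no such gate
    outperforms always using d=32, so this is illustrative, not a recommendation."""
    return 0 if pre_reasoning_entropy(action_probs) < threshold else 32
```

Under this sketch a uniform two-way distribution (H₀ = ln 2 ≈ 0.69) triggers the 32-token budget, while a sharply peaked one skips reasoning entirely.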
Limitations and Future Directions
The work is limited to current open-source function-calling models (Qwen2.5, Phi-3-mini) and the BFCL benchmark. Results may shift for models explicitly trained with deep CoT supervision (e.g., OpenAI o1, DeepSeek-R1), or in multi-step, multi-turn agentic scenarios. Additionally, more nuanced gating signals may be realized by combining entropy with features capturing argument complexity and schema similarity. There is also scope for integrating CoT within the output JSON or via grammar-constrained generation to potentially sidestep the format/CoT tradeoff.
Future research directions include:
- Extending analysis to models with explicit CoT tuning or more recent architectures
- Studying multi-turn, multi-action pipelines and the interplay of CoT budget at the episode or task level
- Exploring richer adaptive computation signals (e.g., full-prefix entropy, argument-specific uncertainty)
- Formalizing structured reasoning protocols beyond FR-CoT for broader classes of LLM-driven agents
Conclusion
This paper establishes that for function-calling language agents, brief chain-of-thought reasoning is not only sufficient but optimal, and increasing CoT budget is detrimental beyond sharp, model-specific thresholds. Explicit structured routing (FR-CoT) matches the performance ceiling of unconstrained brief CoT while eliminating hallucination, providing a simple, robust solution for agent system designers. These findings challenge the paradigm of unbounded test-time reasoning in structured domains and highlight the need for careful calibration of "thinking budgets" in LLM-powered tool agents. The work also motivates structured prompting as a more effective strategy than output-level constraints and provides a robust empirical basis for brief reasoning as a practical default in function-calling architectures.
Reference: "Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents" (2604.02155)