Cause of Qwen2.5-7B’s severe long-budget collapse

Determine whether the more severe accuracy collapse observed for Qwen2.5-7B-Instruct at large chain-of-thought token budgets (d ≥ 128) on the BFCL v3 Multiple-function benchmark is caused by richer, more self-consistent reasoning chains that the subsequent JSON answer-generation phase cannot override once they have diverged. To test this, re-run the Qwen2.5-7B-Instruct experiments with full reasoning-output storage and perform an error-category analysis across budgets.

Background

The paper reports a non-monotonic chain-of-thought (CoT) budget effect across models on the BFCL v3 Multiple-function benchmark. While both Qwen2.5-1.5B and Qwen2.5-7B peak at brief budgets, the 7B model exhibits a more severe accuracy collapse at long budgets (e.g., a large drop from d=32 to d=128).

Due to output-text truncation, the authors could not compute the error-category breakdown for the 7B model at d ≥ 128, leaving the mechanism behind the more severe collapse unresolved. They hypothesize it may result from richer, more self-consistent reasoning chains that the answer-generation phase cannot override, and suggest re-running the 7B experiments with full output storage to enable error-category analysis.
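The proposed re-run amounts to storing each model output untruncated and tallying error categories per budget. The sketch below illustrates one way such a breakdown could be computed; the category names (`malformed_json`, `wrong_function`, `wrong_arguments`), the record schema, and the exact matching rules are assumptions for illustration, not the paper's actual taxonomy or evaluation code.

```python
import json
from collections import Counter

def categorize(pred_text, gold_calls):
    """Assign one (hypothetical) error category to a stored, untruncated output.

    pred_text: the model's JSON answer as a string.
    gold_calls: list of reference calls, each {"name": ..., "arguments": {...}}.
    """
    try:
        pred_calls = json.loads(pred_text)
    except json.JSONDecodeError:
        return "malformed_json"
    if not isinstance(pred_calls, list):
        pred_calls = [pred_calls]
    gold_by_name = {c["name"]: c for c in gold_calls}
    if {c.get("name") for c in pred_calls} != set(gold_by_name):
        return "wrong_function"
    for c in pred_calls:
        if c.get("arguments") != gold_by_name[c["name"]]["arguments"]:
            return "wrong_arguments"
    return "correct"

def breakdown_by_budget(records):
    """records: iterable of dicts with keys 'budget', 'output', 'gold'.

    Returns {budget: Counter of categories}, i.e. the per-budget
    error-category breakdown the truncated logs could not support.
    """
    per_budget = {}
    for r in records:
        cats = per_budget.setdefault(r["budget"], Counter())
        cats[categorize(r["output"], r["gold"])] += 1
    return per_budget
```

With full outputs stored, comparing these per-budget counters between d=32 and d=128 would show whether the extra failures at long budgets concentrate in one category, which is the evidence the hypothesis needs.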

References

Note: because the stored 7B output text is truncated at 300 chars (see Table~\ref{tab:multimodel} caption), the error-category breakdown is not available for 7B at $d{\geq}128$. We hypothesize---though cannot confirm from this data---that the 7B model's more severe collapse may stem from richer, more self-consistent reasoning chains that are harder for the answer phase to override once they have diverged; testing this hypothesis requires re-running the 7B experiment with full output storage to enable error-category analysis.

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents  (2604.02155 - Qi, 2 Apr 2026) in Section 6.2 (Cross-Scale: Qwen2.5-7B-Instruct)