Cause of Qwen2.5-7B’s severe long-budget collapse
Determine whether the more severe accuracy collapse observed for Qwen2.5-7B-Instruct at large chain-of-thought token budgets (d ≥ 128) on the BFCL v3 Multiple-function benchmark is caused by richer, more self-consistent reasoning chains that the subsequent JSON answer-generation phase cannot override once they have diverged. To test this, re-run the Qwen2.5-7B-Instruct experiments with full reasoning-output storage (the original runs truncated stored outputs at 300 characters) and perform an error-category analysis across budgets; a sketch of the re-run appears below.
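A minimal sketch of the re-run with full output storage, assuming a two-phase (budgeted reasoning, then JSON answer) prompting setup via Hugging Face transformers. The budget grid, the load_bfcl_multiple loader, and the example accessors (reasoning_messages, answer_messages, gold_calls) are hypothetical stand-ins for the original experiment's harness, not confirmed by the source:

    import json
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
    BUDGETS = [16, 32, 64, 128, 256]  # assumed grid; d >= 128 is the collapse region

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto")

    def generate(messages, max_new_tokens):
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Decode only the newly generated tokens.
        return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)

    with open("qwen7b_full_outputs.jsonl", "w") as f:
        for example in load_bfcl_multiple():  # hypothetical BFCL v3 Multiple loader
            for d in BUDGETS:
                # Phase 1: chain-of-thought; budget enforced via max_new_tokens (assumption).
                reasoning = generate(example.reasoning_messages(budget=d), max_new_tokens=d)
                # Phase 2: JSON answer generation conditioned on the reasoning.
                answer = generate(example.answer_messages(reasoning), max_new_tokens=256)
                f.write(json.dumps({
                    "id": example.id,
                    "budget": d,
                    "reasoning": reasoning,       # stored in full -- no 300-char cap
                    "answer": answer,
                    "gold": example.gold_calls,   # hypothetical gold function calls
                }) + "\n")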
References
Note: because the stored 7B output text is truncated at 300 chars (see Table~\ref{tab:multimodel} caption), the error-category breakdown is not available for 7B at $d{\geq}128$. We hypothesize---though cannot confirm from this data---that the 7B model's more severe collapse may stem from richer, more self-consistent reasoning chains that are harder for the answer phase to override once they have diverged; testing this hypothesis requires re-running the 7B experiment with full output storage to enable error-category analysis.
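To make the required error-category analysis concrete, here is a sketch over the stored JSONL. The taxonomy (malformed JSON, wrong function, wrong arguments) is illustrative rather than the paper's actual categories, and the exact-match argument check is a stand-in for BFCL's AST-based comparison:

    import json
    from collections import Counter, defaultdict

    def categorize(answer_text, gold_calls):
        # Coarse, illustrative taxonomy; substitute the paper's actual categories.
        try:
            pred = json.loads(answer_text)
        except (json.JSONDecodeError, TypeError):
            return "malformed_json"
        if not isinstance(pred, list):
            pred = [pred]
        pred_names = {c.get("name") for c in pred if isinstance(c, dict)}
        gold_names = {c["name"] for c in gold_calls}
        if pred_names != gold_names:
            return "wrong_function"
        # Exact-match argument check as a stand-in for BFCL's AST comparison.
        gold_args = {c["name"]: c.get("arguments", {}) for c in gold_calls}
        for c in pred:
            if c.get("arguments", {}) != gold_args.get(c.get("name")):
                return "wrong_arguments"
        return "correct"

    # Tally category shares per budget d.
    by_budget = defaultdict(Counter)
    with open("qwen7b_full_outputs.jsonl") as f:
        for line in f:
            r = json.loads(line)
            by_budget[r["budget"]][categorize(r["answer"], r["gold"])] += 1

    for d in sorted(by_budget):
        total = sum(by_budget[d].values())
        print(f"d={d:>4}: " + ", ".join(
            f"{cat}: {n / total:.1%}" for cat, n in by_budget[d].most_common()))

Comparing these shares across budgets indicates whether errors at $d{\geq}128$ are dominated by well-formed but wrong calls rather than formatting failures, which is what the divergent-reasoning hypothesis would predict.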