- The paper shows that standard supervised fine-tuning with cross-entropy loss induces overconfidence, misaligning the model with pass@N test-time strategies and limiting the performance gains available from scaling test-time compute.
- To address this, the authors propose Direct Coverage Optimization (DCO), a modified training loss that curbs overconfidence by attenuating gradients for highly confident examples, improving mathematical reasoning.
- Experiments show DCO improves performance on mathematical reasoning benchmarks like MATH and MiniF2F, and a stepwise application enhances theorem proving by controlling exploration.
The paper "Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning" explores the intersection between model training protocols and test-time computation strategies in the context of LLMs, specifically addressing performance improvements in tasks involving mathematical reasoning.
Key Contributions and Findings
- Misalignment with Cross-Entropy Loss: The authors identify a misalignment between standard supervised fine-tuning with cross-entropy (CE) loss and the pass@N test-time strategy, in which N independent samples are generated and the task counts as solved if at least one is correct (this metric is sketched in code after the list). Empirically, continuing to minimize CE loss can actually decrease pass@N performance for large N, because models become overconfident in their predictions.
- Framework and Empirical Verification: A theoretical framework is developed to explain how overconfidence induced by CE loss hampers the potential gains from scaling test-time computation. The authors use pass@N coverage metrics to demonstrate how excessive confidence prevents the model from benefiting from additional test-time compute allocation.
- Direct Coverage Optimization (DCO): To address the identified misalignment, the authors propose a modified training loss that directly optimizes pass@N coverage by discouraging overconfidence. The DCO loss attenuates the gradient for examples on which the model is already highly confident, which naturally limits overconfidence (a loss sketch also follows the list).
- Experimental Validation: Experiments on benchmarks such as MATH and MiniF2F show that models trained with DCO achieve improved mathematical reasoning. The paper reports gains in both theorem proving and short-answer problem solving when the training loss is aligned with the pass@N strategy.
- Stepwise Application in Theorem Proving: For theorem proving, the paper introduces a stepwise application of DCO that controls exploration at each proof step via a hyperparameter N_eff. Varying N_eff shifts the induced search from deeper, narrower exploration toward wider search, and ensembling models trained with different N_eff values yields significant gains in proof success rates.
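To make the pass@N metric concrete, the following minimal sketch (ours, not the paper's code) implements the standard unbiased pass@N estimator of Chen et al. (2021): draw n samples per problem, count the c correct ones, and compute the probability that a random size-N subset contains at least one correct sample.

```python
from math import comb

def pass_at_n(n_samples: int, n_correct: int, N: int) -> float:
    """Unbiased pass@N estimate: probability that at least one of N
    generations drawn (without replacement) from n_samples is correct."""
    if n_samples - n_correct < N:
        return 1.0  # every size-N subset contains a correct sample
    return 1.0 - comb(n_samples - n_correct, N) / comb(n_samples, N)

# Illustrative numbers: a model correct on ~10% of samples still
# covers most problems once N is large.
for N in (1, 16, 256):
    print(N, round(pass_at_n(1024, 102, N), 3))
```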
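The description above also pins down the shape of a coverage objective. Below is a minimal sketch, assuming the per-example quantity being trained is the model's total log-probability of the reference answer and that test-time samples are i.i.d.; the paper's exact formulation may differ in detail.

```python
import torch

def coverage_loss(logp_answer: torch.Tensor, N: int) -> torch.Tensor:
    """Negative log pass@N coverage, -log(1 - (1 - p)^N), where
    p = exp(logp_answer) is the model's probability of the reference
    answer. Its gradient w.r.t. log p equals the CE gradient scaled by
    N * p * (1 - p)^(N - 1) / (1 - (1 - p)^N), a factor that tends to
    1 as p -> 0 (CE-like) and to 0 as p -> 1, so updates are
    attenuated on examples the model is already confident about."""
    p = logp_answer.exp().clamp(max=1.0 - 1e-6)
    # (1 - p)^N computed stably in log space via N * log1p(-p).
    coverage = -torch.expm1(N * torch.log1p(-p))  # = 1 - (1 - p)^N
    return -coverage.clamp(min=1e-12).log().mean()
```

For N = 1 this reduces to ordinary CE on the answer; per the summary's description, the stepwise theorem-proving variant applies an analogous per-step attenuation with the hyperparameter N_eff in place of N.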
Theoretical Insights
- Tradeoff between Exploration and Exploitation: Theoretical analysis demonstrates that achieving optimal pass@N performance requires balancing confident exploitation against exploratory sampling. Optimal policies should exhibit low confidence at large N, exploring more candidate solutions, and high confidence at small N, where exploiting the best guess pays off (a toy calculation after this list illustrates the effect).
- Upper and Lower Bounds on Model Confidence: The paper mathematically derives bounds on model confidence as a function of the sample budget N, showing that strategies maximizing pass@N must adjust their confidence levels to the test-time compute strategy.
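A toy calculation (illustrative, not from the paper) makes the tradeoff concrete. Suppose the policy puts probability q on its top answer and 1 - q on its runner-up, and that across problems the top answer is correct 60% of the time and the runner-up otherwise:

```python
def expected_pass_at_N(q: float, N: int, top_correct: float = 0.6) -> float:
    """Expected pass@N under a two-answer toy model: the policy samples
    its top answer with probability q, its runner-up with 1 - q; the
    top answer is correct on a fraction top_correct of problems."""
    return top_correct * (1 - (1 - q) ** N) + (1 - top_correct) * (1 - q ** N)

for q in (1.0, 0.9, 0.5):
    print(f"q={q}: " + ", ".join(
        f"pass@{N}={expected_pass_at_N(q, N):.3f}" for N in (1, 4, 64)))
```

The fully confident policy (q = 1) wins at pass@1 but stays capped at 0.6 for every N, while lower-confidence policies approach full coverage as N grows, which is exactly the confidence-vs-budget dependence the bounds formalize.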
Broader Implications
The research underscores the need for closer integration between training protocols and test-time strategies, suggesting that the traditional compartmentalization of these stages in LLM development limits the performance improvements achievable through test-time compute. This motivates training objectives tailored to anticipated test-time strategies, potentially yielding more adaptable and capable LLM systems.
Overall, the work offers substantial insight into the limitations of current fine-tuning methodology and a concrete remedy for improving LLM performance on compute-intensive reasoning tasks: align the training objective with the test-time strategy.