
Improving Large Language Model Fine-tuning for Solving Math Problems (2310.10047v1)

Published 16 Oct 2023 in cs.CL

Abstract: Despite their success in many natural language tasks, solving math problems remains a significant challenge for LLMs. A large gap exists between LLMs' pass-at-one and pass-at-N performance in solving math problems, suggesting LLMs might be close to finding correct solutions, motivating our exploration of fine-tuning methods to unlock LLMs' performance. Using the challenging MATH dataset, we investigate three fine-tuning strategies: (1) solution fine-tuning, where we fine-tune to generate a detailed solution for a given math problem; (2) solution-cluster re-ranking, where the LLM is fine-tuned as a solution verifier/evaluator to choose among generated candidate solution clusters; (3) multi-task sequential fine-tuning, which integrates both solution generation and evaluation tasks together efficiently to enhance the LLM performance. With these methods, we present a thorough empirical study on a series of PaLM 2 models and find: (1) The quality and style of the step-by-step solutions used for fine-tuning can make a significant impact on the model performance; (2) While solution re-ranking and majority voting are both effective for improving the model performance when used separately, they can also be used together for an even greater performance boost; (3) Multi-task fine-tuning that sequentially separates the solution generation and evaluation tasks can offer improved performance compared with the solution fine-tuning baseline. Guided by these insights, we design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the few-shot performance of pre-trained PaLM 2-L model with majority voting.

Improving LLM Fine-tuning for Solving Math Problems

The paper "Improving LLM Fine-tuning for Solving Math Problems" addresses the challenge of enhancing the mathematical problem-solving capabilities of LLMs such as PaLM 2 and GPT-4. Although these models have demonstrated substantial competence in various natural language processing tasks, they still struggle significantly with mathematical reasoning and computation.

Given the capability gap between pass-at-one (single-attempt accuracy) and pass-at-N (multiple-attempt accuracy) performance in solving math problems, the authors focus on fine-tuning strategies to close it. Specifically, the paper explores three distinct strategies:

  1. Solution Fine-tuning: Fine-tuning the LLMs to generate step-by-step solutions to math problems. This method leverages the detailed mathematical reasoning elicited during training.
  2. Solution-cluster Re-ranking: This strategy enhances the model's solution evaluation ability by not only generating candidate solutions but also assessing them. By clustering equivalent solutions and applying evaluative reranking, the approach effectively incorporates both majority voting and re-ranking advantages.
  3. Multi-task Sequential Fine-tuning: The integration of solution generation and evaluation tasks in a sequential approach aims to improve the overall performance by borrowing beneficial aspects from both task objectives.
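The pass-at-one versus pass-at-N gap that motivates these strategies can be quantified with the standard unbiased pass@k estimator from the code-generation evaluation literature (not introduced by this paper); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n candidates of which c
    are correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that solves a problem in only 3 of 40 samples has a low
# pass@1 but a pass@40 of 1.0 -- the gap the paper aims to close.
print(pass_at_k(40, 3, 1))   # 0.075
print(pass_at_k(40, 3, 40))  # 1.0
```

The larger the spread between pass@1 and pass@N, the more headroom there is for methods, such as re-ranking and majority voting, that pick a better answer from the existing sample pool.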

The implementations and experiments are conducted on the MATH dataset using PaLM 2 models in both small and large variants. The results yield several key findings:

  • Quality of Solutions: The performance improvement is contingent upon the quality and granularity of the solutions used for fine-tuning. Models fine-tuned with more structured and detailed solutions (such as those generated by GPT-4) outperform those using only the dataset's original, more abstract solutions.
  • Solution Re-ranking and Majority Voting: While re-ranking or majority voting independently enhances performance, combining them results in superior outcomes. The re-ranking strategy that focuses on most-frequent solution clusters proves both effective and computationally economical.
  • Multi-task Fine-tuning Advantage: Training the model sequentially on both solution generation and evaluation improves problem-solving, indicating that evaluation-oriented training signals also benefit the generation task.
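The solution-cluster re-ranking described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact recipe: `verifier_score` is a hypothetical stand-in for the fine-tuned evaluator, and the ranking rule (cluster size first, mean verifier score as tie-breaker) is an assumption.

```python
from collections import defaultdict

def rerank_solution_clusters(candidates, verifier_score):
    """Pick a final answer by clustering sampled solutions on their
    final answer, then ranking clusters by (frequency, mean verifier
    score). `candidates` is a list of (solution_text, final_answer)
    pairs; `verifier_score` maps a solution to a score in [0, 1]."""
    clusters = defaultdict(list)
    for solution, answer in candidates:
        clusters[answer].append(solution)

    def cluster_key(answer):
        solutions = clusters[answer]
        mean_score = sum(verifier_score(s) for s in solutions) / len(solutions)
        return (len(solutions), mean_score)

    return max(clusters, key=cluster_key)

# Toy example: three samples agree on "4", one says "5";
# majority voting alone would already pick "4" here.
samples = [("work A; answer 4", "4"), ("work B; answer 4", "4"),
           ("work C; answer 5", "5"), ("work D; answer 4", "4")]
print(rerank_solution_clusters(samples, verifier_score=lambda s: 0.5))
```

In practice the paper notes that restricting the verifier to the most frequent clusters keeps the approach computationally economical, since only a handful of candidate clusters need to be scored.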

The empirical evaluation establishes that fine-tuning with the proposed strategies notably improves the LLMs' math problem-solving performance over pre-trained models. Specifically, fine-tuned PaLM 2-L models reach approximately 58.8% accuracy on the MATH dataset, an 11.2% accuracy improvement over the few-shot performance of the pre-trained PaLM 2-L model with majority voting.

Implications: These findings have practical implications for enhancing LLM utility in mathematically intensive applications, suggesting concrete fine-tuning pathways toward more reliable solutions. Theoretically, the work raises questions about how task-specific fine-tuning objectives, particularly the separation of solution generation and evaluation, shape a model's reasoning ability compared with standard solution fine-tuning alone.

Future Directions: This paper opens up avenues for further research, including automated solution-quality assessment that does not rely on external evaluations, and hybrid approaches that integrate symbolic computation tools with LLMs to strengthen mathematical problem-solving. Additionally, the scalability of these methods to more diverse and complex datasets remains a prospective area of interest.

Authors (5)
  1. Yixin Liu (108 papers)
  2. Avi Singh (21 papers)
  3. C. Daniel Freeman (22 papers)
  4. John D. Co-Reyes (16 papers)
  5. Peter J. Liu (30 papers)
Citations (32)