- The paper introduces Unsupervised Prefix Fine-Tuning (UPFT), leveraging consistency in early reasoning steps to improve LLM reasoning without labeled data.
- UPFT fine-tunes models on early reasoning path prefixes, guided by a theoretical framework optimizing both the coverage and accuracy of potential solutions.
- Empirical results show UPFT improves performance on complex reasoning tasks, is highly data-efficient, and matches supervised fine-tuning methods while using significantly fewer sampled tokens.
The paper introduces Unsupervised Prefix Fine-Tuning (UPFT), a novel method to improve the reasoning capabilities of LLMs without relying on labeled data or computationally intensive sampling techniques. UPFT leverages the principle of Prefix Self-Consistency, which posits that the initial reasoning steps of diverse solution trajectories for a given problem tend to be highly consistent. By fine-tuning the LLM on these initial prefix substrings, the method aims to guide the model's inherent reasoning structures toward more systematic solutions.
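As a rough illustration of the training-data construction this implies, the sketch below generates a single reasoning trace per question and keeps only its opening tokens as the fine-tuning target. The model name, prompt template, and `prefix_len` are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of UPFT-style data construction: sample one reasoning trace per
# question and truncate it to a short prefix; no answer labels or reward
# signals are needed. Model, prompt, and prefix_len are illustrative
# assumptions, not the paper's exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def make_prefix_example(question: str, prefix_len: int = 32) -> dict:
    """Generate one reasoning trace and keep only its first tokens."""
    prompt = f"Question: {question}\nLet's reason step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    # A single low-temperature sample suffices: by Prefix Self-Consistency,
    # the earliest steps vary little across trajectories anyway.
    output = model.generate(**inputs, max_new_tokens=prefix_len,
                            do_sample=False)
    prefix = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    # Standard causal-LM fine-tuning pair: the loss is applied to the prefix.
    return {"prompt": prompt, "completion": prefix}
```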
The key contributions highlighted in the paper are:
- The identification of Prefix Self-Consistency.
- The introduction of the UPFT method.
- Empirical validation demonstrating UPFT's data efficiency and versatility.
The paper empirically validates the existence of Prefix Self-Consistency by demonstrating that early reasoning steps are highly consistent across trajectories and that errors predominantly occur in later reasoning steps. To demonstrate these characteristics, the authors conduct experiments using 500 questions randomly sampled from the PRM training dataset and report results on the math-specialized Qwen2.5-Math-7B-Instruct and the general-purpose Llama-3.1-8B-Instruct.
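A simple way to probe this property is to sample several trajectories per question and measure how often their opening tokens coincide. The sketch below is illustrative only: the whitespace tokenization, agreement metric, and example trajectories are assumptions, not the paper's measurement protocol.

```python
# Sketch: estimate how consistent the opening tokens of sampled reasoning
# trajectories are for a single question. The tokenization and metric here
# are illustrative assumptions, not the paper's protocol.
from collections import Counter

def prefix_agreement(trajectories: list[str], prefix_len: int = 6) -> float:
    """Fraction of trajectories sharing the most common opening-token prefix."""
    prefixes = [tuple(t.split()[:prefix_len]) for t in trajectories]
    most_common = Counter(prefixes).most_common(1)[0][1]
    return most_common / len(prefixes)

# Four hypothetical trajectories sampled at temperature > 0 for one question:
# the first steps largely agree, and divergence appears later.
trajs = [
    "First, note that 12 = 4 * 3, so we factor the expression ...",
    "First, note that 12 = 4 * 3, then divide both sides ...",
    "First, note that 12 = 4 * 3, so we factor and simplify ...",
    "We can instead list multiples of 12 directly ...",
]
print(prefix_agreement(trajs))  # 0.75: high agreement on the prefix
```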
The methodology section details the UPFT approach, beginning with a discussion of Rejection Sampling Fine-Tuning (RFT) as a baseline. The authors present a Bayesian framework to model the coverage and accuracy of reasoning traces. This perspective leads to the insight that maximizing the probability of a correct answer involves enhancing both the coverage of effective reasoning traces and the accuracy of each trace. The paper then introduces the concept of prefix spans, mathematically formalizing the trade-off between prefix coverage and prefix accuracy via the chain rule for conditional probabilities, which yields the lower bound (a short derivation sketch follows the definitions below):

$$\log p(y \mid x) \;\ge\; \mathbb{E}_{r_{<t} \sim p(\cdot \mid x)}\bigl[\mathcal{L}(r_{<t}, x)\bigr]$$

where:
- $p(y \mid x)$ is the conditional probability of answer $y$ given input $x$
- $r_{<t}$ denotes the prefix of reasoning trace $r$ before time step $t$
- $p(r_{<t} \mid x)$ is the distribution over prefixes
- $\mathcal{L}(r_{<t}, x)$ is the conditional lower bound given the prefix $r_{<t}$
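For concreteness, here is one way the bound can be reconstructed from the stated definitions, by marginalizing over prefixes and applying Jensen's inequality. The intermediate steps and the identification $\mathcal{L}(r_{<t}, x) = \log p(y \mid r_{<t}, x)$ are a reading of the definitions above, not quoted from the paper.

```latex
% Reconstruction sketch (assumption): marginalize the answer probability
% over prefixes, then apply Jensen's inequality to the concave logarithm,
% taking L(r_{<t}, x) := log p(y | r_{<t}, x).
\begin{align*}
  p(y \mid x)
    &= \sum_{r_{<t}} p(r_{<t} \mid x)\, p(y \mid r_{<t}, x)
     = \mathbb{E}_{r_{<t} \sim p(\cdot \mid x)}\bigl[\, p(y \mid r_{<t}, x) \,\bigr] \\
  \log p(y \mid x)
    &\ge \mathbb{E}_{r_{<t} \sim p(\cdot \mid x)}\bigl[\, \log p(y \mid r_{<t}, x) \,\bigr]
     = \mathbb{E}_{r_{<t} \sim p(\cdot \mid x)}\bigl[\, \mathcal{L}(r_{<t}, x) \,\bigr].
\end{align*}
```

Under this reading, the expectation over $p(r_{<t} \mid x)$ captures prefix coverage (which prefixes the model assigns mass to), while each $\mathcal{L}(r_{<t}, x)$ term captures prefix accuracy (how likely a prefix is to lead to the correct answer).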
The UPFT method maximizes reasoning trace coverage while maintaining accuracy by learning only from the prefix of each self-generated reasoning trace. To prevent catastrophic forgetting, the authors adopt a multi-task learning approach that mixes prefix-only training examples with a conventional SFT process on full traces, as sketched below.
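The following is a minimal sketch of how such a mixed training set could be assembled; the `structure_ratio` name, the per-example mixing rule, and the whitespace-based truncation are assumptions for illustration, not the paper's implementation.

```python
# Sketch: build a mixed dataset in which a fraction of examples keep only
# the reasoning prefix (UPFT-style) and the rest keep the full trace
# (conventional SFT), to guard against catastrophic forgetting. The mixing
# rule and truncation below are illustrative assumptions.
import random

def build_mixed_dataset(problems, sample_trace, structure_ratio=0.8,
                        prefix_len=32):
    """`sample_trace(question)` returns one self-generated reasoning trace."""
    dataset = []
    for question in problems:
        trace = sample_trace(question)
        if random.random() < structure_ratio:
            # Prefix-only example: keep the first few tokens of the trace.
            target = " ".join(trace.split()[:prefix_len])
        else:
            # Full-trace example: standard SFT target.
            target = trace
        dataset.append({"prompt": question, "completion": target})
    return dataset
```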
The paper details the experimental setup, including the selection of backbone LLMs (Llama-3.1-8B-Instruct, Qwen2.5-Math-7B-Instruct, and DeepSeek-R1-Distill-Qwen-7B), fine-tuning datasets (PRM, OMI2, LIMO, and U-Hard), benchmarks (GSM8K, MATH500, AIME24, and GPQA Diamond), and tuning scenarios (unsupervised and supervised sampling). The hyperparameter settings used for the experiments are also specified.
An ablation study evaluates the impact of prefix length and structure tuning ratio on reasoning accuracy. The results show that each model has its own optimal prefix length and structure tuning ratio for peak performance.
The paper presents results for both unsupervised and supervised fine-tuning scenarios. In the unsupervised setting, UPFT consistently outperforms conventional SFT across various datasets and models. The benefits of UPFT are more pronounced on complex reasoning tasks. In the supervised setting, UPFT achieves competitive performance with methods like RFT and V-STaR, while requiring significantly fewer tokens.
The paper also discusses related work in the areas of prompting techniques, verification of output trajectories, self-training, and self-improvement. The authors differentiate their approach by emphasizing the use of latent reasoning structures acquired during pre-training and the exploitation of Prefix Self-Consistency.
The conclusion summarizes the key findings, highlighting the potential of minimal unsupervised fine-tuning to improve LLM reasoning abilities without external supervision or extensive computational resources. Future research directions include applying UPFT to other tasks and further investigating the theoretical foundations of Prefix Self-Consistency.