Introduction
Improving large language models (LLMs) has traditionally relied on corpora of human-generated data for fine-tuning, boosting their performance on various tasks. However, the availability and quality of such data are a limiting factor in model development. This paper investigates an alternative approach that uses model-generated data filtered with scalar feedback, such as binary correctness signals on math problems.
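As a concrete illustration, a binary reward of this kind can simply check whether a generated solution's final answer matches a reference. The sketch below is hypothetical rather than the paper's grader: extract_final_answer and binary_reward are illustrative helpers, and the regex-based answer extraction is a deliberate simplification.

```python
import re

def extract_final_answer(solution: str):
    """Naively take the last number in the solution as its final answer.
    Real graders for competition math normalize answers more carefully."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return matches[-1] if matches else None

def binary_reward(solution: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference, else 0.0."""
    predicted = extract_final_answer(solution)
    return float(predicted is not None and predicted == reference_answer.strip())
```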
Self-Training with ReST-EM
The core of the paper is a self-training method, dubbed ReST-EM, grounded in the well-established expectation-maximization (EM) algorithm. Each iteration involves two steps: generating candidate samples with the model and filtering them using the provided feedback, then fine-tuning the model on the filtered samples. The process iterates, building on the improved model from each round; a rough sketch of the loop follows.
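The sketch below shows the structure of the Generate and Improve steps, not the paper's exact implementation. It assumes hypothetical sample_solutions, fine_tune, and reward_fn callables rather than any particular library API.

```python
from typing import Callable, Iterable, List, Tuple

def rest_em_loop(
    base_model,
    problems: Iterable,               # objects with .prompt and .reference_answer
    sample_solutions: Callable,       # (model, prompt, n) -> List[str]
    fine_tune: Callable,              # (base_model, List[Tuple[str, str]]) -> new model
    reward_fn: Callable,              # (solution, reference_answer) -> float
    num_iterations: int = 3,
    samples_per_problem: int = 32,
):
    """Sketch of an iterative generate/filter/fine-tune loop in the spirit of ReST-EM."""
    model = base_model
    for _ in range(num_iterations):
        # Generate step: sample candidate solutions with the current model and keep
        # only those that earn a positive (e.g. binary correctness) reward.
        filtered: List[Tuple[str, str]] = []
        for problem in problems:
            for solution in sample_solutions(model, problem.prompt, samples_per_problem):
                if reward_fn(solution, problem.reference_answer) > 0:
                    filtered.append((problem.prompt, solution))
        # Improve step: fine-tune on the filtered, self-generated data. Restarting
        # from the base model each iteration is one way to limit drift.
        model = fine_tune(base_model, filtered)
    return model
```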
When applied to challenging math reasoning and coding problems across multiple scales of the PaLM 2 model, the results indicate that ReST-EM is a scalable method that significantly outperforms fine-tuning on human-written data alone. This suggests a potential for reduced reliance on human-generated datasets in LLM training.
Preliminaries and Methodology
In essence, an autoregressive LLM predicts text sequences one token at a time. Reinforcement learning (RL) approaches typically tune such models against a reward function, but fine-tuning a model via online RL is computationally expensive. ReST-EM addresses this by decoupling data generation from policy optimization, which makes the procedure easier to scale.
The paper details the ReST-EM algorithm, emphasizing the iterative Generate and Improve steps of self-training, which refine the policy, i.e., the language model that produces the output samples. The key idea is to fine-tune the model on its own outputs, weighted by a reward that reflects their quality, which can drive self-improvement over successive iterations.
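Under this EM view, the Improve step can be read as maximizing a reward-weighted log-likelihood of the sampled outputs. Below is a rough PyTorch sketch of that per-sample objective; improve_step_loss is an illustrative name, and with a binary reward the weighting simply drops incorrect samples, reducing the step to supervised fine-tuning on the kept ones.

```python
import torch.nn.functional as F

def improve_step_loss(logits, target_ids, reward):
    """Reward-weighted negative log-likelihood for one sampled solution.

    logits:     [seq_len, vocab_size] model scores over the sampled tokens
    target_ids: [seq_len] token ids of the sampled solution
    reward:     scalar feedback, e.g. 1.0 for a correct answer, 0.0 otherwise
    """
    token_nll = F.cross_entropy(logits, target_ids, reduction="sum")
    return reward * token_nll
```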
Empirical Findings and Analysis
ReST-EM achieves significant performance gains on challenging benchmarks in math problem-solving and code generation. The method is most effective when applied iteratively, with substantial improvements in each cycle, although overfitting must be monitored as iterations accumulate.
The paper further shows that the fine-tuned models not only excel on their training tasks but also transfer to related ones. Ablation studies suggest that ReST-EM's effectiveness scales with the number of model-generated solutions per problem and the number of training problems. These findings underline the method's potential as an efficient way to improve LLMs.
Conclusion
In conclusion, the work presents a compelling technique for enhancing LLMs without heavy reliance on human data. ReST-EM stands out as a promising route for LLM advancement, potentially alleviating the bottleneck of scarce high-quality data. The paper lays a foundation for future work on the self-improvement of LLMs, aiming to further reduce dependence on human data and improve computational efficiency.