Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (2312.06585v4)

Published 11 Dec 2023 in cs.LG

Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.

Introduction

Language models (LMs) are traditionally improved by fine-tuning on a corpus of human-generated data. However, the availability and quality of such data are a limiting factor in model development. This paper investigates an alternative approach that trains on model-generated data filtered with scalar feedback, such as binary correctness indicators on math problems.

Self-Training with ReST-EM

The core of the paper is a self-training method, dubbed ReST-EM, grounded in the well-established expectation-maximization (EM) algorithm. Each iteration has two steps: generate samples from the model and filter them using the provided feedback, then fine-tune the model on the filtered samples. Repeating this process lets each round build on the improvements of the previous one; a minimal sketch of the loop follows.
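To make the loop concrete, here is a minimal Python sketch under stated assumptions: `sample`, `reward`, and `finetune` are hypothetical callables standing in for the model's sampling routine, the binary correctness check, and supervised fine-tuning; they are not part of the paper's released code.

```python
from typing import Callable, Iterable, List, Tuple

def rest_em(
    base_model,
    problems: Iterable[str],
    sample: Callable,    # sample(model, problem, n) -> list of candidate solutions
    reward: Callable,    # reward(problem, solution) -> 0 or 1 (binary feedback)
    finetune: Callable,  # finetune(model, dataset) -> fine-tuned model
    num_iterations: int = 3,
    samples_per_problem: int = 32,
):
    """Sketch of a ReST-EM style generate/filter/fine-tune loop."""
    model = base_model
    for _ in range(num_iterations):
        # Generate (E-step): sample candidate solutions from the current model
        # and keep only those that receive positive binary feedback.
        dataset: List[Tuple[str, str]] = []
        for problem in problems:
            for solution in sample(model, problem, samples_per_problem):
                if reward(problem, solution) == 1:
                    dataset.append((problem, solution))
        # Improve (M-step): fine-tune on the filtered, model-generated data.
        # ReST-EM restarts fine-tuning from the base model each iteration
        # rather than continuing from the previous checkpoint.
        model = finetune(base_model, dataset)
    return model
```

Because generation and fine-tuning are separate phases, sampling can be batched offline at scale, which is the decoupling discussed under Preliminaries and Methodology below.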

Applied to challenging math reasoning and coding problems across several PaLM-2 model scales, ReST-EM scales favorably with model size and significantly outperforms fine-tuning on human-written data, suggesting that reliance on human-generated datasets in LLM training can be reduced.

Preliminaries and Methodology

An autoregressive LLM predicts a text sequence one token at a time. Reinforcement learning (RL) approaches tune such a model against a reward function, but fine-tuning via online RL is computationally expensive. ReST-EM sidesteps much of this cost by decoupling data generation from policy optimization, which makes the procedure easier to scale.
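Concretely, under this EM view the Improve step reduces to reward-weighted maximum likelihood. With $x$ a problem drawn from the dataset $\mathcal{D}$, $y$ a solution sampled from the previous iteration's model $p_{\theta_t}$, and $r(x, y) \in \{0, 1\}$ the binary feedback, the objective can be written, as a sketch of the standard EM-for-RL formulation the paper builds on, as

$$\mathcal{J}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim p_{\theta_t}(\cdot \mid x)} \big[\, r(x, y)\, \log p_{\theta}(y \mid x) \,\big],$$

so the Generate step estimates the inner expectation with filtered samples, and the Improve step is simply supervised fine-tuning on those samples.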

The paper details the ReST-EM algorithm, emphasizing the alternation of Generate and Improve steps that progressively refine the policy producing the model's outputs. The key idea is to fine-tune the model only on its own samples that the reward function judges correct, which can yield self-improvement over successive iterations; an example of such a reward is sketched below.
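The binary reward itself is task-specific: for math it amounts to checking the final answer against a reference, and for coding it amounts to running test cases. The following is a hypothetical illustration only; the `math_reward` name, the answer format, and the regex are assumptions for this sketch, not the paper's actual checker.

```python
import re

def math_reward(solution: str, reference_answer: str) -> int:
    """Hypothetical binary reward: 1 if the solution's final answer matches the
    reference answer, else 0. Assumes solutions end with 'The answer is <value>'."""
    match = re.search(r"The answer is\s*(.+?)\s*\.?\s*$", solution.strip(), re.IGNORECASE)
    if match is None:
        return 0
    return int(match.group(1).strip() == reference_answer.strip())
```

For APPS-style coding problems, the analogous reward would execute the generated program against the provided test cases and return 1 only if all of them pass.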

Empirical Findings and Analysis

ReST-EM yields significant gains on challenging math problem-solving (MATH) and code-generation (APPS) benchmarks. The gains compound when the procedure is applied iteratively, although later iterations can begin to overfit the training problems, so performance must be monitored across rounds.

The paper further shows that the fine-tuned models not only excel on their training tasks but also transfer to related held-out tasks. Ablation studies suggest that ReST-EM's effectiveness grows with the number of model-generated solutions per problem and with the number of training problems. Together, the findings underline the method's potential as an efficient way to improve LLMs.

Conclusion

The work presents an effective technique for enhancing LMs without heavy reliance on human data. ReST-EM stands out as a promising route for LLM advancement, potentially alleviating the bottleneck of scarce high-quality data. The paper lays groundwork for future work on LLM self-improvement that further reduces dependence on human data and improves computational efficiency.

Authors (41)
  1. Avi Singh
  2. John D. Co-Reyes
  3. Rishabh Agarwal
  4. Ankesh Anand
  5. Piyush Patil
  6. Peter J. Liu
  7. James Harrison
  8. Jaehoon Lee
  9. Kelvin Xu
  10. Aaron Parisi
  11. Abhishek Kumar
  12. Alex Alemi
  13. Alex Rizkowsky
  14. Azade Nova
  15. Ben Adlam
  16. Bernd Bohnet
  17. Hanie Sedghi
  18. Igor Mordatch
  19. Isabelle Simpson
  20. Izzeddin Gur