STaR: Bootstrapping Reasoning With Reasoning (2203.14465v2)

Published 28 Mar 2022 in cs.LG, cs.AI, and cs.CL

Abstract: Generating step-by-step "chain-of-thought" rationales improves LLM performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing LLM rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art LLM on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.

Self-Taught Reasoner: Bootstrapping LLM Reasoning Capabilities

The paper "STaR: Bootstrapping Reasoning With Reasoning" by Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman presents a method for improving the reasoning abilities of pre-trained LLMs. Because generating step-by-step "chain-of-thought" rationales has been shown to improve performance on complex reasoning tasks, STaR combines a small initial set of rationale examples with a large dataset that has no rationales, and uses both to iteratively improve a model's reasoning capability.

Methodological Overview

The core insight of STaR is an iterative bootstrapping mechanism that enhances a model's ability to reason by fine-tuning it on rationales it generates for itself. The process involves:

Few-shot prompting the model with a small number of rationale examples.
Generating rationales and answers for every question in a large dataset.
Rationalizing: for questions the model answered incorrectly, generating a new rationale with the correct answer included in the prompt.
Keeping only the rationales that led to correct answers and fine-tuning the model on them.

The algorithm iteratively repeats these steps, thereby progressively improving the model's reasoning capabilities without extensive manual annotation of rationale datasets.
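The loop described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: `Example`, `star_iteration`, `star`, `make_toy_model`, and `toy_finetune` are hypothetical names, and the "model" is a toy stand-in (a callable that takes a question and an optional answer hint) rather than a real LLM. The sketch also assumes each round fine-tunes from the base model rather than from the previous round's model; treat that as an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Example:
    question: str
    answer: str


# A generation is (rationale, predicted_answer).
Generation = Tuple[str, str]
# Hypothetical model interface: (question, optional answer hint) -> Generation.
Model = Callable[[str, Optional[str]], Generation]


def star_iteration(model: Model, dataset: List[Example]) -> List[Tuple[str, str, str]]:
    """Collect (question, rationale, answer) triples whose answer is correct."""
    kept = []
    for ex in dataset:
        rationale, pred = model(ex.question, None)
        if pred != ex.answer:
            # Rationalization: retry with the correct answer given as a hint.
            rationale, pred = model(ex.question, ex.answer)
        if pred == ex.answer:
            # Filter: only rationales that end in the correct answer are kept.
            kept.append((ex.question, rationale, ex.answer))
    return kept


def star(base_model: Model, finetune, dataset: List[Example], n_iterations: int) -> Model:
    """Repeat generate -> rationalize -> filter -> fine-tune."""
    model = base_model
    for _ in range(n_iterations):
        kept = star_iteration(model, dataset)
        # Assumption of this sketch: fine-tune from the base model each round.
        model = finetune(base_model, kept)
    return model


def make_toy_model(known: Set[str]) -> Model:
    """Toy stand-in for an LLM: solves 'a+b' only for questions it 'knows'."""
    def model(question: str, hint: Optional[str] = None) -> Generation:
        if question in known:
            a, b = map(int, question.split("+"))
            return (f"{a} plus {b} is {a + b}", str(a + b))
        if hint is not None:
            return (f"the answer must be {hint}", hint)
        return ("not sure", "?")
    return model


def toy_finetune(base_model: Model, kept) -> Model:
    """Toy 'fine-tuning': the new model knows every question it was trained on."""
    return make_toy_model({question for question, _, _ in kept})


if __name__ == "__main__":
    data = [Example("1+1", "2"), Example("2+3", "5"), Example("10+5", "15")]
    base = make_toy_model({"1+1"})
    final = star(base, toy_finetune, data, n_iterations=2)
    print(final("2+3", None))
```

In a real setting, `model` would wrap few-shot prompted sampling from the LLM and `finetune` would run a training job on the kept triples; the control flow is the part this sketch is meant to show.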

Technical Insights and Results

Experimental Protocol and Models

The authors used GPT-J (6B parameters) as the base LLM for their experiments, utilizing datasets from various domains including arithmetic, commonsense reasoning, and grade school math word problems. The rationale generation bootstrapping technique was evaluated against conventional few-shot and direct fine-tuning baselines.

Performance Evaluation

STaR demonstrated significant improvements across different datasets:

Arithmetic Problems: STaR reached a final accuracy of 89.5%, compared with 76.3% for the baseline. Rationalization notably accelerated training and helped the model handle increasingly complex problems.
CommonsenseQA (CQA): STaR with rationalization reached 72.5%, comparable to a fine-tuned GPT-3 model roughly 30× larger, and better than the few-shot and direct fine-tuning baselines.
GSM8K Word Problems: STaR reached 10.7% accuracy, higher than both the few-shot and direct fine-tuning baselines.

Mechanistic Insights

STaR can be viewed through the lens of reinforcement learning (RL), where rationale generation can be seen as sampling from a policy to maximize correct answers. Rationalization enriches this process by enabling the model to explore alternative rationale distributions conditioned on correct answers, thus improving problem-solving capabilities iteratively.
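One way to write this view down, using notation introduced here for illustration ($x_i$ for a question, $y_i$ for its correct answer, $\hat{r}_i$ and $\hat{y}_i$ for the sampled rationale and answer, $p_M$ for the model's distribution), is as an expected-reward objective with the standard score-function gradient:

```latex
\begin{align}
J(M) &= \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
        \left[ \mathbb{1}(\hat{y}_i = y_i) \right], \\
\nabla J(M) &= \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
        \left[ \mathbb{1}(\hat{y}_i = y_i)\,
        \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right].
\end{align}
```

The indicator $\mathbb{1}(\hat{y}_i = y_i)$ zeroes out the gradient for every sample with a wrong answer, which corresponds to the filtering step: only rationales that lead to correct answers contribute to the update. Fine-tuning on the filtered samples can then be read as an approximation to following this gradient.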

Implications and Future Directions

The results point to several practical and theoretical implications:

Generalization Across Domains: The technique shows potential for generalizing LLM reasoning capabilities across diverse domains without extensive manual tuning.
Self-Improving Systems: STaR exemplifies a methodology by which LLMs can autonomously enhance their reasoning skills, offering a scalable path for continuous model improvement.
Benchmark Outperformance: The approach consistently outperforms traditional few-shot and direct fine-tuning baselines, suggesting that iterative fine-tuning on self-generated rationales can yield significant benefits.

Conclusion

STaR is a simple yet effective technique for improving the reasoning capability of LLMs through iterative bootstrapping. By generating rationales and learning from those that lead to correct answers, the model improves on several reasoning tasks, with substantial performance gains and no need for large manually annotated rationale datasets. The approach holds promise for building more robust, self-improving AI systems and opens avenues for future research into automated reasoning augmentation.