Self-Refine: Iterative Refinement with Self-Feedback
The paper "Self-Refine: Iterative Refinement with Self-Feedback" introduces a novel technique aimed at enhancing the performance of LLMs during test time by iteratively refining the generated outputs. This method, termed Self-Refine, leverages the abilities of an LLM to provide feedback on its own generations and utilize this feedback to produce improved outputs over several iterations.
Methodology
Self-Refine operates through an iterative feedback loop involving three key steps:
- Initial Generation: An initial output is generated using the LLM.
- Feedback: The same LLM analyzes the initial output and provides feedback on specific aspects.
- Refinement: The LLM refines the output based on the provided feedback.
This process repeats until a predefined stopping criterion is met, such as a fixed number of iterations or a signal from the model that no further improvement is needed. Notably, Self-Refine requires no additional supervised training data or reinforcement learning; instead, it relies on in-context few-shot examples that show the model how to generate feedback and refine outputs.
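The loop can be written compactly. The following is a minimal Python sketch under simplified assumptions: `call_llm` is a placeholder for any chat-completion call, and the inline prompts are condensed paraphrases rather than the paper's few-shot prompts. In the paper, each step is also guided by task-specific few-shot examples, and the refinement prompt carries the history of earlier outputs and feedback, which the sketch omits for brevity.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError


def self_refine(task: str, max_iters: int = 4) -> str:
    # Step 1: initial generation from the task description alone.
    output = call_llm(f"Task: {task}\nProduce an initial answer.")

    for _ in range(max_iters):
        # Step 2: the same model critiques its own output.
        feedback = call_llm(
            f"Task: {task}\nAnswer: {output}\n"
            "Give specific, actionable feedback on this answer. "
            "If no changes are needed, reply with STOP."
        )
        # Stopping criterion: the feedback signals no further improvement.
        if "STOP" in feedback:
            break
        # Step 3: refine the output using the feedback.
        output = call_llm(
            f"Task: {task}\nAnswer: {output}\nFeedback: {feedback}\n"
            "Rewrite the answer, addressing each point in the feedback."
        )
    return output
```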
Evaluation
The authors evaluate Self-Refine on seven diverse tasks: dialogue response generation, code optimization, code readability improvement, math reasoning, sentiment reversal, acronym generation, and constrained generation. Strong base LLMs, including GPT-3.5, ChatGPT, and GPT-4, serve both as the one-step baselines and as the models within the iterative refinement loop.
Results
Across various tasks, Self-Refine consistently outperforms the baseline models:
- Dialogue Response Generation: With GPT-4, Self-Refine responses are preferred over one-step generations by an absolute margin of 49.2%.
- Code Optimization: Self-Refine raises the percentage of programs optimized by GPT-4 from 27.3% to 36.0%.
- Math Reasoning: Gains from self-generated feedback alone are modest; larger improvements in solve rates appear when external feedback signals are incorporated.
The method's robustness is demonstrated by consistent performance gains across different base models and tasks. Improvements are particularly notable on tasks that benefit from multi-aspect feedback, such as constrained generation and sentiment reversal, showing that Self-Refine can handle complex, nuanced outputs; a sketch of what aspect-by-aspect feedback can look like follows.
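As an illustration, the sketch below shows one way multi-aspect feedback might be assembled for constrained generation (producing a sentence that must mention a given set of concepts): a critique is requested per aspect and the results are concatenated for the refinement step. The aspect names and prompt wording are assumptions, not the paper's prompts, and `call_llm` is the placeholder defined in the earlier sketch.

```python
# Aspect-by-aspect critique for constrained generation; reuses the call_llm
# placeholder from the earlier sketch. Aspect names and wording are assumptions.
ASPECTS = {
    "coverage": "Does the sentence use every required concept? List any that are missing.",
    "fluency": "Is the sentence grammatical and natural? Point out awkward phrasing.",
    "commonsense": "Does the sentence describe a plausible scene? Note anything implausible.",
}


def multi_aspect_feedback(concepts: list[str], sentence: str) -> str:
    parts = []
    for name, question in ASPECTS.items():
        critique = call_llm(
            f"Concepts: {', '.join(concepts)}\nSentence: {sentence}\n{question}"
        )
        parts.append(f"{name}: {critique}")
    # The refinement step then receives the critique for all aspects at once.
    return "\n".join(parts)
```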
Analysis
The paper provides a comprehensive analysis revealing several key insights:
- Quality of Feedback: Targeted, actionable feedback significantly enhances model performance compared to generic or no-feedback conditions (contrasted in the sketch after this list).
- Iteration Importance: Initial iterations yield substantial improvements, although marginal gains diminish with each subsequent iteration.
- Model Capabilities: The success of Self-Refine is linked to the base model's ability to understand and generate high-quality feedback and follow iterative refinement processes.
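The feedback-quality finding can be made concrete by contrasting the conditions the ablation compares: specific, actionable feedback; a generic nudge; and no feedback at all. The prompt strings below are illustrative paraphrases rather than the paper's exact wording, and `call_llm` is again the placeholder from the first sketch.

```python
# The three feedback conditions compared in the feedback-quality analysis;
# prompt strings are illustrative paraphrases, not the paper's exact wording.
FEEDBACK_CONDITIONS = {
    "specific": "Point out concrete problems in the answer and how to fix each one.",
    "generic": "Improve the answer.",  # a nudge with no diagnosis
    "none": None,                      # skip the feedback step entirely
}


def refine_once(task: str, output: str, condition: str) -> str:
    instruction = FEEDBACK_CONDITIONS[condition]
    if instruction is None:
        # No feedback: the model simply re-generates the answer.
        return call_llm(f"Task: {task}\nAnswer: {output}\nWrite an improved answer.")
    feedback = call_llm(f"Task: {task}\nAnswer: {output}\n{instruction}")
    return call_llm(
        f"Task: {task}\nAnswer: {output}\nFeedback: {feedback}\nRewrite the answer."
    )
```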
A qualitative analysis further highlights instances where Self-Refine transforms suboptimal solutions into highly efficient ones through insightful feedback, exemplifying the method's capability to self-improve via iteration.
Implications and Future Work
The implications of Self-Refine extend beyond predefined benchmarks. The paper points to real-world applications such as improving website designs and other complex creative tasks, where iterative refinement mirrors human creative processes. The approach holds promise for enhancing LLM-assisted work across domains without additional data or training.
Future research directions could explore integrating more sophisticated feedback mechanisms, refining the stopping criteria, and extending the approach to other languages and less powerful models. Ensuring robustness against erroneous feedback and exploring mixed-model refinement strategies also constitute promising research avenues.
In conclusion, Self-Refine demonstrates that iterative self-feedback is a versatile and effective way to improve LLM performance on diverse tasks without additional supervised training or reinforcement learning. The method's simplicity and consistent gains underline its potential as a general tool for improving language generation.