Reflection-Tuning: Data Recycling for Enhanced Instruction-Tuning of LLMs
The paper "Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning" introduces an innovative approach to enhance the instruction-following capabilities of LLMs through a method called reflection-tuning. This method utilizes an LLM's intrinsic self-improvement and judging abilities to recycle and refine the existing instruction-tuning datasets, thereby improving the overall quality of the data used for training.
Methodology
The core idea behind reflection-tuning is a two-phase process: instruction reflection and response reflection. The authors propose using an oracle model (e.g., ChatGPT) to reflect on and refine instruction-response pairs from an original dataset according to specific reflection criteria. These criteria guide improvements to the complexity and relevance of both the instructions and the responses.
During the instruction reflection phase, the oracle model evaluates the instruction-response pairs against defined criteria such as topic complexity and the level of detail required. The model then generates an improved pair by considering the specific feedback it has provided during the evaluation.
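The following is a minimal sketch of how such an instruction-reflection step might be orchestrated, assuming a generic chat-completion wrapper `call_oracle(prompt) -> str` around the oracle model (e.g., ChatGPT). The criteria, prompt wording, output markers, and helper names are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative instruction-reflection phase: critique a pair against criteria,
# then rewrite it conditioned on the oracle's own feedback.

INSTRUCTION_CRITERIA = [
    "complexity of the topic",
    "level of detail required to answer",
]

def parse_pair(text: str) -> tuple[str, str]:
    # Naively split the oracle output into (instruction, response),
    # assuming the oracle was asked to use the markers below.
    head, _, tail = text.partition("[NEW ANSWER]")
    return head.replace("[NEW INSTRUCTION]", "").strip(), tail.strip()

def reflect_instruction(call_oracle, instruction: str, response: str) -> tuple[str, str]:
    # Step 1: ask the oracle to critique the pair against the criteria.
    critique_prompt = (
        "Evaluate the following instruction and answer against these criteria: "
        + "; ".join(INSTRUCTION_CRITERIA) + ".\n"
        f"Instruction: {instruction}\nAnswer: {response}\n"
        "Give specific feedback for each criterion."
    )
    feedback = call_oracle(critique_prompt)

    # Step 2: ask the oracle to rewrite the pair, conditioned on that feedback.
    rewrite_prompt = (
        f"Feedback:\n{feedback}\n\n"
        "Rewrite the instruction so it addresses the feedback, then answer it.\n"
        "Format the output as [NEW INSTRUCTION] ... [NEW ANSWER] ...\n"
        f"Original instruction: {instruction}"
    )
    return parse_pair(call_oracle(rewrite_prompt))
```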
Similarly, in response reflection, the model re-evaluates the response generated in the previous phase based on criteria such as helpfulness, relevance, and accuracy. The result is a refined response, which, together with the modified instruction, forms the recycled dataset used for instruction-tuning the LLM.
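Continuing the sketch above, response reflection can follow the same critique-then-rewrite pattern with response-oriented criteria, and the two phases together yield the recycled dataset. Function names and prompts remain illustrative assumptions.

```python
# Illustrative response-reflection phase plus the full recycling loop.

RESPONSE_CRITERIA = ["helpfulness", "relevance", "accuracy", "level of detail"]

def reflect_response(call_oracle, instruction: str, response: str) -> str:
    critique_prompt = (
        f"Instruction: {instruction}\nAnswer: {response}\n"
        "Evaluate the answer against these criteria: "
        + ", ".join(RESPONSE_CRITERIA) + ". Give specific feedback for each."
    )
    feedback = call_oracle(critique_prompt)
    rewrite_prompt = (
        f"Instruction: {instruction}\nOriginal answer: {response}\n"
        f"Feedback:\n{feedback}\n\n"
        "Write an improved answer that addresses the feedback."
    )
    return call_oracle(rewrite_prompt)

def recycle_dataset(call_oracle, dataset):
    # Apply both reflection phases to every (instruction, response) pair.
    recycled = []
    for instruction, response in dataset:
        new_inst, new_resp = reflect_instruction(call_oracle, instruction, response)
        recycled.append((new_inst, reflect_response(call_oracle, new_inst, new_resp)))
    return recycled
```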
Experimental Results
In extensive experiments, models trained on the recycled datasets demonstrated superior performance across multiple benchmarks. Notably, the recycled models outperformed counterparts trained on the unmodified Alpaca and WizardLM datasets. For instance, the recycled WizardLM 7B model achieved the highest win rate among the compared open-source 7B models on the Alpaca-Eval leaderboard, and on the Vicuna test set the recycled models reached win rates of 88.75% and 81.25%.
Further analysis reveals measurable improvements across several quality metrics, including the coherence between instructions and responses, the level of detail in responses, and the instruction-following difficulty (IFD) score. These findings underscore the efficacy of reflection-tuning in generating high-quality instruction-tuning data.
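As a concrete illustration of the last metric, the IFD score from the authors' related data-selection work can be approximated as the ratio of a model's loss on the response when conditioned on the instruction to its loss on the response alone. The sketch below assumes that definition and uses a placeholder Hugging Face model and prompt format; it is not the paper's evaluation code.

```python
# Rough IFD-style score: loss(response | instruction) / loss(response alone).
# "gpt2" is a placeholder; the paper's analysis uses different models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_loss(prefix: str, response: str) -> float:
    # Average cross-entropy over the response tokens, optionally conditioned
    # on a prefix whose tokens are masked out of the loss.
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids if prefix else None
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    if prefix_ids is not None:
        input_ids = torch.cat([prefix_ids, response_ids], dim=1)
        n_prefix = prefix_ids.shape[1]
    else:
        input_ids = response_ids
        n_prefix = 0
    labels = input_ids.clone()
    labels[:, :n_prefix] = -100  # ignore prefix tokens in the loss
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return loss.item()

def ifd_score(instruction: str, response: str) -> float:
    # Higher values mean the instruction provides little help in
    # predicting the response, i.e. the sample is harder to follow.
    conditioned = response_loss(instruction + "\n", response)
    unconditioned = response_loss("", response)
    return conditioned / unconditioned
```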
Implications and Future Directions
The reflection-tuning method represents a promising approach to address the challenges of data quality in instruction tuning for LLMs. By autonomously refining existing datasets, this method circumvents the need for exhaustive manual curation or additional model training, offering a scalable solution adaptable to various LLM architectures. The enhanced performance of models trained on recycled data highlights the potential for reflection-tuning to improve the robustness and reliability of LLM outputs, thereby increasing the models' practical applicability in diverse natural language generation tasks.
Future research could explore the integration of reflection-tuning with emerging model architectures and training paradigms, potentially investigating its effects in asymmetric instruction settings or incorporating it with Reinforcement Learning from Human Feedback (RLHF) approaches. The flexibility of this method suggests its applicability in optimizing not just LLMs but other AI systems reliant on instruction-tuning processes.
In conclusion, reflection-tuning provides a significant advancement in the field of LLM instruction tuning, emphasizing the utility of high-quality data recycling in enhancing model instruction-following capabilities. This approach offers an effective strategy for overcoming the limitations posed by low-quality datasets, thus paving the way for more reliable and efficient LLMs.