Self-Boosting LLMs with Synthetic Preference Data
The paper introduces SynPO, a self-boosting approach that improves LLMs with synthetic preference data. By having the model generate its own high-quality training pairs, SynPO reduces reliance on costly human-curated preference datasets while still aligning model outputs with human preferences.
Key Methodological Contributions
SynPO addresses the scarcity of high-quality preference data through a self-boosting paradigm built around two components, a self-prompt generator and a response improver, which are combined in an iterative training loop.
- Self-Prompt Generator: SynPO trains a prompt generator from a small set of seed data, removing the need for human annotation or assistance from external models. Given a few randomly sampled keywords, the generator combines them into diverse, high-quality synthetic prompts.
- Response Improver: The response improver uses the model itself to refine its own outputs. Starting from the model's initial completion, it produces an improved version that serves as the chosen response in the synthetic preference pair, with the initial completion as the rejected one. This strategy capitalizes on the model's ability to detect gaps between its own output distribution and the desired one and to adjust its responses incrementally.
- Iterative Training and Preference Optimization: SynPO runs in rounds: each round uses synthetic prompts and refined responses to build preference pairs, optimizes the model on them, and then regenerates data with the improved model (a minimal sketch of this loop follows the list). Iterating this loop substantially improves instruction-following ability and overall task performance without requiring large annotated preference datasets.
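To make the data flow concrete, the sketch below outlines one SynPO-style round in Python. It is illustrative only: the callables `gen_prompt`, `respond`, and `improve` are hypothetical stand-ins for the self-prompt generator, the current policy model, and the response improver, and the paper's actual prompt templates, filtering, and preference-optimization step are not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for the three roles the policy model plays in SynPO:
#   keywords             -> synthetic prompt     (self-prompt generator)
#   prompt               -> initial completion   (current policy)
#   (prompt, completion) -> refined completion   (response improver)
PromptGenerator = Callable[[List[str]], str]
Responder = Callable[[str], str]
Improver = Callable[[str, str], str]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the refined response
    rejected: str  # the model's initial completion


def synpo_round(
    keyword_pool: List[str],
    gen_prompt: PromptGenerator,
    respond: Responder,
    improve: Improver,
    n_pairs: int = 8,              # illustrative; real runs use far more pairs
    keywords_per_prompt: int = 3,
) -> List[PreferencePair]:
    """Build one round of synthetic preference data.

    The refined output is treated as the 'chosen' response and the initial
    completion as the 'rejected' one; the resulting pairs would then feed a
    preference-optimization step (e.g., a DPO-style update) before the next
    round regenerates data with the improved model.
    """
    pairs: List[PreferencePair] = []
    for _ in range(n_pairs):
        keywords = random.sample(keyword_pool, keywords_per_prompt)
        prompt = gen_prompt(keywords)       # self-prompt generator
        draft = respond(prompt)             # current policy's completion
        refined = improve(prompt, draft)    # response improver
        pairs.append(PreferencePair(prompt, chosen=refined, rejected=draft))
    return pairs
```

In practice each callable would wrap the same fine-tuned model behind a different prompt template, and the resulting prompt/chosen/rejected triples would be converted to the format expected by whichever preference-optimization trainer is used before the next iteration.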
Experimental Results
The paper reports substantial improvements on benchmarks such as AlpacaEval 2.0 and Arena-Hard. After four SynPO iterations, Mistral-7B and Llama3-8B achieve win-rate improvements of over 22% on these benchmarks relative to their initial versions, along with an average score increase of 3.2 to 5.0 points on the Open LLM Leaderboard.
Notably, the self-boosting mechanism also mitigates the "alignment tax" seen in other LLM alignment strategies, where alignment training degrades general capabilities. By refining model outputs internally rather than depending on external feedback, SynPO allows the model to maintain, and even enhance, its generalist capabilities across diverse tasks.
Implications and Future Directions
The results suggest that SynPO could meaningfully change how LLMs are trained, offering a cost-effective route to continuous model improvement without extensive human intervention. The approach gives models a way to adapt to new tasks by learning from synthetic preferences, reducing the amount of human-labeled data that alignment currently requires.
Future research might explore combining SynPO with on-policy optimization techniques to further refine model performance, and evaluating SynPO-trained models across a wider range of downstream applications would clarify the approach's broader impact.
Overall, SynPO represents a significant advancement in the field of LLM alignment and exemplifies the potential of synthetic data to drive AI innovation.