- The paper introduces the Pairwise Cringe Loss, which extends a binary-feedback training method (the Cringe Loss) so that LLMs can be optimized directly on pairwise preference data.
- It gates the loss with a soft margin on the probability gap between the preferred and the rejected response, and in experiments it reduces repetition and improves generation quality, outperforming strong baselines on the AlpacaFarm benchmark.
- An iterative training procedure with a reward model continually refreshes the preference data, yielding a simple, efficient recipe for LLM alignment.
Introduction to Pairwise Cringe Loss
The field of LLM alignment has produced a range of methods for optimizing models against different kinds of feedback data. One established technique for handling binary feedback, in which individual model responses are labeled as good or bad, has now been extended to pairwise preferences, where one model response is chosen over another for the same input. This extension is the Pairwise Cringe Loss, a method that builds on the earlier binary-feedback strategy known as the Cringe Loss.
Binary Feedback and Its Extension
The original Cringe Loss was designed for binary feedback. It applies the standard likelihood training loss to positive examples and a contrastive loss to negative examples, lowering the likelihood of each rejected token by contrasting it against higher-scoring alternatives from the model's own predictions. Performance improves further with iterative training, in which the model labels new data that is folded back into the training set. Because pairwise preference data is far more common than binary labels for training LLMs, the method needed to be adapted. The result is the Pairwise Cringe Loss, which wraps the binary loss in a soft margin: a gate switches the loss on when the model does not yet prefer the chosen response over the rejected one by a sufficient probability margin, and switches it off once it does. The resulting loss therefore operates both at the level of whole sequences, through the margin gate, and at the level of individual token probabilities, through the Cringe term.
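To make the mechanics concrete, here is a minimal PyTorch sketch of a loss in this spirit. It is an illustration rather than the authors' implementation: the function names, the top-k contrastive sampling, the margin and temperature hyperparameters, and the exact form of the sigmoid gate are all assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, tokens):
    """Sum of per-token log-probabilities of `tokens` under `logits`.

    logits: (T, V) per-position scores; tokens: (T,) target token ids.
    """
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum()

def cringe_loss(neg_logits, neg_tokens, top_k=5):
    """Token-level contrastive penalty on a rejected sequence (sketch).

    For each rejected token, pick a competing token from the model's
    current top-k (excluding the rejected token itself) and push the
    competitor's score above the rejected token's score.
    """
    losses = []
    for t, y in enumerate(neg_tokens):
        scores = neg_logits[t]
        topk = torch.topk(scores, top_k + 1).indices
        candidates = topk[topk != y][:top_k]                   # drop the rejected token if present
        s = candidates[torch.randint(len(candidates), (1,))]   # sampled competitor
        pair = torch.stack([scores[s].squeeze(), scores[y]])
        # two-way cross-entropy over the pair of logits; label 0 = competitor should win
        losses.append(F.cross_entropy(pair.unsqueeze(0), torch.tensor([0])))
    return torch.stack(losses).mean()

def pairwise_cringe_loss(pos_logits, pos_tokens, neg_logits, neg_tokens,
                         margin=1.0, temperature=10.0):
    """Gate (CE on chosen response + Cringe on rejected response) by a soft margin.

    The sigmoid gate is close to 1 while the model does not yet prefer the
    chosen response by at least `margin` in log-probability, and fades
    toward 0 once it does.
    """
    ce_pos = F.cross_entropy(pos_logits, pos_tokens)    # standard loss on the chosen response
    cringe_neg = cringe_loss(neg_logits, neg_tokens)    # contrastive penalty on the rejected one
    gap = sequence_logprob(pos_logits, pos_tokens) - sequence_logprob(neg_logits, neg_tokens)
    gate = torch.sigmoid(temperature * (margin - gap)).detach()
    return gate * (ce_pos + cringe_neg)
```

In this sketch the gate is detached so it only scales the loss without receiving gradients, which keeps the example simple; the exact gating details in the paper may differ.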
In experiments, the Pairwise Cringe Loss was compared with the original binary Cringe Loss and with standard preference-optimization methods such as PPO and DPO. It was better at reducing repetition, a well-known failure mode of LLMs, and produced higher-quality generations. On the AlpacaFarm benchmark for instruction following, it surpassed several state-of-the-art methods. A key observation is that the method improves with iterative training: a reward model scores newly generated responses, the scores are used to build updated preference pairs, and the model is retrained on this data in subsequent iterations.
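The iterative procedure can be sketched as a simple loop. The helper names here (model.generate, reward_model.score, train_with_pairwise_cringe) and the best-versus-worst pairing rule are hypothetical placeholders, used only to illustrate how data flows between generation, reward scoring, and retraining.

```python
def iterative_training(model, reward_model, prompts, num_iterations=2,
                       samples_per_prompt=4):
    """Sketch of iterative preference-data refresh with a reward model."""
    for _ in range(num_iterations):
        pairs = []
        for prompt in prompts:
            # Sample several candidate responses from the current model.
            responses = [model.generate(prompt) for _ in range(samples_per_prompt)]
            scored = sorted(responses, key=lambda r: reward_model.score(prompt, r))
            # Highest-scoring response becomes the "chosen" example,
            # lowest-scoring one the "rejected" example.
            pairs.append((prompt, scored[-1], scored[0]))
        # Fine-tune on the newly constructed preference pairs.
        model = train_with_pairwise_cringe(model, pairs)
    return model
```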
The primary takeaway is that the Pairwise Cringe Loss is a meaningful advance for training LLMs to follow instructions. The method is simple and efficient, yet performs robustly against leading alternatives. It is also adaptable: the binary Cringe Loss and the Pairwise Cringe Loss can be combined so that binary and pairwise feedback are used together on diverse data. The Pairwise Cringe Loss thus stands as a compelling candidate for future LLM training and alignment work.