Self-Supervised Alignment with Mutual Information (SAMI): Teaching Pretrained LLMs to Adhere to Behavioral Principles Without Human Labels
Introduction
Self-Supervised Alignment with Mutual Information (SAMI) is a method for teaching a pretrained language model (LM) to align with specified behavioral principles, or constitutions, through an iterative finetuning process. The approach forgoes human-generated preference labels, demonstrations, and direct oversight, sidestepping the complexity and resource demands of conventional alignment methods.
Methodology: SAMI
SAMI improves the alignment of LMs by increasing the conditional mutual information between constitutions (behavioral principles expressed in natural language) and model-generated responses. The method operates as an iterative loop with three primary stages (a schematic sketch follows the list):
- Principle Generation: An LM, referred to as the "principle writer," generates behavioral principles from which constitutions are sampled.
- Response Sampling: Sampled constitutions are paired with queries to elicit responses from the target LM, the model being finetuned.
- Optimization: A purpose-built loss maximizes the mutual information between responses and their corresponding constitutions, conditioned on the sampled queries.
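The sketch below illustrates one such iteration. It is a minimal, hypothetical rendering: `principle_writer`, `target_lm`, and the other helper names are stand-ins rather than the paper's implementation, and a real pipeline would replace the toy callables with LM sampling plus a gradient update on the contrastive loss described next.

```python
# Minimal sketch of one SAMI data-generation iteration.
# All names (principle_writer, target_lm, sample_constitution, ...) are
# illustrative stand-ins, not the paper's actual implementation.
import random

def sample_constitution(principles, k=3):
    # A constitution is a small bundle of sampled behavioral principles.
    return " ".join(random.sample(principles, k))

def sami_iteration(target_lm, principle_writer, queries, n_principles=8):
    # 1) Principle generation: the principle-writer LM drafts principles.
    principles = [principle_writer(f"Write a short behavioral principle ({i})")
                  for i in range(n_principles)]
    batch = []
    for query in queries:
        constitution = sample_constitution(principles)
        # 2) Response sampling: the target LM answers the query while
        #    conditioned on the sampled constitution.
        response = target_lm(constitution, query)
        batch.append((constitution, query, response))
    # 3) Optimization: the batch feeds the contrastive mutual-information
    #    loss (sketched in the next paragraph) that updates the target LM.
    return batch

# Toy usage with stand-in callables.
toy_writer = lambda prompt: f"Be concise and helpful. [{prompt}]"
toy_lm = lambda constitution, query: f"(response to '{query}' under '{constitution[:30]}...')"
print(sami_iteration(toy_lm, toy_writer, ["How do I boil an egg?"]))
```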
Central to the method is the maximization of a lower bound on this conditional mutual information, using an InfoNCE-style objective whose optimal critic yields a stable estimator. This is achieved without preference labels or demonstrations, a marked departure from approaches such as Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine-Tuning (SFT).
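A minimal sketch of such a contrastive objective appears below, under the assumption that the critic is the finetuned model's own sequence log-probability: `logprobs[i, j]` is assumed to hold the log-probability of response i under constitution j (with each row's query fixed). Minimizing the symmetric cross-entropy toward the diagonal maximizes an InfoNCE lower bound on the mutual information; this is a generic rendering of the technique, not a verbatim reproduction of SAMI's loss.

```python
# Sketch of an InfoNCE-style contrastive loss over a batch of n
# (constitution, response) pairs. logprobs[i, j] is assumed to be the
# target model's log-probability of response i conditioned on constitution j.
import torch
import torch.nn.functional as F

def contrastive_mi_loss(logprobs: torch.Tensor) -> torch.Tensor:
    n = logprobs.size(0)
    targets = torch.arange(n, device=logprobs.device)
    # Row-wise: each response should be most likely under the constitution
    # it was actually sampled with. Column-wise: each constitution should
    # best explain its own response.
    row_loss = F.cross_entropy(logprobs, targets)
    col_loss = F.cross_entropy(logprobs.t(), targets)
    return 0.5 * (row_loss + col_loss)

# Toy usage: a 4x4 matrix of differentiable scores.
scores = torch.randn(4, 4, requires_grad=True)
contrastive_mi_loss(scores).backward()
```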
Experimental Setup and Results
Datasets and Models:
- Datasets: Dialogue queries from HH-RLHF and summarization prompts from TL;DR.
- Models: SAMI finetuned mistral-7b and mixtral-8x7b, with constitutions written by both strong and weak principle-writer models, including claude-opus.
Performance Analysis:
- In head-to-head comparisons against both the initial and instruction-finetuned models, SAMI-trained LMs performed better on dialogue and summarization tasks.
- On dialogue, SAMI-trained models beat the initial model with win rates of 66% and 77%; on summarization, win rates reached up to 65% against strong baseline models.
These results demonstrate SAMI's ability to steer the base model's behavior distribution toward the desired principles, particularly under guidance from a strong principle writer.
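For context, a win rate here is the fraction of head-to-head comparisons in which a judge prefers the SAMI-trained model's response. A minimal sketch follows; `judge_prefers_sami` and the model callables are hypothetical stand-ins (e.g., the judge could be a strong LM asked to pick the better of two responses with positions randomized).

```python
# Sketch of a pairwise win-rate evaluation. The three callables are
# hypothetical stand-ins for the SAMI model, a baseline, and a judge.
def win_rate(prompts, sami_model, baseline_model, judge_prefers_sami):
    wins = 0
    for prompt in prompts:
        sami_response = sami_model(prompt)
        baseline_response = baseline_model(prompt)
        if judge_prefers_sami(prompt, sami_response, baseline_response):
            wins += 1
    return wins / len(prompts)
```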
Theoretical Implications and Practical Applications
SAMI highlights the latent potential of pretrained models to be aligned with specified behavioral principles without direct human intervention, exploiting statistical regularities that can be optimized through iterative self-supervised learning. Practically, SAMI offers a scalable, less resource-intensive alternative to traditional alignment methodologies, potentially broadening the deployment of LMs in real-world settings where adherence to ethical guidelines and user preferences is critical.
Future Directions
Further research could expand SAMI's applicability to more diverse sets of principles and queries and increase robustness against potential biases or alignment errors. Investigating additional regularization techniques to mitigate model degradation ("gibberish" outputs) and addressing length bias in model responses (one simple mitigation is sketched below) could further improve the method's effectiveness and reliability.
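One common way to temper length bias in objectives built on sequence log-probabilities is to normalize by response length before the scores enter the contrastive loss. The sketch below assumes per-token log-probabilities and a padding mask as inputs; it is an illustrative mitigation, not necessarily the one used in the paper.

```python
# Length-normalized sequence scores: dividing the summed log-probability by
# the number of response tokens keeps longer responses from dominating the
# contrastive scores purely because they contain more tokens.
import torch

def length_normalized_logprob(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # token_logprobs: (batch, seq_len) per-token log-probabilities
    # mask: (batch, seq_len), 1.0 for real response tokens, 0.0 for padding
    total = (token_logprobs * mask).sum(dim=-1)
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    return total / lengths
```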
Conclusion
SAMI charts an innovative path for LLM alignment by enabling LMs to adhere to behavioral principles autonomously. By removing the need for extensive human data labeling, it could facilitate broader, more ethical applications of LMs, representing a significant step forward in the domain of AI alignment.