Self-Supervised Alignment with Mutual Information (SAMI): Teaching Pretrained LLMs to Adhere to Behavioral Principles Without Human Labels
Introduction
Self-Supervised Alignment with Mutual Information (SAMI) is a method for teaching a pretrained language model (LM) to align with specified behavioral principles, or constitutions, through an iterative finetuning process. The approach forgoes human-generated preference labels, demonstrations, and direct oversight, sidestepping the complexity and resource demands of conventional alignment methods.
Methodology: SAMI
SAMI improves the alignment of LMs by increasing the conditional mutual information between constitutions (behavioral principles expressed in natural language) and model-generated responses. The method operates as an iterative loop with three primary stages (a schematic sketch follows the list):
- Principle Generation: An LM, referred to as the "principle writer," generates behavioral principles from which constitutions are sampled.
- Response Sampling: Sampled constitutions are paired with queries to elicit responses from the target LM, the model being finetuned.
- Optimization: A purpose-built loss maximizes the mutual information between responses and their corresponding constitutions, conditioned on the sampled queries.
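The sketch below illustrates one such iteration. It is a minimal, hypothetical rendering: `principle_writer`, `target_lm`, and the other helper names are stand-ins rather than the paper's implementation, and a real pipeline would replace the toy callables with LM sampling plus a gradient update on the contrastive loss described next.

```python
# Minimal sketch of one SAMI data-generation iteration.
# All names (principle_writer, target_lm, sample_constitution, ...) are
# illustrative stand-ins, not the paper's actual implementation.
import random

def sample_constitution(principles, k=3):
    # A constitution is a small bundle of sampled behavioral principles.
    return " ".join(random.sample(principles, k))

def sami_iteration(target_lm, principle_writer, queries, n_principles=8):
    # 1) Principle generation: the principle-writer LM drafts principles.
    principles = [principle_writer(f"Write a short behavioral principle ({i})")
                  for i in range(n_principles)]
    batch = []
    for query in queries:
        constitution = sample_constitution(principles)
        # 2) Response sampling: the target LM answers the query while
        #    conditioned on the sampled constitution.
        response = target_lm(constitution, query)
        batch.append((constitution, query, response))
    # 3) Optimization: the batch feeds the contrastive mutual-information
    #    loss (sketched in the next paragraph) that updates the target LM.
    return batch

# Toy usage with stand-in callables.
toy_writer = lambda prompt: f"Be concise and helpful. [{prompt}]"
toy_lm = lambda constitution, query: f"(response to '{query}' under '{constitution[:30]}...')"
print(sami_iteration(toy_lm, toy_writer, ["How do I boil an egg?"]))
```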
Central to the method is the maximization of a lower bound on this conditional mutual information, using an InfoNCE-style objective whose optimal critic yields a stable estimator. This is achieved without preference labels or demonstrations, a marked departure from approaches such as Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine-Tuning (SFT).
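A minimal sketch of such a contrastive objective appears below, under the assumption that the critic is the finetuned model's own sequence log-probability: `logprobs[i, j]` is assumed to hold the log-probability of response i under constitution j (with each row's query fixed). Minimizing the symmetric cross-entropy toward the diagonal maximizes an InfoNCE lower bound on the mutual information; this is a generic rendering of the technique, not a verbatim reproduction of SAMI's loss.

```python
# Sketch of an InfoNCE-style contrastive loss over a batch of n
# (constitution, response) pairs. logprobs[i, j] is assumed to be the
# target model's log-probability of response i conditioned on constitution j.
import torch
import torch.nn.functional as F

def contrastive_mi_loss(logprobs: torch.Tensor) -> torch.Tensor:
    n = logprobs.size(0)
    targets = torch.arange(n, device=logprobs.device)
    # Row-wise: each response should be most likely under the constitution
    # it was actually sampled with. Column-wise: each constitution should
    # best explain its own response.
    row_loss = F.cross_entropy(logprobs, targets)
    col_loss = F.cross_entropy(logprobs.t(), targets)
    return 0.5 * (row_loss + col_loss)

# Toy usage: a 4x4 matrix of differentiable scores.
scores = torch.randn(4, 4, requires_grad=True)
contrastive_mi_loss(scores).backward()
```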
Experimental Setup and Results
Datasets and Models:
- Datasets: Dialogue queries from HH-RLHF and summarization prompts from TL;DR.
- Models: SAMI finetuned mistral-7b and mixtral-8x7b, with constitutions written by both strong and weak principle-writer models, including claude-opus.
Performance Analysis:
- In head-to-head comparisons against both the initial and instruction-finetuned models, SAMI-trained LMs performed better on dialogue and summarization tasks.
- On dialogue, SAMI-trained models beat the initial model with win rates of 66% and 77%; on summarization, win rates reached up to 65% against strong baseline models.
These results demonstrate SAMI's ability to steer the base model's behavior distribution toward the desired principles, particularly under guidance from a strong principle writer.
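For context, a win rate here is the fraction of head-to-head comparisons in which a judge prefers the SAMI-trained model's response. A minimal sketch follows; `judge_prefers_sami` and the model callables are hypothetical stand-ins (e.g., the judge could be a strong LM asked to pick the better of two responses with positions randomized).

```python
# Sketch of a pairwise win-rate evaluation. The three callables are
# hypothetical stand-ins for the SAMI model, a baseline, and a judge.
def win_rate(prompts, sami_model, baseline_model, judge_prefers_sami):
    wins = 0
    for prompt in prompts:
        sami_response = sami_model(prompt)
        baseline_response = baseline_model(prompt)
        if judge_prefers_sami(prompt, sami_response, baseline_response):
            wins += 1
    return wins / len(prompts)
```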
Theoretical Implications and Practical Applications
SAMI highlights the latent potential of pretrained models to be aligned with specified behavioral principles without direct human intervention, exploiting statistical regularities that can be optimized through iterative self-supervised learning. Practically, SAMI offers a scalable, less resource-intensive alternative to traditional alignment methodologies, potentially broadening the deployment of LMs in real-world settings where adherence to ethical guidelines and user preferences is critical.
Future Directions
Further research could expand SAMI's applicability to more diverse sets of principles and queries and increase robustness against potential biases or alignment errors. Investigating additional regularization techniques to mitigate model degradation ("gibberish" outputs) and addressing length bias in model responses (one simple mitigation is sketched below) could further improve the method's effectiveness and reliability.
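One common way to temper length bias in objectives built on sequence log-probabilities is to normalize by response length before the scores enter the contrastive loss. The sketch below assumes per-token log-probabilities and a padding mask as inputs; it is an illustrative mitigation, not necessarily the one used in the paper.

```python
# Length-normalized sequence scores: dividing the summed log-probability by
# the number of response tokens keeps longer responses from dominating the
# contrastive scores purely because they contain more tokens.
import torch

def length_normalized_logprob(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # token_logprobs: (batch, seq_len) per-token log-probabilities
    # mask: (batch, seq_len), 1.0 for real response tokens, 0.0 for padding
    total = (token_logprobs * mask).sum(dim=-1)
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    return total / lengths
```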
Conclusion
SAMI charts an innovative path for LLM alignment by enabling LMs to adhere to behavioral principles autonomously. By removing the need for extensive human data labeling, it could facilitate broader, more ethical applications of LMs, representing a significant step forward in the domain of AI alignment.