- The paper introduces CLAIR (Contrastive Learning from AI Revisions), a data-creation method that builds minimally contrasting preference pairs by revising model outputs, yielding a clearer training signal for LLM alignment.
- The paper presents APO (Anchored Preference Optimization), a family of alignment objectives that gives explicit control over training dynamics and outperforms conventional objectives such as DPO.
- The paper demonstrates a 7.65% improvement on the MixEval-Hard benchmark with Llama-3-8B-Instruct, reducing the gap with GPT-4-turbo by 45%.
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
The paper, "Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment," examines how both preference data and alignment objectives shape the alignment of LLMs. It identifies underspecification in each of these components and addresses it with two primary contributions: Contrastive Learning from AI Revisions (CLAIR) and Anchored Preference Optimization (APO).
Summary of Findings
The research focuses on two main aspects of the alignment process:
- Contrastiveness of Preference Data: Conventional preference datasets often contain spurious differences between the preferred and dispreferred outputs, making it hard for the model to isolate the signal it should learn from. CLAIR instead produces minimally contrasting pairs: a model output is revised to improve clarity, correctness, and engagement, and the revision is paired against the original, providing a more precise learning signal (see the sketch after this list).
- Training Dynamics of Alignment Objectives: Objectives such as Direct Preference Optimization (DPO) constrain only the margin between the winning and losing outputs, so training dynamics can be ambiguous; the likelihood of the winning output may fall, or that of a suboptimal output may rise, as long as the margin still grows. APO makes these likelihood changes explicit by anchoring them, giving fine-grained control over whether each output's probability rises or falls and improving the stability and effectiveness of alignment.
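To make the CLAIR procedure concrete, the following is a minimal, illustrative sketch of the data-creation loop, not the authors' implementation. The helpers `target_model_generate` and `revise_with_stronger_model` are hypothetical stand-ins for a call to the model being aligned and a call to a stronger reviser model, respectively.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the minimally revised output
    rejected: str  # the original model output

def revise_with_stronger_model(prompt: str, draft: str) -> str:
    """Hypothetical helper: ask a stronger LLM to minimally revise `draft`
    (clarity, correctness, engagement) while changing as little as possible."""
    raise NotImplementedError("plug in an LLM call here")

def clair_pair(prompt: str, target_model_generate) -> PreferencePair:
    # Generate with the model that will later be aligned.
    draft = target_model_generate(prompt)
    # Revise that draft rather than sampling an unrelated "better" answer,
    # so the pair differs only where the draft was actually improved.
    revision = revise_with_stronger_model(prompt, draft)
    return PreferencePair(prompt=prompt, chosen=revision, rejected=draft)
```

Because the rejected output comes from the target model itself and the chosen output is a minimal revision of it, the resulting pairs avoid the spurious differences that arise when preferences are collected over two unrelated responses.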
Experimental Validation
The authors conducted comprehensive experiments aligning Llama-3-8B-Instruct across several datasets and alignment objectives, measuring performance using the MixEval-Hard benchmark—a reliable proxy for human judgments.
Key Results
- Contrastive Data Generation: CLAIR-generated preferences produced the largest performance gains. Specifically, Llama-3-8B-Instruct trained with CLAIR data and the APO-zero objective achieved a 7.65% improvement on MixEval-Hard, reducing the performance gap with GPT-4-turbo by 45%.
- Training Objectives: APO variants consistently outperformed DPO and other less controllable objectives. APO-zero, which pushes the likelihood of winning outputs up and losing outputs down, performed best when the winning outputs were generally better than the model's own generations; APO-down, which lowers both likelihoods while driving the losing output further below the winning one, was more effective when the winning outputs were worse than the model (see the illustrative loss sketch below).
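The sketch below illustrates how anchored objectives differ from DPO at the level of the loss. The exact APO definitions are given in the paper; the forms used here are assumptions chosen to match the qualitative description above, with `log_ratio_w` and `log_ratio_l` denoting log(pi_theta(y|x) / pi_ref(y|x)) for the winning and losing outputs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(log_ratio_w, log_ratio_l, beta=0.1):
    # DPO constrains only the margin between winning and losing outputs, so the
    # winning likelihood can drop as long as the losing one drops even faster.
    return -F.logsigmoid(beta * (log_ratio_w - log_ratio_l)).mean()

def apo_zero_loss(log_ratio_w, log_ratio_l, beta=0.1):
    # Anchored at the reference model: push the winning likelihood up and the
    # losing likelihood down in absolute terms (assumed form; see the paper).
    return (1 - torch.sigmoid(beta * log_ratio_w)
            + torch.sigmoid(beta * log_ratio_l)).mean()

def apo_down_loss(log_ratio_w, log_ratio_l, beta=0.1):
    # For data where even the winning output is worse than the model: allow the
    # winning likelihood to fall while driving the losing output further below
    # it (assumed form; see the paper).
    return (torch.sigmoid(beta * log_ratio_w)
            + 1 - torch.sigmoid(beta * (log_ratio_w - log_ratio_l))).mean()
```

The practical takeaway is the pairing of objective and data quality: if the winning outputs are better than what the model currently produces, an objective that raises their likelihood (APO-zero) is appropriate; if they are worse, raising their likelihood would teach the model to degrade, so APO-down is the safer choice.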
Implications and Speculations on Future Developments
Practical Implications
The paper’s contributions have significant implications for both research and applications of LLMs:
- Data Collection and Annotation: The insights from CLAIR can inform better practices for generating preference datasets, making more effective use of annotation resources. This is particularly valuable in specialized or sensitive domains, where each preference pair must carry as clear a signal as possible.
- Model Training and Fine-Tuning: APO provides a nuanced approach to updating models, ensuring that training dynamics are well-aligned with the quality of the data. This can lead to more robust and reliable models, particularly relevant for applications requiring high precision and alignment with human values.
Theoretical Implications
The findings also contribute to the theoretical understanding of model alignment:
- Contrastiveness as a Metric: The notion of contrastiveness introduced by CLAIR could be developed into an explicit metric for evaluating and generating training data (a rough illustration follows this list), potentially informing new data-curation methodologies in machine learning.
- Anchoring in Optimization: APO's approach to anchoring likelihood changes based on model-data relationships could inspire novel alignment objectives and optimization algorithms. This perspective could be extended to other domains where model alignment with human preferences is critical, such as recommendation systems and autonomous agents.
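As a rough illustration of how contrastiveness might be operationalized as a data metric, one could start from token-level dissimilarity between the chosen and rejected responses. This is not a metric from the paper, only a simple proxy: minimally contrasting pairs, like those produced by CLAIR, should score low, while pairs of unrelated responses score high.

```python
def pair_dissimilarity(chosen: str, rejected: str) -> float:
    """1 - Jaccard similarity over whitespace tokens: 0 = identical, 1 = disjoint."""
    a, b = set(chosen.split()), set(rejected.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# A minimally revised pair scores far lower than a pair of unrelated answers.
print(pair_dissimilarity("The capital of France is Paris.",
                         "The capital of France is Lyon."))  # ~0.29
print(pair_dissimilarity("The capital of France is Paris.",
                         "I like turtles."))                  # 1.0
```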
Conclusion
The paper makes significant strides in the alignment of LLMs by addressing underspecification issues through CLAIR and APO. These contributions enhance the precision and stability of model alignment, showing promising results in empirical evaluations. The work encourages further exploration of contrastiveness and anchoring in alignment objectives, with potential applications spanning multiple AI research areas. Future research could build on these foundations to refine alignment techniques and develop more sophisticated models that better adhere to human values and preferences.