- The paper presents Preference Optimization as a novel method leveraging human feedback to enhance the alignment and fairness of visual contrastive models.
- It formulates contrastive learning as a Markov Decision Process and combines a preference loss with a regularization loss, improving resistance to typographic attacks while maintaining accuracy on clean data.
- Experimental results show improved performance over baseline defenses such as PAINT, with significant reductions in gender bias and typographic vulnerability.
Aligning Visual Contrastive Learning Models via Preference Optimization
Contrastive learning has become a dominant method for training models that understand semantic similarities between data points by aligning their representations in a shared embedding space. Despite notable successes, these models often suffer from biases and vulnerabilities inherited from their training data. The paper "Aligning Visual Contrastive Learning Models via Preference Optimization" addresses these issues by introducing Preference Optimization (PO) as an alignment technique, aiming to improve both performance and fairness in contrastive models such as CLIP.
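As background, the sketch below shows the symmetric contrastive (InfoNCE) objective used by CLIP-style models: matching image-text pairs are pulled together in the shared embedding space while mismatched pairs are pushed apart. This is a minimal illustration, not the paper's code; the batch shapes and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; the diagonal holds the matching pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image retrieval.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```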
Preference Optimization Framework
Methodology
Preference Optimization takes advantage of human feedback and preference modeling to align model behavior with desired outcomes. The framework applies principles from Reinforcement Learning from Human Feedback (RLHF) and adapts them to contrastive learning. The primary goals are:
- Enhancing robustness against typographic attacks.
- Mitigating biases, particularly related to gender understanding.
The paper proposes a Markov Decision Process (MDP) formulation for contrastive learning tasks, treating them as retrieval systems where the alignment task focuses on selecting preferred outputs from a set. The policy optimization utilizes techniques such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) to adjust model behavior effectively.
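The paper does not spell out the loss implementation in this summary, so the following is a minimal sketch of a DPO-style preference term adapted to a retrieval setting: the model's image-text similarity scores (and their counterparts from a frozen reference model) stand in for the policy's log-probabilities, and beta scales the implicit reward. All function and argument names here are hypothetical. KTO follows a related idea but scores desirable and undesirable examples independently rather than as paired comparisons.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(sim_pref: torch.Tensor,      # model score for preferred caption
                        sim_disp: torch.Tensor,      # model score for dispreferred caption
                        sim_pref_ref: torch.Tensor,  # reference-model score, preferred
                        sim_disp_ref: torch.Tensor,  # reference-model score, dispreferred
                        beta: float = 0.1) -> torch.Tensor:
    # Implicit reward: how far the fine-tuned model's score has moved
    # relative to the frozen reference model, for each caption.
    reward_pref = beta * (sim_pref - sim_pref_ref)
    reward_disp = beta * (sim_disp - sim_disp_ref)

    # Maximize the margin between preferred and dispreferred rewards.
    return -F.logsigmoid(reward_pref - reward_disp).mean()
```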
Figure 1: Overview of our proposed framework. On the left, the preference optimization loss L_pref is computed from the preference dataset and the regularization loss L_reg from the clean regularization dataset.
Dataset and Training
Two types of datasets are used: a Preference Dataset containing preferred and dispreferred pairs to guide the model towards desired behaviors, and a Regularization Dataset consisting of clean samples to maintain accuracy on non-adversarial data. The overall loss function combines preference optimization with a regularization term to ensure model proximity to its original state, preventing excessive divergence.
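To make the combined objective concrete, here is a hedged sketch of a single training step under two assumptions: the model exposes a CLIP-like `encode_image`/`encode_text` interface, and the regularization term is a standard contrastive loss on clean pairs that keeps the fine-tuned model close to its reference behavior. The weighting coefficient `lam` and all names are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def training_step(model, ref_model, images, pref_texts, disp_texts,
                  clean_images, clean_texts,
                  beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    def sim(m, imgs, txts):
        i = F.normalize(m.encode_image(imgs), dim=-1)
        t = F.normalize(m.encode_text(txts), dim=-1)
        return (i * t).sum(dim=-1)  # cosine similarity per image-text pair

    # Preference term (DPO-style): reward the preferred caption relative to
    # the frozen reference model and penalize the dispreferred one.
    with torch.no_grad():
        ref_pref = sim(ref_model, images, pref_texts)
        ref_disp = sim(ref_model, images, disp_texts)
    pol_pref = sim(model, images, pref_texts)
    pol_disp = sim(model, images, disp_texts)
    l_pref = -F.logsigmoid(beta * ((pol_pref - ref_pref)
                                   - (pol_disp - ref_disp))).mean()

    # Regularization term: contrastive loss on clean pairs so the model
    # does not drift away from its original accuracy on non-adversarial data.
    i = F.normalize(model.encode_image(clean_images), dim=-1)
    t = F.normalize(model.encode_text(clean_texts), dim=-1)
    logits = i @ t.t() / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    l_reg = 0.5 * (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets))

    return l_pref + lam * l_reg
```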
Experimental Results
Typographic Attack Robustness
The PO framework was tested on typographic attack datasets using CLIP as the base model. Results show significant improvement over existing defenses such as PAINT and Defense-Prefix. The approach effectively suppresses the influence of adversarial text without compromising performance on clean data.
Figure 2: (Left) Accuracy on typographic samples and percentage of typographic label predictions versus transformation scaling factor t. As t increases, the model favors object labels over typographic labels while maintaining accuracy.
Mitigating Gender Bias
The framework achieved high control over gender-specific predictions, illustrating its ability to disentangle gender biases effectively. Adjustments are managed using transformation scaling, demonstrating a capacity to reverse or neutralize biases in model decision-making.
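The summary does not reproduce the exact transformation; one plausible reading of the scaling factor t is an interpolation between the reference model's embedding and the aligned model's embedding, so that t = 0 recovers the original behavior, t = 1 applies the full learned alignment, and negative t reverses it (e.g., flipping a gendered prediction). The sketch below illustrates that reading only and is not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def scaled_embedding(ref_emb: torch.Tensor,
                     aligned_emb: torch.Tensor,
                     t: float) -> torch.Tensor:
    # Move from the reference embedding toward (t > 0) or away from (t < 0)
    # the aligned embedding, then renormalize to the unit sphere.
    blended = ref_emb + t * (aligned_emb - ref_emb)
    return F.normalize(blended, dim=-1)

def label_scores(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    # Cosine similarity of one (blended) image embedding against candidate
    # label embeddings, e.g. gendered or typographic label prompts.
    return F.normalize(image_emb, dim=-1) @ F.normalize(text_embs, dim=-1).t()
```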
Figure 3: (Left) Image input. (Center) Comparison of model predictions before and after applying our gender-flipping method, showing changes in the predicted gender probabilities.
Conclusions
This research presents Preference Optimization as a promising technique for improving contrastive learning models where biases and adversarial vulnerabilities are concerns. The practical application of PO to tasks such as typographic attacks and gender bias illustrates its versatility and efficacy. Future research could explore extending these alignment methods to other data modalities, seeking further fairness and robustness improvements.
Overall, the paper's approach demonstrates how preference-based optimization can align models with human values while retaining previously learned knowledge and capabilities, laying groundwork for more resilient and socially aware AI systems.