Discovering Preference Optimization Algorithms with and for Large Language Models (2406.08414v3)

Published 12 Jun 2024 in cs.LG

Abstract: Offline preference optimization is a key method for enhancing and controlling the quality of LLM outputs. Typically, preference optimization is approached as an offline supervised learning task using manually-crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains under-explored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously-evaluated performance metrics. This process leads to the discovery of previously-unknown and performant preference optimization algorithms. The best performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.


Summary

  • The paper presents a novel LLM-driven pipeline that automatically discovers new preference optimization algorithms, reducing reliance on human-crafted loss functions.
  • It employs an iterative meta-optimization process that evaluates generated loss functions on multi-turn dialogue (MT-Bench), summarization (TL;DR), and controlled sentiment generation (IMDb).
  • Results show that the discovered algorithm, DiscoPOP, outperforms established methods such as DPO, yielding higher win rates and strong transfer to held-out tasks.

Discovering Preference Optimization Algorithms With and For LLMs

Introduction

The paper presents a novel approach to preference optimization for LLMs. Traditionally, preference optimization relies on manually crafted loss functions, which restricts the search to objectives that human designers happen to propose and leaves much of the space of possible losses unexplored. The authors instead employ LLMs to automatically discover new preference optimization algorithms, which are then evaluated on downstream performance.

Background

Preference optimization for LLMs typically operates in two phases: pre-training on large text corpora, followed by fine-tuning to align outputs with human preferences. Conventional approaches rely on reinforcement learning from human feedback (RLHF) or offline preference optimization algorithms such as direct preference optimization (DPO) and sequence likelihood calibration (SLiC). While these have seen success, they remain confined to a small family of predefined convex loss functions. The paper addresses this limitation by automating the discovery of objective functions through iterative prompting of an LLM.
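For context, below is a minimal PyTorch sketch of the DPO objective, the main baseline the discovered losses are compared against. The function name and the assumption that per-sequence log-probabilities have already been computed are illustrative conventions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: a logistic loss on the gap between policy and reference
    log-ratios of the chosen vs. rejected response."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    # -log sigmoid(beta * logits) pushes the policy to prefer the
    # chosen response more strongly than the reference model does.
    return -F.logsigmoid(beta * logits).mean()
```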

Methodology

The approach centers on an LLM-driven discovery pipeline. The LLM iteratively generates new loss functions conditioned on the evaluation results of earlier candidates, enabling the discovery of high-performing algorithms with minimal human intervention. Each generated loss function is first checked for validity with unit tests; candidates that pass are used to fine-tune a model, which is then scored on MT-Bench. The process is inherently meta-optimization: an LLM optimizing the optimization procedure itself. A schematic of this outer loop appears below.
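The following is a hypothetical sketch of that loop; the three callables are placeholders for the pipeline stages described above, not the paper's actual API.

```python
# Hypothetical sketch of the LLM-driven objective-discovery loop.
# The three callables stand in for pipeline stages the paper describes.

def discover_objectives(propose_loss_with_llm,   # LLM call: history -> loss code string
                        passes_unit_tests,       # cheap validity check on the code
                        evaluate_on_mt_bench,    # fine-tune + score (expensive)
                        num_generations: int = 100):
    archive = []  # (loss_code, score) pairs fed back into the next prompt
    for _ in range(num_generations):
        candidate = propose_loss_with_llm(history=archive)
        if not passes_unit_tests(candidate):
            continue  # discard code that crashes or returns non-finite losses
        score = evaluate_on_mt_bench(candidate)
        archive.append((candidate, score))
    # Return the best-scoring objective discovered so far.
    return max(archive, key=lambda pair: pair[1])
```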

Discoveries and Experiments

This methodology led to the discovery of several novel loss functions, with Discovered Preference Optimization (DiscoPOP) emerging as the most promising. DiscoPOP adaptively blends logistic and exponential losses, with the mixture governed by the difference between the policy and reference log-ratios. Among the discovered algorithms, DiscoPOP demonstrated the strongest performance across tasks, validating its potential as a robust preference optimization algorithm.
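Based on the paper's description of the discovered objective (a log-ratio modulated loss), a sketch might look like the following; the gating temperature and beta values here are illustrative defaults, and the paper's released code should be treated as authoritative.

```python
import torch
import torch.nn.functional as F

def discopop_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  beta: float = 0.1, tau: float = 0.05) -> torch.Tensor:
    """Sketch of DiscoPOP: a sigmoid gate on the log-ratio difference
    adaptively blends a logistic (DPO-style) and an exponential loss.
    Hyperparameter defaults are illustrative assumptions."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    gate = torch.sigmoid(logits / tau)        # adaptive mixing weight
    logistic = -F.logsigmoid(beta * logits)   # logistic (DPO) component
    exponential = torch.exp(-beta * logits)   # exponential component
    return ((1.0 - gate) * logistic + gate * exponential).mean()
```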

The authors conducted extensive experiments to evaluate the discovered loss functions. The tasks included multi-turn dialogue on MT-Bench, summarization on the TL;DR dataset, and controlled sentiment generation on the IMDb dataset. DiscoPOP and other high-performing discovered objectives outperformed traditional algorithms such as DPO and SLiC in many of these settings.

Results

On the held-out Alpaca Eval 2.0 benchmark, DiscoPOP raised the win rate against GPT-4 from 11.23% to 13.21%. It likewise outperformed existing algorithms on the summarization and controlled sentiment generation tasks.

Implications

Practically, this research underscores the capabilities of LLMs not only as subjects to be optimized but also as agents capable of generating new optimization strategies. Theoretically, these findings extend the understanding of how automated systems can transcend human-imposed constraints, exploring more extensive and complex search spaces autonomously. The success of DiscoPOP suggests a future where optimization algorithms may become increasingly fluid and adaptive, learning and evolving in tandem with the models they are designed to optimize.

Future Directions

The paper opens avenues for several future research directions. One immediate extension could involve further refining the discovery process by introducing more sophisticated heuristics or meta-meta optimization techniques. Another route is expanding the types of tasks and evaluation metrics to validate the generalizability of discovered algorithms more comprehensively. Further, incorporating multi-objective optimization could yield loss functions that balance competing criteria such as performance and computational efficiency.

Conclusion

This work represents a significant step toward the automated discovery of optimization strategies for LLMs. By leveraging the generative capabilities of LLMs, the authors demonstrate how the human-creativity bottleneck in algorithm design can be relaxed. DiscoPOP and the other discovered algorithms not only outperform traditional methods but also offer a new perspective on how optimization objectives can be designed in practice, pointing toward a future in which models contribute more directly to their own development.
