Accelerating Direct Preference Optimization with Prefix Sharing (2410.20305v2)

Published 27 Oct 2024 in cs.LG and cs.CL

Abstract: Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve redundant computations, especially for tasks with long shared prompts. We introduce prefix sharing for preference tuning, a novel technique that processes chosen and rejected responses as one sequence with a shared prefix. To prevent cross-response contamination, we use a custom block-sparse attention mask. Our method achieves $1.1$-$1.5\times$ improvement in training throughput on popular DPO datasets, without any effect on convergence. When combined with sequence packing, we observe consistent $1.3$-$1.6\times$ speedups, benefiting even datasets with smaller sequence lengths. While we focus on Direct Preference Optimization (DPO), our approach is applicable to other paired preference tuning methods. By enhancing computational efficiency, our work contributes to making preference-based fine-tuning more accessible for a wider range of applications and model sizes. We open-source our code at https://github.com/frankxwang/dpo-prefix-sharing.

Summary

  • The paper introduces prefix sharing to eliminate redundant computations in paired preference optimization, significantly boosting training throughput.
  • A custom block-sparse attention mask keeps the chosen and rejected log probabilities independent, and the approach composes with sequence packing for additional efficiency.
  • Empirical tests show speedups of up to 1.6 times on datasets with long shared prefixes, enabling scalable fine-tuning for large language models.

Accelerating Direct Preference Optimization with Prefix Sharing: A Technical Summary

The paper "Accelerating Direct Preference Optimization with Prefix Sharing" addresses the computational inefficiencies associated with traditional implementations of offline paired preference optimization algorithms, particularly for tasks with extensive shared prompts. The authors propose a novel method termed "prefix sharing" aimed at enhancing the training throughput of Direct Preference Optimization (DPO) and similar methodologies without affecting convergence or increasing memory consumption substantially.

Core Methodology

Traditional paired preference optimization processes each prompt twice, once paired with the chosen response and once with the rejected response, leading to redundant computation over the shared prompt. The innovation presented in this paper is to process the chosen and rejected responses as a single sequence with a shared prefix, using a custom block-sparse attention mask to prevent cross-response contamination. This eliminates the redundant forward and backward passes over the shared prompt, as sketched below.
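
To make the masking concrete, the following is a minimal sketch (not the authors' code) of the block-sparse mask for a single training example laid out as [shared prompt | chosen response | rejected response]; the lengths are illustrative placeholders.

```python
import torch

# Illustrative lengths; in practice these come from the tokenized example.
prefix_len, chosen_len, rejected_len = 4, 3, 3
total_len = prefix_len + chosen_len + rejected_len

# Start from a standard causal mask over the combined
# [shared prompt | chosen response | rejected response] sequence.
mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))

# Block rejected-response queries from attending to chosen-response keys,
# so each response conditions only on the shared prompt and on itself.
chosen = slice(prefix_len, prefix_len + chosen_len)
rejected = slice(prefix_len + chosen_len, total_len)
mask[rejected, chosen] = False
```

Under this mask, the chosen and rejected log probabilities are computed exactly as if the two sequences had been processed separately, while the shared prompt is only run through the model once.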

The authors leverage PyTorch’s FlexAttention to implement this custom attention mask, ensuring each response is computed independently and preserving the log probability calculations required by DPO. The methodology also extends to sequence packing, an efficiency technique that packs multiple examples into a single sequence, improving hardware utilization, particularly for datasets with shorter sequences.
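
The same constraint can be expressed as a FlexAttention `mask_mod`, which PyTorch turns into a block-sparse attention kernel. The sketch below assumes per-example boundary indices are available as tensors (the names `prefix_len` and `rejected_start` are hypothetical, not from the paper's code).

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Hypothetical per-example boundaries (token indices), one entry per batch row:
# the shared prompt spans [0, prefix_len) and the chosen response spans
# [prefix_len, rejected_start); everything after is the rejected response.
prefix_len = torch.tensor([128, 96], device="cuda")
rejected_start = torch.tensor([192, 160], device="cuda")

def prefix_sharing_mask(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx
    kv_in_chosen = (kv_idx >= prefix_len[b]) & (kv_idx < rejected_start[b])
    q_in_rejected = q_idx >= rejected_start[b]
    # Drop attention from rejected-response queries to chosen-response keys.
    return causal & ~(q_in_rejected & kv_in_chosen)

B, H, S, D = 2, 8, 256, 64
block_mask = create_block_mask(prefix_sharing_mask, B=B, H=None, Q_LEN=S, KV_LEN=S)
q = k = v = torch.randn(B, H, S, D, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```

Sequence packing composes naturally with this formulation: the `mask_mod` can additionally compare per-token document ids so that tokens from different packed examples never attend to one another.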

Computational Results

Empirical results on several datasets reveal substantial improvements in training throughput. Datasets with high prefix-to-completion length ratios and longer sequences see speedups of 1.1 to 1.5 times from prefix sharing alone, and approximately 1.3 to 1.6 times when sequence packing is added. This is particularly evident in tasks such as multi-step reasoning and summarization, where prompts tend to be significantly longer than individual responses. For instance, the Capybara dataset exhibited a 1.42 to 1.54 times speedup over a FlashAttention-3 baseline when prefix sharing was combined with sequence packing.

Although FlexAttention kernels are currently slower than FlashAttention-3, prefix sharing, especially when combined with sequence packing, more than offsets this overhead and often outperforms a FlashAttention-3 baseline in full end-to-end training.

Broader Implications and Future Directions

This research provides a practical avenue for making preference-based LLM fine-tuning accessible across a wider range of applications and model sizes. By improving computational efficiency, the proposed method supports the scalability of DPO and related preference optimization algorithms, enhancing their applicability in tasks ranging from instruction following and agentic planning to complex code reasoning. Moving forward, it would be worthwhile to explore the adaptability of prefix sharing to other paired preference tuning methods beyond DPO, potentially paving the way for broader optimization strategies within preference learning frameworks.

Moreover, extending the prefix sharing concept to further optimize LLM decoding during inference is an intriguing direction for future work. Integrating this method with more optimized kernel implementations could potentially unlock additional computational efficiencies, driving even broader applications in real-time and resource-constrained environments.

In conclusion, the paper's contribution lies in improving the training throughput of paired preference optimization through a simple attention-masking and data-layout change, offering a tangible step towards more efficient and scalable preference-based fine-tuning.