Weighted-Reward Preference Optimization for Implicit Model Fusion (2412.03187v2)

Published 4 Dec 2024 in cs.CL

Abstract: While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.

Summary

  • The paper introduces the Weighted-Reward Preference Optimization framework, fusing diverse LLMs without the need for explicit vocabulary alignment.
  • It employs a progressive adaptation strategy that gradually shifts reliance from target to source examples to manage distributional discrepancies.
  • Empirical results demonstrate WRPO’s superior performance over baselines on benchmarks like AlpacaEval-2 and Arena-Hard, underscoring its efficiency and scalability.

Overview of Weighted-Reward Preference Optimization for Implicit Model Fusion

The paper "Weighted-Reward Preference Optimization for Implicit Model Fusion" introduces a novel approach designed to enhance the fusion of heterogeneous LLMs with varying architectures and parameter sizes. While traditional methods for model fusion necessitate complex processes such as vocabulary alignment and distribution matrix merging—often leading to inefficiencies and potential errors—this paper presents an alternative strategy that improves upon these methods in terms of both simplicity and performance.

Methodological Framework

At the crux of the paper is the Weighted-Reward Preference Optimization (WRPO) framework. WRPO sidesteps the conventional alignment and merging complexities by fusing models implicitly through preference optimization: the strengths of multiple source models are transferred to a target LLM without any explicit vocabulary alignment. WRPO is particularly distinguished by its progressive adaptation strategy, which gradually shifts reliance from preferred examples generated by the target LLM to those generated by the source LLMs. This strategy addresses distributional discrepancies between the source and target models, ensuring a smoother transition and more effective capability transfer.
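Although the paper's exact objective is not reproduced in this overview, the description above suggests a DPO-style formulation in which the reward for the preferred response is a weighted blend of a source-generated and a target-generated response. The PyTorch sketch below illustrates one plausible form of such a weighted-reward loss; the function name, argument names, and default hyperparameters are assumptions made for illustration, not the authors' released implementation.

```python
# A minimal sketch of a weighted-reward, DPO-style preference loss, assuming the
# implicit reward r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) and a fusion
# coefficient `alpha` that shifts weight from the target-generated preferred
# response to the source-generated one. Names and defaults are illustrative.
import torch
import torch.nn.functional as F


def wrpo_loss(
    policy_logp_src_win: torch.Tensor,   # log pi_theta(y_w_source | x)
    policy_logp_tgt_win: torch.Tensor,   # log pi_theta(y_w_target | x)
    policy_logp_lose: torch.Tensor,      # log pi_theta(y_l | x)
    ref_logp_src_win: torch.Tensor,      # log pi_ref(y_w_source | x)
    ref_logp_tgt_win: torch.Tensor,      # log pi_ref(y_w_target | x)
    ref_logp_lose: torch.Tensor,         # log pi_ref(y_l | x)
    beta: float = 0.1,
    alpha: float = 0.5,
) -> torch.Tensor:
    """Weighted-reward preference loss over a (source-win, target-win, lose) triple."""
    # Implicit rewards for each response (sequence-level log-prob ratios).
    r_src_win = beta * (policy_logp_src_win - ref_logp_src_win)
    r_tgt_win = beta * (policy_logp_tgt_win - ref_logp_tgt_win)
    r_lose = beta * (policy_logp_lose - ref_logp_lose)

    # Blend the two preferred rewards; alpha grows over training so reliance
    # shifts progressively from the target LLM's responses to the source LLMs'.
    r_win = alpha * r_src_win + (1.0 - alpha) * r_tgt_win

    # Standard Bradley-Terry / DPO-style logistic loss on the reward margin.
    return -F.logsigmoid(r_win - r_lose).mean()
```

In this reading, the progressive adaptation strategy corresponds to increasing `alpha` from 0 toward 1 over the course of training, for example linearly with the training step.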

Performance Evaluation

The empirical evaluation of WRPO is extensive and compelling. It surpasses existing knowledge fusion methods and fine-tuning baselines across several benchmarks, including MT-Bench, AlpacaEval-2, and Arena-Hard. Particularly noteworthy is WRPO's performance with LLaMA3-8B-Instruct as the target model: a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. These results show that WRPO not only enhances the target model's capabilities but also lets it outperform models traditionally regarded as benchmarks in the field.

Theoretical and Practical Implications

The research offers significant implications in both theoretical and practical contexts. Theoretically, it challenges prevailing paradigms in model fusion by demonstrating that implicit fusion via preference optimization can be as effective as, if not more effective than, explicit methods. On a practical level, WRPO presents a more scalable and less complex approach, reducing the computational overhead typical of ensemble methods that require inference across multiple LLMs simultaneously.

Furthermore, the weighted preference objective marks a progression in alignment techniques for LLMs. By integrating outputs from multiple powerful source LLMs into a single proficient target model, WRPO consolidates their collective capabilities while maintaining operational efficiency. This methodology is likely to influence future developments in LLM fusion, potentially leading to more refined techniques that prioritize implicit over explicit fusion strategies.
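One way such integration of source-LLM outputs could be operationalized, sketched here purely as an illustration and under the assumption that candidate responses are ranked by an external reward model, is to build a (source-preferred, target-preferred, dispreferred) triple per prompt. The helper names below (`generate`, `reward_model.score`) are placeholders, not an API from the paper's codebase.

```python
# A hypothetical sketch of assembling preference triples for implicit fusion:
# candidates are sampled from several source LLMs and from the target LLM,
# scored with an off-the-shelf reward model, and the best source response,
# best target response, and worst target response form one training triple.
from dataclasses import dataclass


@dataclass
class PreferenceTriple:
    prompt: str
    source_win: str   # best response among the source LLMs
    target_win: str   # best response from the target LLM
    lose: str         # worst response from the target LLM


def build_triple(prompt, source_llms, target_llm, reward_model, n_samples=5):
    # Sample candidate responses from each source LLM and from the target LLM.
    source_candidates = [
        resp
        for llm in source_llms
        for resp in llm.generate(prompt, n=n_samples)
    ]
    target_candidates = target_llm.generate(prompt, n=n_samples)

    # Rank candidates with the reward model and pick the extremes.
    score = lambda resp: reward_model.score(prompt, resp)
    return PreferenceTriple(
        prompt=prompt,
        source_win=max(source_candidates, key=score),
        target_win=max(target_candidates, key=score),
        lose=min(target_candidates, key=score),
    )
```

Triples of this form plug directly into a weighted-reward loss such as the sketch in the previous section, with the source-preferred response weighted more heavily as training progresses.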

Speculations on Future AI Developments

Looking forward, the deployment of methods like WRPO may lead to advancements in constructing more generalized AI models capable of leveraging diverse strengths from various architectures and data-informed models. The adaptability of WRPO suggests potential for enhanced transfer learning and multi-modal model integration, heralding the next evolution of AI systems distinguished by their adaptability, efficiency, and enhanced learning from heterogeneously sourced data.

This paper, therefore, represents a significant contribution to the ongoing discourse on LLM ensemble strategies and presents an interesting avenue for future exploration and innovation in AI model fusion methodologies.
