Overview of Alternating Tuning and Merging (ATM) for Model Merging
The research paper "ATM: Improving Model Merging by Alternating Tuning and Merging" presents a novel approach to the challenges inherent in model merging, particularly in multi-task settings. The ATM framework rests on a connection between task arithmetic, an established model merging technique, and gradient descent. The paper argues that task vectors are conceptually equivalent to gradients of the corresponding task losses, and on this basis proposes Alternating Tuning and Merging (ATM) as an iterative alternative to conventional one-shot model aggregation.
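For context, here is a minimal sketch of standard one-shot task arithmetic, the merging scheme ATM builds on. The helper name task_arithmetic_merge and the scaling coefficient alpha are illustrative choices, not identifiers from the paper.

```python
import torch

def task_arithmetic_merge(theta_0, finetuned_models, alpha=0.3):
    """One-shot task arithmetic: add the scaled sum of task vectors to the base model.

    theta_0          -- state_dict of the pretrained (base) model
    finetuned_models -- list of state_dicts, one per task-specific fine-tuned model
    alpha            -- scaling coefficient for the summed task vectors (illustrative value)
    """
    merged = {}
    for name, base_param in theta_0.items():
        # Task vector for task t: tau_t = theta_t - theta_0
        task_vectors = [ft[name] - base_param for ft in finetuned_models]
        # Merged weights: theta_0 + alpha * sum_t tau_t
        merged[name] = base_param + alpha * torch.stack(task_vectors).sum(dim=0)
    return merged
```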
Theoretical Insights and Contributions
The paper points out a theoretical oversight in standard one-step merging methods based on task arithmetic: they tend to overshoot the multi-task optimum. The authors show that when fine-tuning lasts a single epoch, the task vector is the additive inverse of the accumulated gradient steps, and that much of a task vector's effectiveness stems from the gradient direction established in that first fine-tuning epoch.
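The core of this argument can be illustrated with a simplified single-step sketch (a simplification for exposition, not the paper's full derivation). If each task model is obtained from the shared initialization θ_0 by one gradient step on its task loss L_t with learning rate η, the task vector is exactly the negative scaled gradient, and merging task vectors reproduces a gradient step on the summed multi-task loss:

```latex
% One gradient step of fine-tuning on task t
\theta_t = \theta_0 - \eta \, \nabla L_t(\theta_0)
\quad\Longrightarrow\quad
\tau_t = \theta_t - \theta_0 = -\eta \, \nabla L_t(\theta_0)

% Task arithmetic with coefficient \lambda then acts like a single
% gradient-descent step on the summed multi-task loss
\theta_0 + \lambda \sum_t \tau_t
  = \theta_0 - \lambda \eta \sum_t \nabla L_t(\theta_0)
```

With longer fine-tuning this equivalence becomes an approximation, and when the accumulated task vectors have large norms, the single merged step can move past the multi-task optimum, which is the overshooting behavior described above.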
ATM is introduced as a framework that alternates between fine-tuning the model on individual tasks and merging the resulting task vectors. Taking repeated small merging steps rather than one large one reduces the risk of overshooting, and interference-resolution strategies can still be applied at each merge to improve the final model. The framework is flexible enough to integrate with existing task-vector methods, without the additional computational burden typically associated with task-vector pruning or elaborate weight adjustments.
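Below is a minimal sketch of the alternating loop under the assumptions above; atm_merge, the finetune_fn callback, and the iteration and scaling defaults are placeholders rather than the paper's actual implementation.

```python
import torch

def atm_merge(theta_0, task_datasets, finetune_fn, num_iterations=5, alpha=0.3):
    """Alternating Tuning and Merging (sketch).

    finetune_fn(state_dict, dataset) should run a short fine-tuning round
    (e.g. one epoch) starting from state_dict and return updated weights.
    Instead of fine-tuning each task to convergence once and merging,
    the loop repeats short tuning rounds and merges after each round.
    """
    theta = {k: v.clone() for k, v in theta_0.items()}
    for _ in range(num_iterations):
        # Tune: briefly fine-tune the current merged model on each task.
        task_vectors = []
        for dataset in task_datasets:
            theta_t = finetune_fn(theta, dataset)
            task_vectors.append({k: theta_t[k] - theta[k] for k in theta})
        # Merge: take a small step in the combined direction of all task vectors.
        for k in theta:
            theta[k] = theta[k] + alpha * torch.stack(
                [tv[k] for tv in task_vectors]).sum(dim=0)
    return theta
```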
Key contributions from the research include:
- Demonstrating that task vectors, under specific conditions, either equal or closely approximate the (negative) gradients of the task losses.
- Highlighting that prevalent one-shot merging frameworks can overshoot multi-task optima, especially when task vectors possess large norms.
- Introducing ATM as a general, iterative merging framework, with empirical evidence of increased task-vector orthogonality (see the cosine-similarity sketch after this list) and theoretical support for its ability to reduce the multi-task loss.
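The orthogonality claim can be probed directly by measuring pairwise cosine similarities between flattened task vectors; the following is a sketch of one way to do so, not code from the paper.

```python
import torch

def pairwise_cosine_similarity(task_vectors):
    """Compute pairwise cosine similarities between task vectors.

    task_vectors -- list of state_dicts (parameter name -> tensor), one per task.
    Values near 0 indicate near-orthogonal task vectors, the behaviour
    ATM is reported to encourage across iterations.
    """
    # Flatten each task vector into a single 1-D tensor.
    flat = [torch.cat([t.flatten() for t in tv.values()]) for tv in task_vectors]
    n = len(flat)
    sims = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            sims[i, j] = torch.nn.functional.cosine_similarity(
                flat[i], flat[j], dim=0)
    return sims
```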
Empirical Evaluation and Results
Extensive experiments show that ATM outperforms established baselines across diverse datasets in computer vision (ViT-B-16 backbone) and NLP (RoBERTa-base and BERT-base-uncased). Notably, ATM achieves up to 20% higher accuracy than current baseline methods. These gains hold across compute budgets and become more pronounced as more computational resources are allocated.
A key observation is ATM's ability to balance specialist performance with collective multi-task effectiveness, retaining the benefits of task-specific learning trajectories without inheriting the limitations of the one-step assumption made by prior methods.
Implications and Future Directions
Practically, the ATM framework offers a robust option for multi-task scenarios in which storage constraints require a single strong model rather than one specialist per task. Theoretically, the insights on gradient alignment provide a foundation for further investigation into model merging dynamics. Future research might combine ATM's iterative framework with advanced gradient-descent techniques or interference-mitigation strategies to further improve performance without sacrificing computational efficiency.
Conclusion
ATM represents a significant advance in the model merging field by addressing the inherent deficiencies of one-shot task-vector methods. Its design and empirical success open pathways for more nuanced, privacy-preserving multi-task models that can integrate into diverse machine learning pipelines. The research underscores the value of iteratively tuning and merging models to achieve a strong balance of performance across tasks.