Evaluating Task-Specific Parameter Localization for Model Merging and Compression
The paper under review addresses a central concern in machine learning: combining multiple fine-tuned models into a single multi-task model. As models continue to grow in size and number, effective techniques for model merging and compression have become essential. The authors propose a novel approach, TALL-masks, which improves merging strategies by localizing task-specific parameters in the parameter space, thereby addressing task interference and enabling model compression.
The research distinguishes two candidate explanations for performance degradation when merging models trained on different tasks: weight interference and task interference. Weight interference is a well-documented phenomenon in which parameters relevant to specific tasks are overwritten during merging, leading to a loss of task-specific information. Task interference, in contrast, holds that task-specific information is preserved after merging but is not effectively used because of overlaps in task requirements. The paper hypothesizes that by carefully selecting and activating task-specific parameters, task performance can be substantially improved without additional computational overhead.
Methodology
The authors present TALL-masks, a method that constructs binary masks to pinpoint and retain task-relevant weights in a merged task vector. Mask construction is grounded in a data-driven procedure that identifies the parameters contributing most to each task's performance. TALL-masks then filters the shared parameter space after merging, localizing the distinct contribution of each task vector while preserving per-task performance.
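As a rough illustration, the sketch below constructs such per-task binary masks by comparing the magnitude of each task vector entry against the aggregate contribution of the other tasks. The specific criterion, the threshold `lam`, and the function name `tall_mask` are assumptions made for illustration, not the paper's exact procedure.

```python
import torch

def tall_mask(task_vector: torch.Tensor,
              merged_vector: torch.Tensor,
              lam: float = 1.0) -> torch.Tensor:
    """Binary mask selecting entries where a task's own update dominates
    the contribution of the remaining tasks in the merged vector.

    Assumed criterion (illustrative): keep weight i if
        |tau_t[i]| >= lam * |tau_merged[i] - tau_t[i]|
    """
    others = merged_vector - task_vector           # aggregate of the other tasks
    return task_vector.abs() >= lam * others.abs()

# Toy usage: three "task vectors" (fine-tuned minus pretrained weights).
task_vectors = [torch.randn(10) for _ in range(3)]
merged = torch.stack(task_vectors).sum(dim=0)      # simple sum-based merging
masks = [tall_mask(tv, merged, lam=1.0) for tv in task_vectors]
print([m.float().mean().item() for m in masks])    # fraction of weights kept per task
```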
The framework evaluates the marginal contribution of each weight to a given task, enabling precise selection and pruning. As a result, models can be stored more efficiently by encoding only the essential parameter subsets, significantly reducing the storage footprint while maintaining high fidelity to the original fine-tuned models.
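The storage saving can be pictured as keeping a single pretrained checkpoint, one merged vector, and a 1-bit mask per task, then rebuilding each fine-tuned model on demand. The sketch below assumes a reconstruction of the form theta_t ≈ theta_0 + m_t ⊙ tau_merged; the helper names (`compress`, `reconstruct`) and the flat-tensor layout are illustrative.

```python
import torch

def compress(pretrained: torch.Tensor, task_vectors, masks) -> dict:
    """Keep one merged vector plus a boolean mask per task instead of
    storing every fine-tuned checkpoint in full (layout is illustrative)."""
    merged = torch.stack(task_vectors).sum(dim=0)   # simple sum-based merging
    return {"base": pretrained,
            "merged": merged,
            "masks": [m.to(torch.bool) for m in masks]}

def reconstruct(store: dict, t: int) -> torch.Tensor:
    """Approximate task t's fine-tuned weights: theta_t ~ theta_0 + m_t * tau_merged."""
    return store["base"] + store["masks"][t] * store["merged"]
```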
Results
The reported results demonstrate the efficacy of TALL-masks across multiple scenarios in both computer vision and natural language processing. Notably, the authors report recovering over 99% of the individual fine-tuned models' performance in settings with up to 20 tasks. The mask-based strategy also compresses storage requirements from 57 GB to 8.2 GB with negligible loss in task performance.
Importantly, the paper also introduces "Consensus Merging", a novel approach that extends TALL-masks by focusing on the consensus importance of parameters across tasks. This technique aims to better incorporate information shared across tasks by eliminating weights deemed useful only to individual tasks, thereby boosting merged-model performance.
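A minimal sketch of this idea follows, assuming the consensus rule is a simple vote count over the per-task masks with a cutoff `k`; both the rule and the names (`consensus_mask`, `consensus_merge`) are illustrative assumptions.

```python
import torch

def consensus_mask(masks, k: int = 2) -> torch.Tensor:
    """Keep a weight only if at least k per-task masks select it, filtering
    out weights that matter to a single task (k is an assumed knob)."""
    votes = torch.stack([m.to(torch.int) for m in masks]).sum(dim=0)
    return votes >= k

def consensus_merge(task_vectors, masks, k: int = 2) -> torch.Tensor:
    """Zero out non-consensus entries of a simple sum-based merged vector."""
    merged = torch.stack(task_vectors).sum(dim=0)
    return consensus_mask(masks, k) * merged
```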
Implications
The implications of this paper are significant for optimizing large-scale models. The proposed methods improve both the practicality of merging and the storage efficiency of fine-tuned models, and they lay the groundwork for more nuanced and adaptable multi-task learning frameworks. The ability to recover and maintain task-specific performance within a single model is instrumental for building generalist AI systems that must balance diverse task demands.
Future Directions
The findings open several avenues for further research. A natural next step is to explore deeper integration strategies for task-specific parameters in more intricate neural architectures. Future work should also investigate the scalability of TALL-masks on more varied real-world datasets and applications, automate hyperparameter selection, and extend these methods to unsupervised and reinforcement learning settings.
In summary, the authors present a compelling case for re-evaluating current model merging paradigms through task-specific parameter localization, offering meaningful improvements in both task performance and model compression. The method's applicability to both vision and NLP benchmarks underscores its versatility and its potential to advance the state of the art in model optimization.