- The paper introduces a novel framework leveraging Optimal Transport to align and merge transformer architectures while preserving key components like attention and residual connections.
- The paper demonstrates consistent performance gains of roughly 1% on image classification and NLP tasks over traditional fusion methods.
- The paper enables both homogeneous and heterogeneous fusion, offering efficient model compression and new insights into soft alignment in transformer models.
The paper presents a methodology for fusing transformer-based neural networks using Optimal Transport (OT), an approach that aligns models before merging them. This research builds on previous fusion methods developed predominantly for simpler architectures, such as fully connected, convolutional, and residual networks, and extends them to transformers. The novelty and strength of this work lie in handling the components specific to transformers, including multi-head attention, layer normalization, and residual connections. The aim is to enhance model performance by leveraging the collective capabilities of independently trained models, offering an efficient alternative to traditional ensemble methods.
Methodology
The authors propose a novel framework for fusing transformer models by aligning their weights with OT. OT permits a soft alignment of model parameters, accounting for the symmetries and structure of transformer architectures that simpler networks do not possess. A key element of the methodology is a "Transportation Map Flow Graph," which specifies how transportation maps propagate across and within the layers of the transformers, preserving a consistent flow of alignments through the diverse architectural components of the models.
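To make the core step concrete, here is a minimal numpy sketch of OT-based soft alignment, not the authors' implementation: the toy layer shapes, the squared-distance cost, and the `sinkhorn` helper are illustrative assumptions. It soft-aligns the neurons of one linear layer of model B to model A via an entropy-regularized transport plan, then averages the aligned weights.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=300):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / reg)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)    # rescale to satisfy the row marginals
        v = b / (K.T @ u)  # rescale to satisfy the column marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
W_a = rng.normal(size=(4, 8))  # one linear layer of model A: 4 neurons, 8 inputs
W_b = rng.normal(size=(4, 8))  # the corresponding layer of model B

# Cost of matching neuron i of A with neuron j of B (squared distance),
# normalized to [0, 1] for numerical stability of exp(-cost / reg).
cost = ((W_a[:, None, :] - W_b[None, :, :]) ** 2).sum(-1)
cost /= cost.max()

T = sinkhorn(cost)

# Soft alignment: each aligned neuron of B is a convex combination of
# B's neurons, weighted by the row-normalized transport plan.
W_b_aligned = (T / T.sum(axis=1, keepdims=True)) @ W_b
W_fused = 0.5 * (W_a + W_b_aligned)
```

Row-normalizing the plan turns it into a soft permutation: each aligned neuron of B is a weighted mixture of B's original neurons rather than a single matched neuron.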
The fusion method is evaluated in both homogeneous settings (models of identical architecture and size) and heterogeneous settings (models of differing sizes). This flexibility is advantageous for model compression and provides a resource-efficient strategy for leveraging pre-trained models.
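The heterogeneous case can be sketched the same way: because an OT plan may be rectangular, a wider layer can be soft-mapped onto a narrower one. Again a minimal numpy sketch under assumed toy shapes, not the paper's code; the `sinkhorn` helper is the same illustrative stand-in as for the homogeneous case.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=300):
    """Entropy-regularized OT plan; the two marginals may differ in size."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / reg)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
W_small = rng.normal(size=(4, 8))  # narrow model: 4 neurons, 8 inputs
W_large = rng.normal(size=(6, 8))  # wide model: 6 neurons, same input dim

# Rectangular cost (4 x 6), normalized for numerical stability.
cost = ((W_small[:, None, :] - W_large[None, :, :]) ** 2).sum(-1)
cost /= cost.max()

# The 4 x 6 plan compresses the wide layer onto the narrow one: each of
# the 4 target neurons absorbs mass from several of the 6 source neurons.
T = sinkhorn(cost)
W_large_compressed = (T / T.sum(axis=1, keepdims=True)) @ W_large
W_fused = 0.5 * (W_small + W_large_compressed)
```

The fused layer keeps the narrow model's width, which is why this style of fusion doubles as a compression and knowledge-transfer mechanism.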
Results
The effectiveness of the proposed fusion technique is validated on multiple tasks, including image classification with Vision Transformers (ViTs) and natural language processing with BERT models. After finetuning, the fused models outperform vanilla fusion and even the original individual models, showing a consistent ∼1% improvement over non-fused baselines while requiring far less computation and storage than traditional methods like ensembling.
The ablation studies conducted by the authors reveal intriguing insights into the role of soft alignment for transformers, uncovering that soft alignment can outperform hard (one-to-one) alignment, an effect previously underappreciated in the context of transformer models. This finding challenges assumptions carried over from other network types, thereby contributing to a deeper understanding of transformer-specific characteristics in model fusion.
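For contrast, the hard-alignment baseline that the ablation compares against matches neurons one-to-one. A minimal sketch with toy weights; using the Hungarian algorithm via scipy is a standard stand-in for hard matching, not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
W_a = rng.normal(size=(4, 8))  # one linear layer of model A
W_b = rng.normal(size=(4, 8))  # the corresponding layer of model B

# Same matching cost as in soft alignment: squared distance between neurons.
cost = ((W_a[:, None, :] - W_b[None, :, :]) ** 2).sum(-1)

# Hard alignment: an optimal one-to-one matching (a permutation of
# B's neurons), found with the Hungarian algorithm.
rows, cols = linear_sum_assignment(cost)
W_b_permuted = W_b[cols]  # reorder B's neurons to match A's
W_fused_hard = 0.5 * (W_a + W_b_permuted)
```

Soft alignment replaces this permutation with a transport plan, so matching mass can be split across several neurons; the ablation's finding is that this extra flexibility helps for transformers.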
Implications and Future Directions
The practical implication of this research is significant, offering a methodological path towards more efficient and powerful machine learning solutions without the prohibitive costs associated with training large transformers from scratch. By enabling the fusion of different-sized models and facilitating knowledge transfer from larger to smaller models, the method also provides a compelling strategy for model distillation.
On a theoretical level, this research enhances our understanding of the intrinsic symmetry and permutation invariance within transformer architectures—a subject not extensively explored before. It paves the way towards more generalized applications, potentially influencing the broader field of neural network design and optimization.
In the future, extending this methodology to support fusion across models of varying depths could represent a significant advancement, addressing a current limitation observed in both the presented work and existing literature. Such developments could further broaden the applicability and efficiency of transformer fusion techniques, impacting the evolving landscape of artificial intelligence.
In conclusion, the paper provides a comprehensive treatment of the fusion of transformer models through Optimal Transport, presenting both a robust theoretical foundation and clear evidence of improved empirical performance. This research stands as a substantial contribution toward advancing the efficiency and capability of neural network models in practical applications.