- The paper introduces Aurora, a novel parameter-efficient method using tensor decomposition to tune large multimodal foundation models with minimal computational cost.
- Aurora achieves significant parameter reduction through CP decomposition and enhances cross-modal alignment with dedicated context enhancement and gated transformation modules.
- Evaluations show Aurora performs comparably to or better than full fine-tuning on benchmarks while training only about 0.04% of the baseline model's parameters, highlighting its scalability and efficiency.
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model: Aurora
The paper "Parameter-efficient Tuning of Large-scale Multimodal Foundation Model" introduces a novel approach named Aurora, designed to address challenges associated with adapting large-scale pre-trained multimodal models to various downstream tasks. The key dilemma the paper addresses is the excessive parameter reliance, which significantly increases computational load and limits the adaptability of large models to more general scenarios. Traditional fine-tuning methodologies necessitate vast resources and may not optimally leverage pre-trained knowledge in small-scale tasks. This paper proposes parameter-efficient transfer learning strategies to mitigate these challenges, specifically within multimodal contexts.
Lightweight Mode Approximation
Aurora applies mode approximation based on CANDECOMP/PARAFAC (CP) tensor decomposition, which sharply reduces the number of trainable parameters needed to adapt the model. The notion of intrinsic dimensionality is central: although large-scale models carry a huge number of high-dimensional parameters, the intrinsic dimension relevant to a specific downstream task is modest. By exploiting this redundancy in pre-trained architectures, Aurora injects only about 0.1M trainable parameters, roughly 0.04% of the baseline model's parameters, without sacrificing efficacy. This reduction makes adaptation of multimodal models both scalable and economical.
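To make the mode-approximation idea concrete, the sketch below (PyTorch) shows how a rank-r CP decomposition can parameterize the updates to many frozen weight matrices with a small set of shared factors. The class names, initialization scales, and the choice to index per-matrix contributions through a third factor matrix are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SharedCPFactors(nn.Module):
    """Rank-r CP factors shared by all adapted weight matrices.

    Stacking the K weight updates into a 3-way tensor of shape
    (d_out, d_in, K), a CP decomposition writes it as
        delta_W[:, :, k] = sum_r coeffs[r] * P[k, r] * outer(U[:, r], V[:, r])
    so only U, V, P, and coeffs are trained.
    """

    def __init__(self, d_out: int, d_in: int, num_updates: int, rank: int = 8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.02)        # mode-1 factors (shared)
        self.V = nn.Parameter(torch.randn(d_in, rank) * 0.02)         # mode-2 factors (shared)
        self.P = nn.Parameter(torch.randn(num_updates, rank) * 0.02)  # mode-3: one row per adapted matrix
        self.coeffs = nn.Parameter(torch.ones(rank))                  # per-rank scale

    def delta(self, k: int) -> torch.Tensor:
        # Update for the k-th adapted matrix: (U * coeffs * P[k]) @ V^T
        return (self.U * (self.coeffs * self.P[k])) @ self.V.t()


class CPAdaptedLinear(nn.Module):
    """A frozen pre-trained linear layer plus its CP-parameterized update."""

    def __init__(self, base: nn.Linear, factors: SharedCPFactors, index: int):
        super().__init__()
        self.base, self.factors, self.index = base, factors, index
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.factors.delta(self.index).t()
```

In this sketch a single `SharedCPFactors` instance would be shared by all adapted projections of matching shape, which is how the trainable budget can stay on the order of the reported 0.1M parameters.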
Enhancements for Modality Alignment
The paper stresses that aligning the modalities is essential for robust cross-modal transfer. Aurora introduces two modules for this purpose: Informative Context Enhancement and Gated Query Transformation. The former enriches the fused features with adaptive contextual information derived from the unimodal query features, using attention-weighted aggregation to balance the added context against the original representation. The latter applies a learnable gate that adaptively controls how much text information is injected during cascaded modality fusion, preserving critical text signals that would otherwise be attenuated deep in the network.
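The following sketch (again PyTorch, with assumed names and tensor shapes) illustrates the two ideas in their simplest form: an attention-weighted aggregation of candidate context features around the query, and a sigmoid gate that controls how much transformed query/text signal is mixed back into the fused representation. The exact formulation in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedQueryTransformation(nn.Module):
    """Sketch: a gate deciding how much query/text signal to re-inject
    into the fused cross-modal features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)     # scalar gate per position (illustrative choice)
        self.transform = nn.Linear(dim, dim)   # transformation of the query features

    def forward(self, fused: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_proj(fused))       # gate in (0, 1)
        return fused + g * self.transform(query)       # gated residual injection


def informative_context_enhancement(query: torch.Tensor,
                                    contexts: torch.Tensor) -> torch.Tensor:
    """Sketch: attention-weighted aggregation of candidate context features.

    query:    (batch, dim)          unimodal query feature
    contexts: (batch, n_ctx, dim)   candidate context features
    """
    scores = torch.einsum("bd,bnd->bn", query, contexts) / query.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)                    # attention weights over contexts
    enhanced_ctx = torch.einsum("bn,bnd->bd", weights, contexts)
    return query + enhanced_ctx                            # context-enriched feature
```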
Evaluation and Implications
Evaluations across six cross-modal benchmarks show that Aurora matches or surpasses full fine-tuning despite its minimal parameter footprint. Notably, it reports gains such as a 1.8% improvement on MSRVTT and a 0.5% improvement on VQAv2, outperforming other parameter-efficient methods. Its scalability is further evidenced by strong performance at an extremely low trainable-parameter budget relative to existing methods, underscoring Aurora's value in computationally constrained settings.
The implications extend beyond practical efficiency: conceptually, Aurora points to a different way of optimizing large-scale multimodal models, using tensor decomposition to exploit redundancy in the feature space. The findings suggest that foundation models could become easier to adapt across an increasingly diverse range of applications. By lowering computational and resource barriers, the approach could help democratize the deployment of sophisticated multimodal models in industries where such constraints have traditionally been limiting.
Future Directions in AI
The development and successful implementation of Aurora illustrate an intriguing direction in AI research: minimizing dependency on extensive computational resources while maximizing model adaptability and performance. Future research could explore the refinement of CP decomposition techniques or alternative tensor approximation methodologies to further enhance efficiency in parameter tuning. Moreover, extending these principles to other domains, including unimodal models, could lead to breakthroughs in resource-efficient AI development.
In conclusion, this paper offers a strategic approach to facilitating practical and theoretical advancements in multimodal AI development, balancing resource constraints with robust model performance across various downstream tasks.