- The paper introduces Mixture-of-Mamba, a novel architecture that enhances multi-modal State Space Models (SSMs) by incorporating modality-aware sparsity.
- This architecture achieves significant reductions in training FLOPs compared to dense baselines while matching or improving performance across text, image, and speech modalities.
- Mixture-of-Mamba is validated in a three-modality setting (text, image, and speech), demonstrating robustness and scalability for efficient multi-modal pretraining.
Overview of Mixture-of-Mamba: Modality-Aware Sparsity in State-Space Models
The paper "Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity" presents a novel architecture designed to improve the performance of State Space Models (SSMs) in multi-modal tasks through a refined sparse design. This research is anchored in the field of machine learning, particularly in sequential modeling using SSMs, which have been touted as efficient alternatives to Transformers due to their linear scaling in sequence length and strong performance in single-modality tasks. However, until now, SSMs have not fully exploited modality-specific features, leading to suboptimal results especially in multi-modal pretraining.
The proposed Mixture-of-Mamba introduces modality-aware sparsity by giving each modality its own parameterization of the projections inside the Mamba block, extending the idea behind the Mixture-of-Transformers architecture to State-Space Models. In doing so, Mixture-of-Mamba preserves Mamba's computational efficiency while substantially reducing the cost of training deep multi-modal models spanning text, image, and speech.
Methodological Advancements
The Mixture-of-Mamba approach differs from conventional dense models by introducing sparsity at critical points of the SSM architecture: the block applies modality-specific parameterization to its input, intermediate, and output projections while keeping the remaining components shared. Each token is therefore processed by the parameters of its own modality, so computation is routed sparsely by modality rather than applied densely to every token, which yields highly efficient training.
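To make the routing concrete, here is a minimal PyTorch sketch (not the authors' implementation): `ModalityAwareLinear` holds one weight matrix per modality and dispatches each token by a modality index, and `MixtureOfMambaBlockSketch` applies this to the input and output projections of a Mamba-style block. The class names, the choice of which projections to show, and the placeholder standing in for the convolution and selective scan are my own simplifying assumptions.

```python
import torch
import torch.nn as nn


class ModalityAwareLinear(nn.Module):
    """One linear projection per modality; each token is routed to its modality's weights."""
    def __init__(self, d_in, d_out, num_modalities):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(num_modalities)])

    def forward(self, x, modality_ids):
        # x: (num_tokens, d_in); modality_ids: (num_tokens,) ints in [0, num_modalities)
        out = x.new_zeros(x.shape[0], self.projs[0].out_features)
        for m, proj in enumerate(self.projs):
            mask = modality_ids == m              # tokens belonging to modality m
            if mask.any():
                out[mask] = proj(x[mask])         # apply that modality's weights only
        return out


class MixtureOfMambaBlockSketch(nn.Module):
    """Mamba-style block with modality-specific input and output projections.
    The convolution and selective SSM scan are replaced by a placeholder here."""
    def __init__(self, d_model, d_inner, num_modalities=3):
        super().__init__()
        self.in_proj = ModalityAwareLinear(d_model, 2 * d_inner, num_modalities)
        self.out_proj = ModalityAwareLinear(d_inner, d_model, num_modalities)

    def forward(self, x, modality_ids):
        x_inner, z = self.in_proj(x, modality_ids).chunk(2, dim=-1)
        y = torch.tanh(x_inner)                   # placeholder for conv + selective scan
        y = y * torch.nn.functional.silu(z)       # gated output path, as in Mamba
        return self.out_proj(y, modality_ids)


# Example: a mixed batch of 6 tokens from text (0), image (1), and speech (2)
block = MixtureOfMambaBlockSketch(d_model=32, d_inner=64)
tokens = torch.randn(6, 32)
modality_ids = torch.tensor([0, 0, 1, 1, 2, 2])
out = block(tokens, modality_ids)                 # shape (6, 32)
```

Because each token only touches its own modality's weights, the per-token compute matches a single dense projection even though the parameter count grows with the number of modalities.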
- Modality-Aware Parameterization: The architecture introduces sparsity by decoupling, per modality, the projection weights that a dense model would share. This specialization lets Mixture-of-Mamba learn and represent modality-specific characteristics efficiently during training.
- Training Efficiencies: Across the evaluated model scales and training settings, the architecture consistently reaches the same or lower loss values with significantly fewer training floating-point operations (FLOPs). Notably, in the Transfusion and Chameleon settings, Mixture-of-Mamba reaches the target loss far earlier in training than its dense counterparts, saving both time and compute.
- Three-Modality Training: Extending Mixture-of-Mamba to a three-modality setting (text, image, and speech) confirms the robustness and scalability of the architecture. Even in this more complex regime, the approach matches or improves loss across all three modalities while significantly reducing training cost relative to the dense baseline.
Experimental Results
The paper demonstrates that Mixture-of-Mamba outperforms both dense SSM baselines and Flex-Attention Transformer models across multiple training settings and evaluation tasks.
- In the Transfusion multi-modal training setting, Mixture-of-Mamba reaches the target image loss with only 34.76% of the training FLOPs at the larger model scales evaluated.
- In the Chameleon setting, the architecture achieves a relative training-FLOPs reduction of more than 50% for both the text and image modalities while retaining performance comparable to the baseline models; a toy sketch of how such FLOPs-to-target-loss comparisons are read off training curves follows below.
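The relative-FLOPs figures above can be understood as follows: find the cumulative training FLOPs at which the sparse run first reaches the dense run's target loss, then divide by the FLOPs the dense run needed. The sketch below illustrates this with entirely hypothetical loss curves; it is not the paper's evaluation code or data.

```python
import numpy as np

def flops_to_reach(loss_curve, flops_curve, target_loss):
    """Return the cumulative training FLOPs at which the loss first drops to target_loss."""
    hits = np.nonzero(np.asarray(loss_curve) <= target_loss)[0]
    if hits.size == 0:
        raise ValueError("target loss never reached")
    return flops_curve[hits[0]]

# Hypothetical loss/FLOPs curves for a dense and a sparse run (illustration only).
dense_loss,  dense_flops  = [2.0, 1.6, 1.3, 1.1], [1e18, 2e18, 3e18, 4e18]
sparse_loss, sparse_flops = [1.9, 1.4, 1.1, 0.9], [1e18, 2e18, 3e18, 4e18]

target = dense_loss[-1]                                   # dense model's final loss
ratio = flops_to_reach(sparse_loss, sparse_flops, target) / dense_flops[-1]
print(f"sparse run reaches the dense target loss with {ratio:.0%} of the FLOPs")
```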
Implications and Future Directions
The paper marks a significant step towards scalable and efficient multi-modal pretraining architectures by demonstrating that modality-specific sparsity can be introduced into SSMs both feasibly and effectively. Theoretically, this suggests a promising direction for future research into even finer-grained sparsity within and across modalities. Practically, the architecture lays the groundwork for deploying high-performance multi-modal machine learning in resource-constrained environments by substantially reducing the computational cost of model training.
In conclusion, Mixture-of-Mamba offers a versatile design principle that extends beyond conventional dense architectures by making modality-awareness central to how state-space models are sparsely parameterized. The results across model scales and modality combinations set a strong reference point for multi-scale and multi-modal training and point to modality-aware sparsity as a practical ingredient of next-generation machine learning architectures.