An Analytical Overview of "Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation"
The paper "Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation" presents a compelling approach to enhancing the computational efficiency and deployment feasibility of Multimodal LLMs (MLLMs). Existing MLLMs have achieved significant success, but their practicality is often hampered by quadratic computational complexities and reliance on distinct vision encoders. The authors introduce mmMamba, a novel framework that transitions trained decoder-only MLLMs into linear-complexity models by leveraging a three-stage distillation process. This transition not only maintains the multimodal capabilities of the original models but also significantly enhances their scalability and resource efficiency.
Key Contributions and Methodology
The paper's central contribution is a progressive distillation framework that directly converts trained Transformer-based MLLMs into models built on Mamba-2, a state space model (SSM) with linear computational complexity. The conversion proceeds as follows:
- Seeding Strategy: Mamba-2 parameters are initialized by inheriting weights from the pre-trained Transformer's attention layers, so that the converted layers initially approximate the attention behavior they replace.
- Stage-1 and Stage-2 Distillation: Stage-1 trains only the newly introduced SSM-specific parameters, while Stage-2 tunes the full parameters of each converted layer. Both stages align layer-wise behavior between the Mamba-2 student and the original Transformer to preserve multimodal understanding.
- Stage-3 Distillation: The final stage performs end-to-end knowledge alignment by minimizing the KL-divergence between the output distributions of the student (Mamba-2) and teacher (Transformer) models. A minimal sketch of the seeding step and the distillation losses follows this list.
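To make these objectives concrete, here is a minimal PyTorch-style sketch of the seeding step and the two kinds of losses involved. The attribute names, the Q/K/V-to-SSM mapping, the MSE distance, and the loss weighting are all illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def seed_mamba_from_attention(attn_layer, mamba_layer):
    # Seeding (illustrative assumption): reuse the Transformer layer's
    # attention projections to initialize the corresponding Mamba-2
    # projections (roughly Q -> C, K -> B, V -> x), so the converted layer
    # starts out approximating the attention map it replaces. The exact
    # parameter mapping used by mmMamba may differ; the attribute names
    # here are hypothetical.
    mamba_layer.C_proj.weight.data.copy_(attn_layer.q_proj.weight.data)
    mamba_layer.B_proj.weight.data.copy_(attn_layer.k_proj.weight.data)
    mamba_layer.x_proj.weight.data.copy_(attn_layer.v_proj.weight.data)

def layerwise_distill_loss(student_hiddens, teacher_hiddens):
    # Stages 1-2: align each converted layer's output with the frozen
    # teacher's output at the same depth (MSE is an assumed distance).
    # In Stage-1 only the SSM-specific parameters would receive gradients;
    # in Stage-2 the whole layer is trained.
    losses = [F.mse_loss(h_s, h_t.detach())
              for h_s, h_t in zip(student_hiddens, teacher_hiddens)]
    return torch.stack(losses).mean()

def end_to_end_kl_loss(student_logits, teacher_logits, temperature=1.0):
    # Stage 3: minimize the KL-divergence between the student's and
    # teacher's next-token distributions over the vocabulary.
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```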
The framework supports two architectural variants: mmMamba-linear, which converts all Transformer layers to Mamba-2 and yields a fully linear-complexity model, and mmMamba-hybrid, which strategically interleaves Mamba-2 and Transformer layers to balance efficiency and performance.
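As a rough illustration of how such an interleaved stack might be expressed, the snippet below builds a layer schedule that keeps full attention in every fourth layer and converts the rest to Mamba-2. The ratio, placement, and function name are assumptions for illustration, not the paper's published configuration.

```python
def build_hybrid_schedule(num_layers: int, attention_every: int = 4) -> list[str]:
    # Keep every `attention_every`-th layer as a Transformer (attention)
    # layer and convert the remaining layers to Mamba-2.
    return [
        "transformer" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(num_layers)
    ]

print(build_hybrid_schedule(8))
# ['mamba2', 'mamba2', 'mamba2', 'transformer',
#  'mamba2', 'mamba2', 'mamba2', 'transformer']
```

Varying `attention_every` trades throughput and memory against accuracy, which is the design knob the hybrid variant exposes.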
Empirical Validation and Results
The authors substantiate their contributions with extensive experimental validation across multiple benchmarks. Noteworthy findings include:
- Performance: mmMamba-linear achieves competitive results against existing quadratic-complexity vision-language models (VLMs) with considerable parameter efficiency, even surpassing models such as EVE-7B while using roughly half the parameter count.
- Computational Efficiency: mmMamba-linear delivers a significant improvement in inference efficiency, with up to 20.6× speedup and 75.8% GPU memory reduction compared to the quadratic-complexity HoVLE model; the asymptotic source of these gains is sketched after this list.
- Flexibility of Design: The architecture is adaptable, with mmMamba-hybrid improving performance across tasks by mixing Transformer and Mamba-2 layers, striking a practical balance between resource consumption and output quality.
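The source of these efficiency gains can be made explicit with the standard per-token decoding costs of the two token-mixing mechanisms. The expressions below are textbook asymptotics, not figures reported in the paper, with $L$ the sequence length so far, $d$ the hidden width, and $N$ the SSM state size.

```latex
\underbrace{\mathcal{O}(L\,d)\ \text{compute},\quad \mathcal{O}(L\,d)\ \text{KV-cache memory}}_{\text{self-attention, per generated token}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}(d\,N)\ \text{compute},\quad \mathcal{O}(d\,N)\ \text{state memory}}_{\text{Mamba-2, per generated token}}
```

Because the Mamba-2 cost is independent of $L$, the speedup and memory savings over an attention baseline widen as prompts and generations grow longer, which is why the reported gains are stated as "up to" a given factor rather than as a constant.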
Implications and Future Directions
The framework's primary practical implication is that it could broaden access to efficient multimodal processing in resource-constrained environments, expanding opportunities for deploying intelligent systems on edge devices. On the methodological side, the main advance is demonstrating that quadratic-complexity attention can be distilled into linear-complexity SSM layers without training from scratch, which points toward more scalable and sustainable paths in AI model design.
For future work, the paper suggests refining the seeding and distillation strategies to improve initialization robustness, and exploring deeper Mamba-2 architectures that might further close the performance gap with large-scale Transformers. Extending the approach to video and high-resolution image processing is another promising direction, as it would leverage Mamba-2's memory efficiency on extended sequences.
In closing, the research effectively outlines a path forward for more efficient multimodal models, highlighting both the practical benefits and the still-unlocked potential of transitioning sophisticated AI architectures into more deployable forms through strategic distillation methodologies.