MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (2407.21770v3)

Published 31 Jul 2024 in cs.AI and cs.LG

Abstract: We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion LLMs. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion LLM pre-training, paving the way for more resource-efficient and capable multimodal AI systems.

Authors (8)
  1. Xi Victoria Lin
  2. Akshat Shrivastava
  3. Liang Luo
  4. Srinivasan Iyer
  5. Mike Lewis
  6. Luke Zettlemoyer
  7. Armen Aghajanyan
  8. Gargi Ghosh

Summary

Efficient Early-Fusion Pre-Training with Mixture of Modality-Aware Experts: A Detailed Overview

The paper presents MoMa, a modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion LLMs. By leveraging modality-specific expert modules, MoMa introduces a novel approach to handling images and text, resulting in significant computational efficiency improvements. This summary explores the paper's contributions, methodologies, and potential implications in the domain of multimodal AI systems.

Core Contributions

The authors identify and address the inherent computational challenge of scaling mixed-modal early-fusion models. Mixed-modal models typically use a unified architecture for integrating different types of data, such as text and images, but this can lead to significant computational inefficiencies. The key contributions of the paper include:

  1. Introduction of Modality-Aware Experts: MoMa divides experts into modality-specific groups to handle different token types more efficiently while maintaining effective cross-modality integration via shared self-attention mechanisms.
  2. Hierarchical Routing Mechanism: A two-stage routing process is employed wherein tokens are first routed based on their modality and then further routed within their respective modality-specific group.
  3. Combination with Mixture-of-Depths (MoD): The architecture also incorporates MoD to introduce significant depth sparsity, enabling tokens to selectively skip certain layers, thus enhancing computational efficiency.
  4. Empirical Analysis and Performance Gains: Extensive empirical evaluations illustrate that MoMa achieves substantial pre-training efficiency gains, significantly reducing FLOPs while maintaining competitive performance.

Methodology

Modality-Aware Sparsity

Modality-Aware Sparsity (MaS) allocates modality-specific modules to exploit the distinct characteristics of text and image tokens. Experts are divided into text-specific and image-specific groups, and tokens are routed within these groups so that the model retains semantically informed adaptivity while each group specializes on its own modality.
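To make the partition concrete, below is a minimal PyTorch sketch, not the authors' implementation, of a feed-forward layer with modality-specific expert groups. The class and function names, the boolean modality-mask convention, and the uniform soft mixture used inside each group are illustrative assumptions; the paper's learned expert-choice routing within each group is sketched in the next subsection.

```python
import torch
import torch.nn as nn


def mix_within_group(tokens: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
    """Placeholder in-group routing: a uniform soft mixture over the group's
    experts. MoMa instead uses learned expert-choice routing (next sketch)."""
    if tokens.shape[0] == 0:
        return tokens
    return torch.stack([expert(tokens) for expert in experts]).mean(dim=0)


class ModalityAwareFFN(nn.Module):
    """Feed-forward layer with separate text and image expert groups (illustrative)."""

    def __init__(self, d_model: int, n_text_experts: int = 4, n_image_experts: int = 4):
        super().__init__()

        def make_expert() -> nn.Module:
            return nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
            )

        self.text_experts = nn.ModuleList([make_expert() for _ in range(n_text_experts)])
        self.image_experts = nn.ModuleList([make_expert() for _ in range(n_image_experts)])

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); is_image: (num_tokens,) boolean modality mask.
        out = torch.empty_like(x)
        out[~is_image] = mix_within_group(x[~is_image], self.text_experts)
        out[is_image] = mix_within_group(x[is_image], self.image_experts)
        return out
```

Because the split happens only in the feed-forward experts, cross-modality interaction still occurs in the shared self-attention layers, consistent with the paper's description of cross-modality integration.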

Expert Choice Routing: The paper employs expert-choice (EC) routing to ensure each expert processes a balanced number of tokens, facilitating high training throughput.
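The following is a minimal sketch of expert-choice routing under common assumptions (softmax affinity scores and a fixed per-expert capacity of roughly capacity_factor * n / E tokens). It is not the paper's exact gating function; it only illustrates why EC routing balances load by construction.

```python
import torch
import torch.nn as nn


class ExpertChoiceRouter(nn.Module):
    """Expert-choice routing sketch: each expert picks its own top-`capacity`
    tokens, so every expert processes the same number of tokens (balanced load)."""

    def __init__(self, d_model: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.capacity_factor = capacity_factor

    def forward(self, x: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
        # x: (n, d_model). Affinity of every token for every expert.
        n = x.shape[0]
        scores = torch.softmax(self.gate(x), dim=-1)          # (n, num_experts)
        capacity = min(n, max(1, int(self.capacity_factor * n / len(experts))))
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            # The expert (not the token) chooses: take its top-`capacity` tokens.
            top_scores, top_idx = scores[:, e].topk(capacity)
            out[top_idx] += top_scores.unsqueeze(-1) * expert(x[top_idx])
        return out
```

In the hierarchical scheme described next, one such router would be instantiated per modality group.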

Hierarchical Routing: Tokens are first routed deterministically according to their modality and then assigned to experts within the corresponding modality-specific group by learned routing functions. This hierarchical structure helps optimize both intra-modality and cross-modality information processing.
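A short composition of the two stages follows; it is a sketch that assumes the ExpertChoiceRouter class and imports from the sketches above are in scope, and all names are illustrative rather than the paper's.

```python
class HierarchicalMoMaFFN(nn.Module):
    """Stage 1: deterministic routing by modality.
    Stage 2: learned expert-choice routing within the chosen group (illustrative)."""

    def __init__(self, d_model: int, n_experts_per_group: int = 4):
        super().__init__()

        def make_group() -> nn.ModuleList:
            return nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts_per_group)
            ])

        self.experts = nn.ModuleDict({"text": make_group(), "image": make_group()})
        self.routers = nn.ModuleDict({
            m: ExpertChoiceRouter(d_model, n_experts_per_group) for m in ("text", "image")
        })

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        out = torch.zeros_like(x)
        for modality, mask in (("text", ~is_image), ("image", is_image)):
            if mask.any():  # stage 1: send each token to its modality group
                out[mask] = self.routers[modality](x[mask], self.experts[modality])  # stage 2
        return out
```

Text and image tokens thus never share feed-forward expert parameters, while the surrounding attention layers remain shared across modalities.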

Mixture-of-Depths (MoD)

By integrating MoD, tokens can selectively skip the computation of certain layers, guided by additional auxiliary routers. Combining width and depth sparsity in this way yields notable efficiency gains, although performance degrades somewhat in causal inference, where the model becomes more sensitive to router accuracy.
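As a rough illustration of depth sparsity, here is a minimal MoD-style wrapper, a sketch under assumptions rather than the paper's code: a lightweight router scores every token, only the top-scoring fraction runs the wrapped block, and the remaining tokens pass through unchanged.

```python
import torch
import torch.nn as nn


class MoDWrapper(nn.Module):
    """Mixture-of-depths sketch: only the top-`capacity` tokens (by router
    score) are processed by the wrapped block; the rest skip it entirely."""

    def __init__(self, block: nn.Module, d_model: int, capacity_factor: float = 0.25):
        super().__init__()
        self.block = block                                # e.g. an FFN or attention sub-layer
        self.router = nn.Linear(d_model, 1, bias=False)   # per-token "compute here?" score
        self.capacity_factor = capacity_factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n, d_model)
        n = x.shape[0]
        capacity = min(n, max(1, int(self.capacity_factor * n)))
        scores = self.router(x).squeeze(-1)               # (n,)
        top_scores, top_idx = scores.topk(capacity)
        out = x.clone()                                   # skipped tokens: identity pass-through
        # Scale the block output by the router weight so routing is trainable end-to-end.
        out[top_idx] = x[top_idx] + torch.sigmoid(top_scores).unsqueeze(-1) * self.block(x[top_idx])
        return out
```

Because the top-k selection during training looks at the whole sequence, causal generation has to rely on auxiliary routers that predict the selection token by token; this is consistent with the paper's observation that performance in causal inference is sensitive to router accuracy.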

Experimental Evaluation

Efficiency and Performance

The authors provide an extensive empirical evaluation across several experimental configurations, controlling for FLOPs to ensure a fair comparison. Key findings include:

  • Improved Pre-Training Efficiency: MoMa, with 4 text experts and 4 image experts, achieved a 3.7× overall FLOPs savings compared to a dense baseline, with specific savings of 2.6× for text and 5.2× for images.
  • Combining MoMa with MoD: This combination, referred to as ChaMoMaD, achieved a 4.2× overall FLOPs savings (text: 3.4×, image: 5.3×), although performance during causal inference was reduced due to increased sensitivity to routing accuracy.

Practical Implications and Future Research

The efficiency gains introduced by MoMa have significant practical implications, offering a more resource-efficient methodology for developing multimodal AI systems. The results suggest that modality-aware sparsity, combined with hierarchical routing, is a viable solution to the computational challenges that arise in mixed-modal early-fusion models.

Future Research Directions: The paper opens several avenues for future work:

  • Improving Routing Accuracy: Enhancing the accuracy of the auxiliary routers, especially in the context of MoD, is critical for better performance during causal inference.
  • Exploring Modality-Tied Architectures: Investigating more complex configurations, including combinations of different sparsity patterns, can potentially yield further advancements in efficiency and performance.
  • Expanding to More Modalities: Extending the current methodology to incorporate other modalities such as audio or video could be explored for broader application scenarios.

Conclusion

In summary, the MoMa architecture represents a significant step forward in optimizing mixed-modal early-fusion models. By addressing the distinct computational demands of image and text tokens with modality-aware experts and by combining width and depth sparsity, the proposed model achieves impressive efficiency gains while maintaining competitive performance. This work lays a solid foundation for the future development of scalable and resource-efficient multimodal AI systems.
