- The paper introduces Aria, an open multimodal Mixture-of-Experts (MoE) model that activates 3.9B parameters per visual token and 3.5B per text token for computational efficiency.
- A four-stage pre-training pipeline uses 6.4 trillion language tokens and 400 billion multimodal tokens, with a dedicated stage that extends the context window for long-form understanding.
- Evaluations show Aria outperforming comparable open models on language, vision, and coding tasks, setting a new benchmark for open-source multimodal research.
Aria: An Open Multimodal Mixture-of-Experts Model
The paper presents "Aria", an open multimodal native model notable for both its architecture and its open accessibility, addressing key challenges in integrating diverse real-world modalities. As a Mixture-of-Experts (MoE) model, Aria activates 3.9 billion parameters per visual token and 3.5 billion per text token. With this design it outperforms open counterparts such as Pixtral-12B and Llama3.2-11B on multimodal, language, and coding tasks, while remaining competitive with leading proprietary models.
Model Architecture
Aria's architecture is built around a fine-grained MoE decoder, which is more computationally efficient than a conventional dense model of similar capacity. For each input token, only a small subset of experts is activated, reducing the cost of both training and inference. In total, Aria has 24.9 billion parameters, of which 3.5 billion are activated per text token and 3.9 billion per visual token, so the model retains strong performance despite its small activated parameter count.
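To make the activated-parameter idea concrete, here is a minimal PyTorch sketch of a fine-grained MoE feed-forward layer with top-k routing. The expert count, hidden sizes, and top-k value are illustrative placeholders, not Aria's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Sketch of a fine-grained MoE feed-forward layer: each token is routed to a
    small subset of experts, so only a fraction of the layer's parameters are
    activated per token. All sizes here are illustrative, not Aria's."""

    def __init__(self, d_model=1024, d_expert=512, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only the chosen experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out


# Only top_k of n_experts experts fire for each token, so the number of
# activated parameters is far smaller than the layer's total parameter count.
layer = MoEFeedForward()
print(layer(torch.randn(16, 1024)).shape)        # torch.Size([16, 1024])
```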
Pre-training Pipeline
The model is trained with a four-stage pipeline (sketched schematically after the list):
- Language Pre-training: Builds core linguistic ability from a dataset of 6.4 trillion language tokens.
- Multimodal Pre-training: Develops multimodal competence on 400 billion tokens spanning diverse data categories, teaching the model to relate visual and textual inputs.
- Multimodal Long-Context Pre-training: Extends the context window to 64,000 tokens, substantially improving comprehension of long-form content such as lengthy documents and videos.
- Multimodal Post-training: Refines question-answering and instruction-following ability on curated datasets, incorporating feedback-driven refinements.
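The staging can be pictured as a simple schedule. The configuration sketch below is illustrative only: the stage names, token budgets, and the 64K context length come from the summary above, while the pre-extension sequence length and the data-mixture weights are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Stage:
    name: str
    max_seq_len: int                       # context window used during this stage
    token_budget: Optional[float] = None   # training tokens; None = not stated above
    data_mixture: dict = field(default_factory=dict)


# The 8_192 pre-extension context and the mixture weights are assumptions.
PIPELINE = [
    Stage("language_pretraining", 8_192, 6.4e12, {"text": 1.0}),
    Stage("multimodal_pretraining", 8_192, 4.0e11,
          {"text": 0.5, "interleaved_image_text": 0.5}),
    Stage("multimodal_long_context_pretraining", 65_536, None,
          {"long_documents": 0.5, "long_videos": 0.5}),
    Stage("multimodal_post_training", 65_536, None,
          {"qa_and_instruction_following": 1.0}),
]

for stage in PIPELINE:
    budget = f"{stage.token_budget:.1e} tokens" if stage.token_budget else "budget not stated"
    print(f"{stage.name:40s} ctx={stage.max_seq_len:>6,d}  {budget}")
```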
Evaluation
Aria performs strongly across a comprehensive range of benchmarks. Highlights include leading results in long-context video and document understanding and close alignment with proprietary models such as GPT-4o across a variety of tasks.
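A back-of-the-envelope calculation shows why the 64K-token window matters for video understanding: if each sampled frame is encoded into a fixed number of visual tokens, the window caps how many frames fit into a single prompt. The per-frame token count and text overhead below are assumptions for illustration, not figures from the paper.

```python
# Rough frame budget for long-video understanding under a 64K context window.
CONTEXT_WINDOW = 64_000
TOKENS_PER_FRAME = 256     # ASSUMPTION: visual tokens per sampled frame
TEXT_OVERHEAD = 2_000      # ASSUMPTION: room reserved for the question and answer

max_frames = (CONTEXT_WINDOW - TEXT_OVERHEAD) // TOKENS_PER_FRAME
print(f"~{max_frames} frames fit in a single prompt")   # ~242 frames
```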
Contribution to Multimodal Understanding
Aria contributes to multimodal research in two ways: it documents a training procedure for building capable multimodal native models, and it releases the model weights, lowering the barrier to adoption and adaptation. This openness has practical value, enabling replication and further exploration of Aria's architectural choices in both academic and commercial settings.
Future Directions
Open challenges remain, notably further improving computational efficiency and investigating more advanced routing algorithms within the MoE framework. Future research could pursue these directions while comparing Aria's scalability and effectiveness against emerging proprietary multimodal models.
By opening Aria's resources to the public, the paper sets a strong benchmark for open-source contributions to multimodal AI research and is likely to inspire subsequent developments in the field. This initiative can foster collaboration and innovation in managing and understanding the complexity of multimodal data.