- The paper introduces Aria, an open multimodal Mixture-of-Experts (MoE) model that activates 3.9B parameters per visual token and 3.5B per text token for computational efficiency.
- A four-stage pre-training pipeline uses 6.4 trillion language tokens and 400 billion multimodal tokens, with a dedicated stage that extends the context window for long-form understanding.
- Evaluations show Aria outperforming comparable open models on language, vision, and coding tasks, setting a new benchmark for open-source multimodal research.
Aria: An Open Multimodal Mixture-of-Experts Model
The paper presents "Aria", an open multimodal native model notable for both its architecture and its open accessibility, addressing key challenges in integrating diverse real-world modalities. As a Mixture-of-Experts (MoE) model, Aria activates 3.9 billion parameters per visual token and 3.5 billion per text token. With this design it outperforms open counterparts such as Pixtral-12B and Llama3.2-11B on multimodal, language, and coding tasks, while remaining competitive with leading proprietary models.
Model Architecture
Aria's architecture is built around a fine-grained MoE decoder, which is more computationally efficient than a conventional dense model of similar capacity. For each input token, only a small subset of experts is activated, reducing the cost of both training and inference. In total, Aria has 24.9 billion parameters, of which 3.5 billion are activated per text token and 3.9 billion per visual token, so the model retains strong performance despite its small activated parameter count.
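To make the activated-parameter idea concrete, here is a minimal PyTorch sketch of a fine-grained MoE feed-forward layer with top-k routing. The expert count, hidden sizes, and top-k value are illustrative placeholders, not Aria's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Sketch of a fine-grained MoE feed-forward layer: each token is routed to a
    small subset of experts, so only a fraction of the layer's parameters are
    activated per token. All sizes here are illustrative, not Aria's."""

    def __init__(self, d_model=1024, d_expert=512, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only the chosen experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out


# Only top_k of n_experts experts fire for each token, so the number of
# activated parameters is far smaller than the layer's total parameter count.
layer = MoEFeedForward()
print(layer(torch.randn(16, 1024)).shape)        # torch.Size([16, 1024])
```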
Pre-training Pipeline
The model is trained with a four-stage pipeline (sketched schematically after the list):
- Language Pre-training: Builds core linguistic ability from a dataset of 6.4 trillion language tokens.
- Multimodal Pre-training: Develops multimodal competence on 400 billion tokens spanning diverse data categories, teaching the model to relate visual and textual inputs.
- Multimodal Long-Context Pre-training: Extends the context window to 64,000 tokens, substantially improving comprehension of long-form content such as lengthy documents and videos.
- Multimodal Post-training: Refines question-answering and instruction-following ability on curated datasets, incorporating feedback-driven refinements.
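The staging can be pictured as a simple schedule. The configuration sketch below is illustrative only: the stage names, token budgets, and the 64K context length come from the summary above, while the pre-extension sequence length and the data-mixture weights are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Stage:
    name: str
    max_seq_len: int                       # context window used during this stage
    token_budget: Optional[float] = None   # training tokens; None = not stated above
    data_mixture: dict = field(default_factory=dict)


# The 8_192 pre-extension context and the mixture weights are assumptions.
PIPELINE = [
    Stage("language_pretraining", 8_192, 6.4e12, {"text": 1.0}),
    Stage("multimodal_pretraining", 8_192, 4.0e11,
          {"text": 0.5, "interleaved_image_text": 0.5}),
    Stage("multimodal_long_context_pretraining", 65_536, None,
          {"long_documents": 0.5, "long_videos": 0.5}),
    Stage("multimodal_post_training", 65_536, None,
          {"qa_and_instruction_following": 1.0}),
]

for stage in PIPELINE:
    budget = f"{stage.token_budget:.1e} tokens" if stage.token_budget else "budget not stated"
    print(f"{stage.name:40s} ctx={stage.max_seq_len:>6,d}  {budget}")
```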
Evaluation
Aria performs strongly across a comprehensive range of benchmarks. Highlights include leading results in long-context video and document understanding and close alignment with proprietary models such as GPT-4o across a variety of tasks.
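A back-of-the-envelope calculation shows why the 64K-token window matters for video understanding: if each sampled frame is encoded into a fixed number of visual tokens, the window caps how many frames fit into a single prompt. The per-frame token count and text overhead below are assumptions for illustration, not figures from the paper.

```python
# Rough frame budget for long-video understanding under a 64K context window.
CONTEXT_WINDOW = 64_000
TOKENS_PER_FRAME = 256     # ASSUMPTION: visual tokens per sampled frame
TEXT_OVERHEAD = 2_000      # ASSUMPTION: room reserved for the question and answer

max_frames = (CONTEXT_WINDOW - TEXT_OVERHEAD) // TOKENS_PER_FRAME
print(f"~{max_frames} frames fit in a single prompt")   # ~242 frames
```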
Contribution to Multimodal Understanding
Aria contributes to multimodal research in two ways: it documents a training procedure for building capable multimodal native models, and it releases the model weights, lowering the barrier to adoption and adaptation. This openness has practical value, enabling replication and further exploration of Aria's architectural choices in both academic and commercial settings.
Future Directions
Open challenges remain, notably further improving computational efficiency and investigating more advanced routing algorithms within the MoE framework. Future research could pursue these directions while comparing Aria's scalability and effectiveness against emerging proprietary multimodal models.
By opening Aria's resources to the public, the paper sets a strong benchmark for open-source contributions to multimodal AI research and is likely to inspire subsequent developments in the field. This initiative can foster collaboration and innovation in managing and understanding the complexity of multimodal data.