Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
The paper introduces MoTE, a novel architecture that addresses the memory inefficiency of large multimodal models built on Mixture-of-Experts (MoE) layers. MoE models improve computational efficiency by activating only a subset of a large pool of experts for each input token, reducing the floating-point operations (FLOPs) per forward pass. However, they typically carry a high memory footprint because every expert is stored in full precision, which complicates deployment, especially on edge devices with constrained resources. MoTE proposes an alternative: up-cycle from dense checkpoints into a mixture of ternary experts, enabling efficient scaling without excessive memory demands.
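For readers unfamiliar with sparse MoE routing, the minimal PyTorch sketch below shows the standard top-k gating step that makes this FLOP reduction possible; the tensor sizes and `top_k` value are illustrative assumptions, not the paper's configuration.

```python
import torch

def topk_routing(gate_logits, top_k=2):
    """Standard sparse-MoE routing step: for each token, keep only the
    top_k expert scores and renormalize them, so only top_k experts
    (out of the full pool) run on that token -- this is what keeps
    per-token FLOPs low even as the number of experts grows."""
    weights, expert_idx = gate_logits.topk(top_k, dim=-1)
    weights = weights.softmax(dim=-1)
    return weights, expert_idx                 # both: (num_tokens, top_k)

# Example: 6 tokens routed over 8 experts, 2 active experts per token.
logits = torch.randn(6, 8)
w, idx = topk_routing(logits)
print(idx)                                     # which 2 of the 8 experts each token uses
```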
Architectural Innovations
MoTE takes a distinctive approach to expert configuration: rather than training a small number of high-precision experts, it trains a larger number of low-precision, ternary experts. The model reuses the pre-trained feed-forward networks (FFNs) as shared components and introduces ternary routed experts whose parameters are constrained to {-1, 0, 1}. This configuration maintains end-task performance comparable to full-precision models while substantially reducing the memory footprint.
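As a concrete illustration, a common way to constrain weights to {-1, 0, 1} is absmean-style ternarization with a per-matrix scale; the sketch below combines a full-precision shared FFN with ternary routed experts under top-1 routing. The class names, layer sizes, routing choice, and quantization details are assumptions for illustration, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ternarize(w, eps=1e-5):
    """Absmean-style ternarization: map a full-precision matrix to values
    in {-1, 0, 1} plus a single per-matrix scale. (During training, a
    straight-through estimator would carry gradients past the
    round/clamp; omitted here for brevity.)"""
    scale = w.abs().mean().clamp(min=eps)
    w_t = (w / scale).round().clamp(-1, 1)
    return w_t, scale

class TernaryLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)

    def forward(self, x):
        w_t, scale = ternarize(self.weight)
        return F.linear(x, w_t) * scale          # ternary matmul, then rescale

class MoTELayerSketch(nn.Module):
    """Shared full-precision FFN (reused from the dense checkpoint)
    plus ternary routed experts selected by a lightweight router."""

    def __init__(self, dim=512, hidden=2048, num_experts=4):
        super().__init__()
        self.shared_ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(TernaryLinear(dim, hidden), nn.GELU(),
                          TernaryLinear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (num_tokens, dim)
        probs = self.router(x).softmax(dim=-1)
        top_p, top_i = probs.max(dim=-1)         # top-1 routing, for brevity
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                routed[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return self.shared_ffn(x) + routed       # shared output + routed ternary output
```

The key point of the design is that only the routed experts are ternary; the shared FFN keeps the dense checkpoint's full-precision weights, so most of the added capacity comes essentially for free in memory terms.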
Strong Empirical Results
Extensive experiments illustrate MoTE's scalability and memory efficiency:
- When up-cycled from dense checkpoints, MoTE achieves performance comparable to its full-precision counterpart, MoE-LLaVA, at model sizes beyond 1.5 billion parameters.
- Combined with post-training quantization, MoTE further improves efficiency: at a fixed expert memory footprint of 3.4 GB, it outperforms MoE-LLaVA by 4.3% in average accuracy, underscoring its potential for deployment on memory-constrained devices (a back-of-the-envelope footprint calculation is sketched after this list).
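To see why ternary experts stretch a fixed memory budget, the rough calculation below compares how many fp16 versus ternary experts fit in the same footprint. All sizes and the ~1.58 bits-per-weight packing assumption are hypothetical and chosen only to show the arithmetic; they are not taken from the paper.

```python
# Hypothetical FFN expert sizes, for illustration only (not the paper's configuration).
dim, hidden = 2048, 5632
params_per_expert = 2 * dim * hidden            # two weight matrices per FFN expert

budget_bytes = 3.4 * 2**30                      # fixed expert memory budget
bits_fp16 = 16
bits_ternary = 1.58                             # ~log2(3); real kernels may pack at 2 bits/weight

fp16_experts = budget_bytes / (params_per_expert * bits_fp16 / 8)
ternary_experts = budget_bytes / (params_per_expert * bits_ternary / 8)

print(f"fp16 experts that fit   : {fp16_experts:.1f}")
print(f"ternary experts that fit: {ternary_experts:.1f}")  # roughly 10x more experts in the same budget
```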
Advantageous Scaling
The scaling behavior of ternary up-cycled models is encouraging. As model size increases, the performance gap between MoTE and full-precision MoEs narrows, suggesting that ternary models may match or even surpass their full-precision counterparts at larger scales. This trend is consistent with other findings on low-bit pre-training of large models and points to MoTE playing a meaningful role in building large-scale AI systems that are both resource-efficient and high-performing.
Practical and Theoretical Implications
Practically, this work offers an architecture that maintains model performance while cutting memory costs, which is crucial for real-world applications on edge devices. The proposed ternary experts are also compatible with standard quantization techniques, opening pathways for further reducing deployment costs across a range of hardware configurations.
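To illustrate what compatibility with standard quantization techniques can look like in practice, the snippet below applies a generic per-channel absmax int8 weight quantization to a non-ternary shared component; the function and its granularity are illustrative assumptions, not the specific post-training quantization method used in the paper.

```python
import torch

def int8_absmax_quantize(weight):
    """Generic per-output-channel absmax int8 weight quantization, as used
    in many standard post-training quantization pipelines. The ternary
    routed experts need no such step: they are already stored as
    {-1, 0, 1} values plus a per-matrix scale."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0   # (out_dim, 1)
    scale = scale.clamp(min=1e-8)
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

# Usage on a hypothetical shared-FFN weight matrix:
w = torch.randn(2048, 512)
q, s = int8_absmax_quantize(w)
print("max reconstruction error:", (dequantize(q, s) - w).abs().max().item())
```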
Theoretically, MoTE challenges traditional approaches to expert configuration by demonstrating the viability of ternary setups. It prompts further inquiry into the dynamics of ternary model training and optimization, potentially informing future advancements in efficient model design strategies.
Future Directions
The paradigm introduced by MoTE invites several future research directions:
- Thorough investigation into the theoretical underpinnings and training dynamics of ternary MoE models, which could further refine efficient model designs.
- Expansion of MoTE's application beyond multimodal settings to other areas requiring efficient large-scale processing.
- Development of improved training recipes and initialization strategies that enhance convergence rates and performance outcomes of ternary experts.
In conclusion, MoTE is a promising route toward more resource-efficient large multimodal models, advancing memory-efficient design without sacrificing task performance. It offers a compelling alternative to full-precision MoE models, particularly in scenarios demanding lightweight deployment, and marks notable progress toward efficient AI model architectures.