
Effective combination of MoMa and Mixture-of-Depths for autoregressive inference

Develop an effective method to combine mixture of modality-aware experts (MoMa) with mixture-of-depths (MoD) within mixed-modal early-fusion transformer language models so that autoregressive inference performance improves, rather than degrades, relative to MoMa-only baselines while retaining the pre-training efficiency benefits of both techniques.


Background

The paper introduces modality-aware sparsity for mixed-modal, early-fusion LLMs via two orthogonal techniques: width scaling with mixture of modality-aware experts (MoMa) and depth scaling with mixture-of-depths (MoD). MoMa divides experts into text-specific and image-specific groups with learned intra-modality routing, while MoD dynamically skips computation at certain layers using learned routers.
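
For concreteness, the following is a minimal PyTorch sketch of the modality-aware routing idea. The expert counts, the use of single linear layers as experts, and the top-1 intra-modality routing are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Sketch of MoMa-style width sparsity: tokens are partitioned by modality
    and routed only among experts reserved for that modality.
    Illustrative assumptions: linear experts, top-1 routing, 4 experts per group."""

    def __init__(self, d_model=512, n_text_experts=4, n_image_experts=4):
        super().__init__()
        self.text_experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_text_experts)])
        self.image_experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_image_experts)])
        self.text_router = nn.Linear(d_model, n_text_experts)
        self.image_router = nn.Linear(d_model, n_image_experts)

    def forward(self, x, is_image):
        # x: (seq, d_model); is_image: (seq,) boolean modality mask
        out = torch.zeros_like(x)
        for mask, experts, router in (
            (~is_image, self.text_experts, self.text_router),
            (is_image, self.image_experts, self.image_router),
        ):
            if not mask.any():
                continue
            tokens = x[mask]
            # learned intra-modality routing (top-1 for simplicity)
            expert_idx = router(tokens).argmax(dim=-1)
            routed = torch.zeros_like(tokens)
            for e, expert in enumerate(experts):
                sel = expert_idx == e
                if sel.any():
                    routed[sel] = expert(tokens[sel])
            out[mask] = routed
        return out
```

A call such as `ModalityAwareMoE()(torch.randn(10, 512), torch.tensor([False]*6 + [True]*4))` routes the first six tokens through text experts and the last four through image experts.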

Empirically, both MoMa and MoD individually accelerate pre-training loss convergence under matched FLOPs, and their combination further improves pre-training efficiency. However, the authors find that combining MoD with MoMa hurts autoregressive inference performance due to increased sensitivity to router accuracy, and auxiliary routers trained post hoc do not sufficiently mitigate this in causal settings. This motivates an explicit open research challenge to develop an approach that jointly leverages MoMa and MoD while maintaining strong inference-time performance.
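
The source of the router-accuracy sensitivity can be illustrated with a sketch of MoD-style depth routing. Training-time top-k token selection uses scores computed over the whole sequence, whereas autoregressive decoding must make a causal, per-token keep-or-skip decision; the thresholded auxiliary router below is a hypothetical stand-in for the post-hoc router the paper describes, not its exact mechanism.

```python
import torch
import torch.nn as nn

class MixtureOfDepthsLayer(nn.Module):
    """Sketch of MoD-style depth sparsity around a token-wise sub-block (e.g. an MLP).
    Training: keep the top capacity*seq tokens by router score (non-causal).
    Inference: each token must be kept or skipped on its own, here via a
    hypothetical thresholded auxiliary router."""

    def __init__(self, block, d_model=512, capacity=0.5):
        super().__init__()
        self.block = block                       # token-wise module: (n, d) -> (n, d)
        self.router = nn.Linear(d_model, 1)      # scores for sequence-level top-k
        self.aux_router = nn.Linear(d_model, 1)  # causal keep/skip predictor
        self.capacity = capacity

    def forward(self, x, inference=False):
        scores = self.router(x).squeeze(-1)      # (seq,)
        if not inference:
            # non-causal: select the top capacity*seq tokens across the sequence
            k = max(1, int(self.capacity * x.size(0)))
            keep = torch.zeros_like(scores, dtype=torch.bool)
            keep[scores.topk(k).indices] = True
        else:
            # causal decoding: each token decides independently, so inference
            # quality hinges on the accuracy of this auxiliary router
            keep = torch.sigmoid(self.aux_router(x).squeeze(-1)) > 0.5
        out = x.clone()
        if keep.any():
            # gate the residual update by the router score so the score is trained
            gate = torch.sigmoid(scores[keep]).unsqueeze(-1)
            out[keep] = x[keep] + gate * self.block(x[keep])
        return out
```

The gap between the two branches of `forward` is exactly where the combination with MoMa becomes fragile at decoding time: any mismatch between the auxiliary router's per-token decisions and the training-time top-k selection compounds with the expert-routing decisions made by MoMa.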

References

However, effectively combining these two approaches for better performance in an auto-regressive inference setup remains an open research challenge.

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (arXiv:2407.21770, Lin et al., 31 Jul 2024), Section 1, Introduction (main contributions list, item 2)