Effective combination of MoMa and Mixture-of-Depths for autoregressive inference
Develop an effective method to combine mixture of modality-aware experts (MoMa) with mixture-of-depths (MoD) within mixed-modal early-fusion transformer language models so that autoregressive inference performance improves, rather than degrades, relative to MoMa-only baselines while retaining the pre-training efficiency benefits of both techniques.
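To make the intended combination concrete, the sketch below illustrates, in PyTorch, how the two mechanisms could compose inside a single transformer block: a mixture-of-depths router keeps only a top-k subset of tokens at the layer, and the kept tokens are then routed to modality-specific expert groups (MoMa-style) in the feed-forward position. This is a minimal illustrative sketch, not the paper's implementation; all class, function, and parameter names (ModalityAwareExperts, MoDMoMaBlock, capacity, etc.) are assumptions introduced here for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAwareExperts(nn.Module):
    """MoMa-style FFN (illustrative): tokens are partitioned by modality and
    each modality group has its own pool of experts with learned top-1 routing."""

    def __init__(self, d_model, d_ff, experts_per_modality=2, num_modalities=2):
        super().__init__()
        self.num_modalities = num_modalities
        self.experts = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(experts_per_modality)])
            for _ in range(num_modalities)])
        self.routers = nn.ModuleList(
            [nn.Linear(d_model, experts_per_modality) for _ in range(num_modalities)])

    def forward(self, x, modality):
        # x: (n_tokens, d_model); modality: (n_tokens,) with values in [0, num_modalities)
        out = torch.zeros_like(x)
        for m in range(self.num_modalities):
            idx = (modality == m).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            tok = x[idx]
            gate = F.softmax(self.routers[m](tok), dim=-1)     # (n_m, num_experts)
            w, e_idx = gate.max(dim=-1)                        # top-1 expert per token
            for e, expert in enumerate(self.experts[m]):
                sel = (e_idx == e).nonzero(as_tuple=True)[0]
                if sel.numel() > 0:
                    out[idx[sel]] = w[sel, None] * expert(tok[sel])
        return out


class MoDMoMaBlock(nn.Module):
    """Illustrative block composing mixture-of-depths with the MoMa FFN above:
    a scalar depth router keeps only the top `capacity` fraction of tokens for
    this layer; unselected tokens bypass it through the residual stream."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256, capacity=0.5):
        super().__init__()
        self.capacity = capacity
        self.depth_router = nn.Linear(d_model, 1)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ModalityAwareExperts(d_model, d_ff)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, modality):
        # x: (batch, seq, d_model); modality: (batch, seq)
        b, s, _ = x.shape
        k = max(1, int(self.capacity * s))
        scores = self.depth_router(x).squeeze(-1)              # (b, s)
        # NOTE: this top-k looks at the whole sequence at once, which is fine
        # for training but has no direct causal analogue at decode time --
        # this is the crux of the open problem stated in the source paper.
        _, top_idx = scores.topk(k, dim=-1)
        out = x.clone()
        causal = torch.triu(torch.full((k, k), float("-inf")), diagonal=1)
        for i in range(b):
            idx = top_idx[i].sort().values                     # keep sequence order
            h = x[i, idx].unsqueeze(0)                         # (1, k, d_model)
            a = self.ln1(h)
            h = h + self.attn(a, a, a, attn_mask=causal)[0]
            h = h + self.ffn(self.ln2(h).squeeze(0), modality[i, idx]).unsqueeze(0)
            # Scale the update by the router score so token selection stays differentiable.
            g = scores[i, idx].sigmoid().unsqueeze(-1)
            out[i, idx] = x[i, idx] + g * (h.squeeze(0) - x[i, idx])
        return out


# Tiny smoke test: 2 sequences of 10 mixed-modal tokens (0 = text, 1 = image).
block = MoDMoMaBlock()
tokens = torch.randn(2, 10, 64)
modality = torch.randint(0, 2, (2, 10))
print(block(tokens, modality).shape)                           # torch.Size([2, 10, 64])
```

The sketch also makes the difficulty visible: the depth router's top-k selection is computed over the full sequence, so a naive port to token-by-token autoregressive decoding would let earlier routing decisions depend on future tokens, which is precisely why combining MoD with MoMa without degrading inference quality remains open.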
References
However, effectively combining these two approaches for better performance in an autoregressive inference setup remains an open research challenge.
— MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (Lin et al., 31 Jul 2024, arXiv:2407.21770), Section 1, Introduction (main contributions list, item 2)