Effectiveness of MoS under Early-Fusion Training

Determine whether the Mixture of States (MoS) multimodal diffusion architecture remains effective under early-fusion training, in which the understanding transformer and the generation transformer are trained jointly with bidirectional interactions. Concretely, this would equip the MoS router with multiple projection layers so that transformer connections run in both directions.

Background

The Mixture of States (MoS) framework introduced in this paper uses a dual-tower architecture consisting of an understanding transformer (frozen during training) and a generation transformer, with a lightweight learnable router enabling adaptive, token-level, timestep-dependent interactions. The training strategy adopted is multi-stage and focuses compute on the generation tower while keeping the understanding tower fixed.
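The one-way interaction described above can be pictured as a small gating module. The sketch below is purely illustrative and not the paper's implementation: all shapes, parameter names (`W_proj`, `W_gate`), and the sigmoid gate are assumptions used to show what "token-level, timestep-dependent routing" from a frozen understanding tower into a generation tower might look like.

```python
import numpy as np

rng = np.random.default_rng(0)
D_U, D_G = 8, 8  # hidden sizes of the two towers (illustrative)

# Hypothetical router parameters: a projection from the understanding
# tower's states into the generation tower's space, plus a gate
# conditioned on each generation token and the diffusion timestep.
W_proj = rng.normal(scale=0.1, size=(D_U, D_G))
W_gate = rng.normal(scale=0.1, size=(D_G + 1, D_G))

def route(h_und, h_gen, t):
    """Token-level, timestep-dependent mixing (sketch, not the paper's code).

    h_und: (T, D_U) states from the frozen understanding tower
    h_gen: (T, D_G) states from the trainable generation tower
    t:     scalar diffusion timestep, here assumed normalized to [0, 1]
    """
    # Gate value per token and channel, conditioned on the timestep.
    t_col = np.full((h_gen.shape[0], 1), t)
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([h_gen, t_col], axis=1) @ W_gate)))
    # Inject projected understanding states as a gated residual.
    return h_gen + gate * (h_und @ W_proj)

h_und = rng.normal(size=(4, D_U))
h_gen = rng.normal(size=(4, D_G))
out = route(h_und, h_gen, t=0.5)
print(out.shape)  # (4, 8)
```

Note that only `W_proj` and `W_gate` would be trained in this setup, matching the strategy of focusing compute on the generation side while the understanding tower stays fixed.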

Prior work on Mixture of Transformers (MoT) has shown strong scalability under early-fusion training regimes where modalities are fused and trained jointly with symmetric layer correspondences. The authors note that while MoS performs well for multimodal generation in their current setup, it remains unknown whether MoS would retain or improve effectiveness under early-fusion training. They propose a principled extension—adding multiple projection layers to the router to enable bidirectional transformer connections—but defer experimentation due to computational and data constraints.
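The proposed extension, giving the router one projection per direction, can be sketched as follows. Again this is a hypothetical illustration under assumed names (`W_u2g`, `W_g2u`) and a shared hidden size; the paper defers the actual design to future work.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # shared hidden size of both towers (assumed for simplicity)

# One projection layer per direction, as the proposed extension suggests:
# understanding -> generation and generation -> understanding.
W_u2g = rng.normal(scale=0.1, size=(D, D))
W_g2u = rng.normal(scale=0.1, size=(D, D))

def route_bidirectional(h_und, h_gen):
    """Dual-way router sketch: each tower receives a projected residual
    from the other, so both towers can be trained jointly (early fusion)
    rather than keeping the understanding tower frozen."""
    new_gen = h_gen + h_und @ W_u2g
    new_und = h_und + h_gen @ W_g2u
    return new_und, new_gen

h_und = rng.normal(size=(3, D))
h_gen = rng.normal(size=(3, D))
u, g = route_bidirectional(h_und, h_gen)
print(u.shape, g.shape)  # (3, 8) (3, 8)
```

The open question is whether training both projections jointly preserves MoS's effectiveness, as MoT's symmetric early-fusion results suggest it might.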

References

MoT has demonstrated strong scalability under early-fusion training. In contrast, while MoS shows promising results for multimodal generation, its effectiveness in early-fusion settings remains to be validated. A principled extension is to endow the router with multiple projection layers to establish bidirectional transformer connections. We defer this exploration to future work due to computational and data constraints.

Mixture of States: Routing Token-Level Dynamics for Multimodal Generation (2511.12207 - Liu et al., 15 Nov 2025) in Limitation and Future Studies — One-Way to Dual-Way Setting