Effectiveness of MoS under Early-Fusion Training
Determine whether the Mixture of States (MoS) multimodal diffusion architecture remains effective under early-fusion training, where the understanding transformer and the generation transformer are trained jointly with bidirectional interactions. The proposed route is to equip the MoS router with multiple projection layers so that it connects the two transformers in both directions.
References
MoT has demonstrated strong scalability under early-fusion training. In contrast, while MoS shows promising results for multimodal generation, its effectiveness in early-fusion settings remains to be validated. A principled extension is to endow the router with multiple projection layers to establish bidirectional transformer connections. We defer this exploration to future work due to computational and data constraints.
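To make the proposed extension concrete, the following is a minimal sketch of a router with one projection per direction. It is an illustration under stated assumptions, not the MoS implementation: the class name `BidirectionalRouter`, the method names, the dense softmax over source layers, and the per-token list-of-vectors layout are all hypothetical, and the actual MoS router may select states differently (for example, sparsely).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x dim) matrix by a length-dim vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class BidirectionalRouter:
    """Hypothetical router with a separate projection for each direction."""

    def __init__(self, dim, n_und_layers, n_gen_layers, seed=0):
        rng = random.Random(seed)
        # Scores understanding-tower layers for a generation-tower token.
        self.proj_und_to_gen = random_matrix(n_und_layers, dim, rng)
        # Scores generation-tower layers for an understanding-tower token.
        self.proj_gen_to_und = random_matrix(n_gen_layers, dim, rng)

    @staticmethod
    def _route(query, source_states, projection):
        # One weight per source layer, conditioned on the querying token.
        weights = softmax(matvec(projection, query))
        dim = len(query)
        # Weighted sum of the source tower's per-layer hidden states.
        mixed = [
            sum(w * state[i] for w, state in zip(weights, source_states))
            for i in range(dim)
        ]
        return mixed, weights

    def und_to_gen(self, gen_token, und_states):
        """Mix understanding-tower states into the generation tower."""
        return self._route(gen_token, und_states, self.proj_und_to_gen)

    def gen_to_und(self, und_token, gen_states):
        """Mix generation-tower states into the understanding tower."""
        return self._route(und_token, gen_states, self.proj_gen_to_und)


# Toy usage: 3 understanding layers, 2 generation layers, hidden size 4.
rng = random.Random(1)
und_states = [[rng.random() for _ in range(4)] for _ in range(3)]
gen_states = [[rng.random() for _ in range(4)] for _ in range(2)]
router = BidirectionalRouter(dim=4, n_und_layers=3, n_gen_layers=2)
mixed_for_gen, w_und = router.und_to_gen(gen_states[0], und_states)
mixed_for_und, w_gen = router.gen_to_und(und_states[0], gen_states)
```

In this sketch each direction has its own projection, so the two towers can learn different layer preferences. Whether such a design trains stably and effectively under early fusion is exactly the open question stated above.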