Optimal Architectural Design for Unified Multimodal Understanding and Generation
Determine the optimal architectural design for unified multimodal models that jointly perform image understanding and text-to-image generation. In particular, specify how to balance shared and task-specific Transformer components so that representational conflicts are avoided while performance on both tasks remains strong.
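To make the shared-versus-task-specific trade-off concrete, the sketch below parameterizes one possible design space: a trunk of Transformer blocks shared by both tasks, followed by a separate branch per task. This is only an illustration under stated assumptions, not the architecture of any particular paper. The names `ForkedBackboneSpec`, `shared_layers`, `layer_path`, and `sweep` are hypothetical, and the layout assumes both tasks use the same total depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForkedBackboneSpec:
    """Hypothetical layout: a shared trunk followed by one branch per task."""
    total_layers: int   # Transformer blocks each task passes through
    shared_layers: int  # leading blocks whose weights both tasks share

    def __post_init__(self):
        if not 0 <= self.shared_layers <= self.total_layers:
            raise ValueError("shared_layers must lie in [0, total_layers]")

    def layer_path(self, task: str) -> list[str]:
        """Return the ordered block names that a given task passes through."""
        if task not in ("understanding", "generation"):
            raise ValueError(f"unknown task: {task}")
        shared = [f"shared.{i}" for i in range(self.shared_layers)]
        branch = [f"{task}.{i}"
                  for i in range(self.shared_layers, self.total_layers)]
        return shared + branch

    def unique_blocks(self) -> int:
        """Count the distinct blocks stored across both tasks."""
        task_specific = self.total_layers - self.shared_layers
        return self.shared_layers + 2 * task_specific


def sweep(total_layers: int) -> list[tuple[int, int]]:
    """Enumerate (shared_layers, unique_blocks) over the whole design space."""
    return [(k, ForkedBackboneSpec(total_layers, k).unique_blocks())
            for k in range(total_layers + 1)]
```

Setting `shared_layers` equal to `total_layers` gives a fully shared backbone, and setting it to 0 gives two fully decoupled ones. The open question is where between these extremes representational conflicts are avoided without giving up the benefits of sharing, which a sweep like this can enumerate but not answer without training and evaluation.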
References
Despite recent progress, the optimal architectural design for such unified models remains an open challenge.
— UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
(2506.17202 - Li et al., 20 Jun 2025) in Abstract (page 1)