Optimal Architectural Design for Unified Multimodal Understanding and Generation

Determine the optimal architectural design for unified multimodal models that jointly perform image understanding and text-to-image generation, specifying how to balance shared versus task-specific Transformer components to avoid representational conflicts while maintaining strong performance on both tasks.

Background

The paper investigates how modality alignment between image and text features evolves across Transformer layers for two tasks: image understanding and text-to-image generation. It finds that understanding benefits from progressively increasing alignment with depth, whereas generation prefers strong early alignment followed by reduced coupling to recover spatial details. Fully shared backbones under next-token prediction tend to force a compromise that harms one or both tasks.
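The layer-wise alignment analysis described above can be sketched as follows. This is a minimal illustration, not the paper's actual measurement code: it assumes per-layer image and text token features are already extracted, mean-pools each modality, and scores alignment with cosine similarity (the function name `layer_alignment` and the pooling choice are assumptions for illustration).

```python
import numpy as np

def layer_alignment(image_feats, text_feats):
    """Per-layer cosine similarity between pooled image and text features.

    image_feats, text_feats: lists of (num_tokens, dim) arrays, one per layer.
    Returns one alignment score per layer, in [-1, 1].
    """
    scores = []
    for img, txt in zip(image_feats, text_feats):
        # Mean-pool tokens to a single vector per modality (an assumed choice).
        i = img.mean(axis=0)
        t = txt.mean(axis=0)
        cos = np.dot(i, t) / (np.linalg.norm(i) * np.linalg.norm(t) + 1e-8)
        scores.append(float(cos))
    return scores

# Toy example: 4 layers of random features for 16 image and 8 text tokens.
rng = np.random.default_rng(0)
img_feats = [rng.normal(size=(16, 32)) for _ in range(4)]
txt_feats = [rng.normal(size=(8, 32)) for _ in range(4)]
scores = layer_alignment(img_feats, txt_feats)
```

Plotting such scores against layer depth is one way to visualize the rising-alignment pattern for understanding versus the rise-then-fall pattern the paper reports for generation.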

Existing unified approaches either integrate diffusion objectives, use external decoders, or decouple visual encoders, often increasing complexity and failing to expose the intrinsic relationship between understanding and generation. Motivated by these insights, the authors propose UniFork, a Y-shaped architecture that shares shallow layers and splits into task-specific branches in deeper layers, but they explicitly note that determining the optimal architecture for unified modeling remains an open challenge.
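The Y-shaped layout can be made concrete with a small sketch. This is not UniFork's implementation: random linear-plus-tanh maps stand in for full Transformer blocks, and the class and method names (`YShapedModel`, `forward`) are assumptions; the point is only the routing of shared shallow layers into task-specific deep branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(dim):
    # A random linear map with tanh, standing in for a full Transformer block.
    W = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: np.tanh(x @ W)

class YShapedModel:
    """Shared shallow trunk that forks into understanding / generation branches."""
    def __init__(self, dim=32, shared_depth=2, branch_depth=2):
        self.shared = [block(dim) for _ in range(shared_depth)]
        self.und_branch = [block(dim) for _ in range(branch_depth)]
        self.gen_branch = [block(dim) for _ in range(branch_depth)]

    def forward(self, x, task):
        for blk in self.shared:      # shallow layers shared by both tasks
            x = blk(x)
        branch = self.und_branch if task == "understanding" else self.gen_branch
        for blk in branch:           # deep, task-specific layers
            x = blk(x)
        return x

model = YShapedModel()
tokens = rng.normal(size=(10, 32))
und_out = model.forward(tokens, "understanding")
gen_out = model.forward(tokens, "generation")
```

The design choice encoded here matches the motivation above: early layers learn the cross-task alignment both tasks want, while the fork lets deep layers specialize, so generation can reduce coupling without degrading understanding.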

References

Despite recent progress, the optimal architectural design for such unified models remains an open challenge.

UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation (Li et al., arXiv:2506.17202, 20 Jun 2025), Abstract (page 1)