Optimal interface between multimodal LLMs and diffusion generators
Determine how best to connect pretrained multimodal large language models (MLLMs) to diffusion-based visual generative models, so that the MLLM’s reasoning and planning can exert fine-grained spatial and temporal control over the diffusion process instead of being limited to global conditioning.
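To make the contrast between global and fine-grained conditioning concrete, here is a minimal toy sketch. It is not the MetaCanvas method or any library's API: the function names, the additive injection, and the `[frame][row][col]` latent layout are all assumptions chosen purely for illustration.

```python
from typing import List

Vector = List[float]
# Hypothetical latent layout: latents[frame][row][col] is a channel vector.
Latents = List[List[List[Vector]]]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length channel vectors."""
    if len(a) != len(b):
        raise ValueError("channel dimensions must match")
    return [x + y for x, y in zip(a, b)]


def condition_globally(latents: Latents, signal: Vector) -> Latents:
    """Global conditioning: one vector is broadcast to every frame and location."""
    return [[[add_vectors(cell, signal) for cell in row] for row in frame]
            for frame in latents]


def condition_per_location(latents: Latents, canvas: Latents) -> Latents:
    """Fine-grained conditioning: each (frame, row, col) receives its own vector.

    Assumes `canvas` has the same shape as `latents`.
    """
    return [[[add_vectors(cell, hint) for cell, hint in zip(row, hint_row)]
             for row, hint_row in zip(frame, hint_frame)]
            for frame, hint_frame in zip(latents, canvas)]


def zeros(frames: int, rows: int, cols: int) -> Latents:
    """Build an all-zero, single-channel latent grid."""
    return [[[[0.0] for _ in range(cols)] for _ in range(rows)]
            for _ in range(frames)]


# Toy latents: 2 frames, each a 2x2 grid with one channel.
latents = zeros(2, 2, 2)

# Global signal: every location in every frame is shifted identically.
global_out = condition_globally(latents, [1.0])

# Per-location canvas: only the top-left cell of the second frame is touched.
canvas = zeros(2, 2, 2)
canvas[1][0][0] = [1.0]
local_out = condition_per_location(latents, canvas)
```

In the global case the conditioning signal cannot distinguish one region or frame from another, whereas the per-location canvas can target a single cell at a single time step. What form such a canvas should take, and how an MLLM should produce it, is the question this problem leaves open.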
References
The optimal interface between (M)LLMs and diffusion models remains unclear.
— Exploring MLLM-Diffusion Information Transfer with MetaCanvas
(2512.11464 - Lin et al., 12 Dec 2025) in Section 1 (Introduction)