Optimal interface between multimodal LLMs and diffusion generators

Determine the optimal interface for connecting pretrained multimodal large language models (MLLMs) to diffusion-based visual generative models so that the MLLMs’ reasoning and planning can provide fine-grained spatial and temporal control beyond global conditioning during the diffusion process.

Background

The paper reviews the prevailing LLM+Diffusion paradigm, where the (M)LLM processes contextual signals and its outputs are injected into a separate diffusion generator, typically as a global condition. While effective for guidance, this global conditioning applies one signal uniformly across the whole output, so it offers little explicit spatial or temporal control, which structured generation requires.
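
To make the limitation concrete, the sketch below shows what global conditioning typically looks like in code: a single pooled (M)LLM embedding modulates every latent patch identically, so no patch-specific instruction can be expressed. This is a minimal illustration under assumed module names and dimensions, not any particular model's implementation.

import torch
import torch.nn as nn

class GloballyConditionedDenoiser(nn.Module):
    """Illustrative denoiser block: one pooled (M)LLM embedding is applied
    uniformly to every latent patch via AdaLN-style scale/shift.
    All names and sizes here are hypothetical."""

    def __init__(self, latent_dim=64, cond_dim=4096):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * latent_dim)
        self.block = nn.Sequential(
            nn.LayerNorm(latent_dim),
            nn.Linear(latent_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, z, llm_embedding):
        # z: (B, N_patches, latent_dim) noisy diffusion latents
        # llm_embedding: (B, cond_dim) pooled summary of the (M)LLM output
        scale, shift = self.to_scale_shift(llm_embedding).chunk(2, dim=-1)
        # The same scale/shift modulates every patch: no per-patch control.
        h = self.block(z) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return z + h  # schematic residual update for one denoising step

Because the scale and shift are shared across all patches, the (M)LLM can steer global attributes such as style or subject, but it cannot place content at a specific location or frame.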

Motivated by this limitation, the authors introduce MetaCanvas, an approach that allows MLLMs to plan in latent spatial and spatiotemporal canvases and inject patch-wise priors into diffusion latents. The broader question the authors raise is how best to interface MLLMs with diffusion models to fully exploit the MLLMs’ reasoning and planning capabilities.
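
For contrast, the following sketch illustrates one way patch-wise priors could be injected into diffusion latents: the MLLM emits a plan vector per canvas position, and each latent patch is fused with the prior at the matching position. This is an illustrative interpretation under assumed shapes and names, not MetaCanvas's actual architecture.

import torch
import torch.nn as nn

class PatchwisePriorInjection(nn.Module):
    """Illustrative alternative: a spatial (or spatiotemporal) canvas of
    per-patch plan vectors is projected and added to the diffusion latent
    grid position by position. Hypothetical names and dimensions."""

    def __init__(self, latent_dim=64, plan_dim=4096):
        super().__init__()
        self.project = nn.Linear(plan_dim, latent_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # learned injection strength

    def forward(self, z, plan_canvas):
        # z:           (B, N_patches, latent_dim) noisy diffusion latents
        # plan_canvas: (B, N_patches, plan_dim) per-patch MLLM plan, aligned
        #              with the latent grid (flattened H*W, or T*H*W for video)
        prior = self.project(plan_canvas)
        # Each latent patch receives its own prior, giving explicit spatial
        # (and, with a time axis, temporal) control.
        return z + torch.tanh(self.gate) * prior

Initializing the gate at zero is a common fine-tuning choice (as in zero-initialized control branches), so the injection starts as a no-op and the pretrained diffusion behavior is preserved early in training.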

References

The optimal interface between (M)LLMs and diffusion models remains unclear.

Exploring MLLM-Diffusion Information Transfer with MetaCanvas (arXiv:2512.11464, Lin et al., 12 Dec 2025), Section 1 (Introduction)