WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens (2512.02536v1)

Published 2 Dec 2025 in cs.CV

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods that use a fixed number of learnable query tokens are computationally efficient, they suffer from task generalization collapse: they fail to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and the Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm that our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
