ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement (2504.01934v2)

Published 2 Apr 2025 in cs.CV

Abstract: We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to handle the three fundamental capabilities simultaneously: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization; due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mLLM.github.io/.

An Overview of ILLUME+: Advancements in Unified Multimodal LLMs

The paper "ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement" presents ILLUME+, an enhanced multimodal LLM (MLLM) that leverages dual visual tokenization and a diffusion decoder. It builds on previous efforts to integrate visual understanding, generation, and editing within a single framework, addressing the limitations of existing models such as Chameleon, EMU3, and LaViT.

Methodological Innovations

The core of ILLUME+ is its DualViTok, a dual-branch vision tokenizer that captures both semantic and texture features. The semantic branch, employing a pre-trained vision encoder, focuses on extracting deep semantic information, while the pixel branch ensures the preservation of fine-grained textures using a MoVQGAN-based architecture. This combination marks a notable advancement over preceding models, which often sacrificed texture detail for semantic coherence or vice versa.
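To make the dual-branch idea concrete, the following PyTorch sketch shows one way such a tokenizer could be wired up: a (typically pretrained) semantic encoder feeding one quantizer, and a lightweight convolutional pixel encoder feeding another, with both discrete token streams returned to the MLLM. The module names, feature dimensions, and the simple nearest-neighbour quantizer are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a dual-branch vision tokenizer in the spirit of DualViTok.
# All shapes, names, and the nearest-neighbour quantizer are illustrative.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour lookup into a learnable codebook."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, N, dim) continuous features -> (B, N) discrete token ids
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, K)
        return dists.argmin(dim=-1)


class DualBranchTokenizer(nn.Module):
    def __init__(self, semantic_encoder: nn.Module, dim: int = 256,
                 semantic_codebook: int = 8192, pixel_codebook: int = 8192):
        super().__init__()
        # Semantic branch: a (typically pretrained) vision encoder producing
        # text-aligned features, followed by its own quantizer.
        self.semantic_encoder = semantic_encoder
        self.semantic_quant = VectorQuantizer(semantic_codebook, dim)
        # Pixel branch: a lightweight conv encoder (stand-in for a
        # MoVQGAN-style encoder) that keeps fine-grained texture.
        self.pixel_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2),
        )
        self.pixel_quant = VectorQuantizer(pixel_codebook, dim)

    def forward(self, image: torch.Tensor):
        # Semantic tokens capture "what is in the image".
        sem_ids = self.semantic_quant(self.semantic_encoder(image))  # (B, N_s)
        # Pixel tokens capture "how it looks" (texture, layout).
        pix_feat = self.pixel_encoder(image).flatten(2).transpose(1, 2)
        pix_ids = self.pixel_quant(pix_feat)                         # (B, N_p)
        return sem_ids, pix_ids   # both streams are fed to the MLLM


# Usage with a stand-in semantic encoder (a real system would use a
# pretrained, text-aligned ViT such as a CLIP-style encoder).
class ToySemanticEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, image):
        return self.proj(image).flatten(2).transpose(1, 2)   # (B, N_s, dim)


tokenizer = DualBranchTokenizer(ToySemanticEncoder())
sem_ids, pix_ids = tokenizer(torch.randn(1, 3, 256, 256))
print(sem_ids.shape, pix_ids.shape)   # torch.Size([1, 256]) torch.Size([1, 256])
```

In the paper's coarse-to-fine scheme, the semantic tokens act as the coarse description while the pixel tokens supply the fine detail needed for faithful reconstruction and editing.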

Additionally, ILLUME+ employs a diffusion model as an image detokenizer. This strategic inclusion not only enhances the quality of generated images but also supports efficient super-resolution, thereby addressing the token explosion issue common in autoregressive high-resolution generation tasks.
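The detokenization step can be pictured as conditional denoising: the discrete pixel tokens are looked up back into continuous embeddings, and a diffusion decoder synthesizes an image conditioned on them, optionally at a higher resolution than the tokenizer operated at. The sketch below is a deliberately simplified stand-in (toy denoiser, placeholder sampling loop), not the paper's decoder architecture or noise scheduler.

```python
# Illustrative flow for a diffusion model used as the image detokenizer.
import torch
import torch.nn as nn


class ToyDiffusionDecoder(nn.Module):
    """Toy conditional denoiser: predicts noise from a noisy image and the
    de-quantized token embeddings (timestep conditioning omitted for brevity)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cond_proj = nn.Linear(dim, 3)
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)

    def forward(self, x_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, N, dim) token embeddings, pooled and broadcast spatially
        c = self.cond_proj(cond.mean(dim=1))[:, :, None, None].expand_as(x_t)
        return self.net(torch.cat([x_t, c], dim=1))


@torch.no_grad()
def detokenize(token_ids, codebook, decoder, out_size=512, steps=25):
    """Map discrete pixel tokens back to an image via iterative denoising.
    Sampling at an `out_size` above the tokenizer's native resolution is how a
    diffusion detokenizer can double as an efficient super-resolution stage."""
    cond = codebook(token_ids)                                  # (B, N, dim)
    x = torch.randn(token_ids.size(0), 3, out_size, out_size)   # start from noise
    for _ in range(steps):
        eps = decoder(x, cond)        # predicted noise
        x = x - eps / steps           # placeholder update, not a real DDPM/DDIM step
    return x.clamp(-1, 1)


# Example: decode a 16x16 grid of pixel tokens to a 512x512 image.
codebook = nn.Embedding(8192, 256)
decoder = ToyDiffusionDecoder(dim=256)
image = detokenize(torch.randint(0, 8192, (1, 256)), codebook, decoder)
```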

The unified approach of ILLUME+ is further reinforced by its progressive training procedure. This method supports dynamic resolution inputs across the vision tokenizer, MLLM, and diffusion decoder, enabling flexible and efficient context-aware operations such as image editing and high-fidelity image synthesis. The model’s capability to operate at resolutions of up to 1024×1024 is particularly noteworthy.
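A progressive, resolution-staged schedule of this kind can be expressed as a small driver that freezes and unfreezes components per stage while the data pipeline switches resolution. The stage boundaries, step counts, component names, and the assumption that `model(**batch)` returns a scalar loss are illustrative, not the paper's actual recipe.

```python
# Sketch of a progressive, resolution-staged training driver.
import torch


def run_stage(model, make_loader, resolution, trainable, steps, lr=1e-4):
    """Freeze everything except parameters whose names start with a `trainable`
    prefix, then train for `steps` iterations at the given resolution."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(t) for t in trainable)
    optim = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    loader = make_loader(resolution)          # dynamic-resolution data pipeline
    for _, batch in zip(range(steps), loader):
        loss = model(**batch)                 # assumed: next-token loss over text + image tokens
        loss.backward()
        optim.step()
        optim.zero_grad()


def progressive_train(model, make_loader):
    # Later stages reuse earlier weights and raise the working resolution.
    for resolution, trainable, steps in [
        (256,  ["vision_tokenizer", "projector"], 10_000),   # alignment
        (512,  ["projector", "llm"],              50_000),   # joint und./gen. training
        (1024, ["llm", "diffusion_decoder"],      20_000),   # high-resolution refinement
    ]:
        run_stage(model, make_loader, resolution, trainable, steps)
```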

Empirical Results

Empirical evaluations underscore ILLUME+'s competitive standing against other unified MLLMs and specialized models across several benchmarks. The model performs well on both multimodal understanding and generation tasks, with clear gains in areas such as document image understanding, which demand a high degree of comprehension and context retention.

On standard multimodal understanding tasks, ILLUME+ matches or surpasses state-of-the-art models such as Janus-Pro-7B and ILLUME-7B, particularly in document-oriented evaluations. This reflects the model's robust understanding capabilities, afforded by the semantic branch of its dual tokenization scheme.

In generation benchmarks, ILLUME+ achieves an FID score of 6.00 on the MJHQ-30K dataset, indicative of strong image generation quality and diversity compared to both specialized generation models and other unified models. This performance is attributed to the high level of detail preservation and semantic alignment facilitated by the dual tokenizer and diffusion decoder.
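For context, FID is the Fréchet distance between Gaussians fitted to Inception features of real and generated images; lower is better. The sketch below shows the standard computation (feature extraction omitted; `real_feats` and `gen_feats` are assumed to be Inception-v3 activations), not anything specific to ILLUME+ or the MJHQ-30K evaluation pipeline.

```python
# Standard FID computation given precomputed (N, 2048) Inception features.
import numpy as np
from scipy.linalg import sqrtm


def frechet_inception_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):       # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```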

Implications and Future Directions

The theoretical implications of ILLUME+ are profound. By unifying understanding, generation, and editing in a scalable framework, the model not only bridges gaps between visual and textual modalities but also paves the way for exploring artificial general intelligence (AGI) through synergistic multimodal interactions.

Practically, ILLUME+ offers a scalable foundation for a myriad of applications requiring high-resolution image synthesis and complex image-text task coordination, such as virtual environments, automated content creation, and advanced visual data analysis.

Future research directions may explore scaling ILLUME+ to larger model parameters, refining training methodologies to balance task optimization further, and incorporating real-world datasets to bolster model robustness. Enhancements in dataset composition and task diversity could yield even richer semantic interactions, thereby expanding the horizons of multimodal AI applications.

In summary, the paper presents ILLUME+ as a significant stride towards more intelligent and flexible multimodal LLMs, maintaining a meticulous balance between semantic comprehension and visual fidelity.

Authors (11)
  1. Runhui Huang (18 papers)
  2. Chunwei Wang (13 papers)
  3. Junwei Yang (17 papers)
  4. Guansong Lu (20 papers)
  5. Yunlong Yuan (4 papers)
  6. Jianhua Han (49 papers)
  7. Lu Hou (50 papers)
  8. Wei Zhang (1489 papers)
  9. Lanqing Hong (72 papers)
  10. Hengshuang Zhao (117 papers)
  11. Hang Xu (204 papers)