Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Published 8 Aug 2025 in cs.CV, cs.AI, and cs.CL | (2508.05954v1)

Abstract: There is growing interest in integrating high-fidelity visual synthesis capabilities into LLMs without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper proposes a bifurcated architecture that bridges pretrained MLLMs and diffusion models using patch-level CLIP latents for efficient image generation.
The method replaces VAE-based latents with native CLIP embeddings, enhancing spatial control and reducing computational overhead.
Empirical results demonstrate state-of-the-art image reconstruction quality with fewer decoding steps and scalable performance.

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Introduction

The integration of visual generation capabilities into LLMs posits significant computational challenges, primarily due to an LLM's lack of exposure to image representations during pretraining. "Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents" proposes a solution by introducing a framework that merges pretrained multimodal LLMs (MLLMs) and diffusion models through the use of patch-level CLIP image embeddings. These embeddings, aligned with the MLLM's CLIP visual encoder, serve as latent representations that bridge the two model types efficiently, enhancing both training ease and final image generation quality.

Figure 1: Comparison of different approaches in using LLM for image generation. (a) Single architecture handling both text and image tokens into the LLM/MLLM. (b) Bridging LLM and Diffusion model with a 1D sequence (image tokens or text tokens). (c) Bifrost-1, which bridges MLLM with diffusion models with 2D image tokens aligned with MLLM embeddings.

Framework Overview

Bifrost-1 employs a bifurcated architecture. The framework augments a backbone MLLM with a visual generation branch, consisting of a trainable copy of the MLLM’s parameters and a newly added vision head. The framework utilizes patch-level CLIP image latents, enabling high-fidelity and controllable image generation without significant degradation of the MLLM's existing reasoning capabilities.

Figure 2: Overview of Bifrost-1, showcasing the integration of visual generation and its guiding role via latent ControlNet.

The visual generation branch is trained to predict masked portions of images, utilizing a downsampling convolution block to prepare these embeddings for input into a latent ControlNet, a lightweight adaptation intended to guide the diffusion model more efficiently.

Implementation Details

Visual Encoder Alignment

The MLLM’s native visual encoder (CLIP) is the linchpin of Bifrost-1, enabling seamless integration through patch-level embeddings. By replacing VAE-based latents with these native embeddings, the system streamlines training, achieving efficient alignment without the need for extensive realignment or recalibration of existing pretrained models.

Latent ControlNet

ControlNet adds structural reinforcement to the diffusion model by embedding spatial controls directly into image generation. By leveraging patch-level CLIP latents as a medium for conveying spatial guidance, the framework maintains consistent multimodal reasoning abilities while improving the fidelity of generated visuals.

Results

Empirical tests illustrate that Bifrost-1 outperforms current unified multimodal models, maintaining or exceeding state-of-the-art benchmarks at a fraction of the computational cost. Bifrost-1 demonstrates excellent scalability and efficiency, evidenced by its superior performance across multiple image reconstruction quality metrics.

Figure 3: Image reconstruction scores with different numbers of 2D CLIP latent tokens used within Bifrost-1 on ImageNet.

Figure 4: Visual samples for image reconstruction with different numbers of patch-level CLIP tokens, highlighting the visual fidelity achieved.

Comparison with Baselines

Among various architecture choices, patch-level CLIP latents stand out as significantly more efficient than prior methods using learnable query tokens or external encoders. The integration enables substantial improvements in image quality metrics like FID and sFID while reducing the reliance on computationally expensive operations.

Decoding Steps and Efficiency

Crucial to Bifrost-1's application potential is its robust performance across different MLLM decoding step configurations. It maintains high visual output quality with a minimum of 8 decoding steps, thus offering flexibility for inference time-optimized deployments.

Conclusion

Bifrost-1 presents an advanced yet efficient architecture for integrating image generation into multimodal frameworks. By effectively leveraging pretrained models and aligning them with patch-level CLIP visual latents, this framework promises enhanced visual synthesis capabilities whilst preserving the computational efficiency crucial for scalable deployment. Future work can further explore the potential of larger datasets and more advanced MLLM backbones to fully exploit Bifrost-1's architectural advantages.

Markdown Report Issue