- The paper introduces Lavender, a supervised fine-tuning method that aligns Vision-Language Model (VLM) attention with Diffusion Model attention to improve text-vision alignment.
- Lavender leverages the more precise word-to-region attention learned by diffusion models, transferring this knowledge to VLMs during training to enhance visual understanding.
- The method demonstrates significant performance gains on standard VLM benchmarks (up to 30%) and shows strong out-of-distribution generalization, including a 68% increase on the WorldMedQA medical benchmark.
The paper "Diffusion Instruction Tuning" introduces Lavender, a novel supervised fine-tuning (SFT) method designed to improve the performance of vision-LLMs (VLMs) by leveraging image generation models, specifically Stable Diffusion. The key idea is to align the text-vision attention mechanism within the VLM's transformer layers with that of Stable Diffusion during SFT, rather than independently adapting separate encoders. The authors posit that this alignment enhances the model’s visual understanding, leading to improved performance across both in- and out-of-distribution tasks.
The method hinges on the observation that diffusion models (DMs), which reconstruct images at the pixel level, learn more precise text-vision attention maps than VLMs trained solely for text token generation. Lavender transfers these high-quality cross-attention maps from DMs to guide the text-vision attention in VLMs during SFT, thereby improving word-to-region alignment. To mitigate catastrophic forgetting, the paper introduces attention aggregation methods and training strategies.
The authors conducted experiments on a small OpenFlamingo model, demonstrating that Lavender effectively aligns VLM attention with DM attention. Further experiments on Llama 3.2-11B, fine-tuned on in- and out-of-distribution data, showed performance improvements of up to 30% across 19 benchmarks. A 68% increase in performance was observed on the WorldMedQA medical benchmark. Ablation studies indicated that the method of attention aggregation and the choice of layers for fine-tuning are critical for performance.
Here is a more detailed breakdown:
- Introduction: The paper addresses the challenge of training frontier foundation models from scratch, which is computationally expensive, especially in multimodal settings where paired image-text datasets are scarce. The authors propose Lavender as a solution that leverages the abundant text-only pretraining of LLMs and adjusts bridging layers or additional encoders with limited image-text pairs.
- Related Work: The paper discusses the evolution of VLMs, starting with Flamingo, which uses separate encoders for images and text, unified through a perceiver resampler. It also highlights the challenges of training VLMs with dedicated cross-attention modules due to the substantial data and computational resources required. The paper reviews approaches that leverage instruction fine-tuning on scaled LLMs using visual question answering (VQA) data, aligning text and image tokens through fine-tuning connectors such as MLPs, encoders, or decoders. The paper also discusses recent attempts to integrate DMs and VLMs, noting that one overlooked aspect is the role of self- and cross-attention layers within DM and LLM Transformers.
- Diffusion Instruction Tuning (Lavender): The method aims to enhance a pretrained VLM by leveraging attention distributions from a pretrained DM. The authors hypothesize that an ideal attention distribution maximizes VLM performance, and that the DM's attention lies closer to this ideal than the VLM's own. The core idea is to update the VLM parameters $\theta$ so that the model not only performs well on its primary task but also aligns its attention mechanism with that of the DM, by minimizing the loss

  $$\mathcal{L}_{\text{total}}(\theta) = -\sum_i \log p\big(y_l^{(i)} \mid x^{(i)}, y_q^{(i)}; \theta\big) + \lambda \sum_i \delta^{(i)}(\theta)$$

  where:
  - $\mathcal{L}_{\text{total}}(\theta)$ is the total loss.
  - $p(y_l^{(i)} \mid x^{(i)}, y_q^{(i)}; \theta)$ is the probability of the $i$-th target token $y_l^{(i)}$ (label) given the input image $x^{(i)}$, the question $y_q^{(i)}$, and the VLM parameters $\theta$.
  - $\delta^{(i)}(\theta)$ is the pointwise difference between the VLM's and the DM's attention distributions for the $i$-th data point.
  - $\lambda$ balances attention alignment against the primary task.
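As a concrete illustration, here is a minimal PyTorch sketch of this objective, assuming the per-word attention maps have already been extracted from both models and resized to a common resolution (the function and argument names are illustrative, not the authors' implementation):

```python
import torch.nn.functional as F

def lavender_loss(logits, labels, vlm_attn, dm_attn, lam=1.0):
    """Combined objective: autoregressive task loss plus attention alignment.

    logits:   (batch, seq_len, vocab) VLM output logits
    labels:   (batch, seq_len)        target token ids (-100 = ignore)
    vlm_attn: (batch, words, H, W)    aggregated VLM per-word attention
    dm_attn:  (batch, words, H, W)    diffusion-model per-word attention
    """
    # Primary task: standard next-token cross-entropy (the -log p term).
    task_loss = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    # Alignment term: a pointwise penalty on the gap between the two
    # attention maps (MSE here stands in for the paper's delta term).
    attn_loss = F.mse_loss(vlm_attn, dm_attn)
    return task_loss + lam * attn_loss
```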
- Attention Alignment: The paper discusses how to compute per-word attention in VLMs and DMs, noting that although both employ attention to capture vision-text interplay, their attention aggregation differs.
- Attention Aggregation in Diffusion Models: Text-guided diffusion models generate images from textual input by iteratively denoising a random-noise image. During each denoising step, cross-attention layers enable the model to focus on relevant textual tokens. The resulting per-word attention distributions $p_{\text{DM}}(a \mid x, y; \theta_D)$ indicate the salient image regions for each token.
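As an illustration, a sketch of this aggregation, assuming the cross-attention weights have already been captured (e.g., via forward hooks on the UNet's cross-attention layers) as one tensor per denoising step; `token_span` is a hypothetical argument grouping the sub-word tokens of a single word:

```python
import torch

def aggregate_dm_attention(step_maps, token_span):
    """Average Stable Diffusion cross-attention into one per-word map.

    step_maps:  list of (heads, H*W, num_text_tokens) tensors, one per
                denoising step, captured from a cross-attention layer
    token_span: indices of the sub-word tokens that make up one word
    """
    stacked = torch.stack(step_maps)             # (steps, heads, H*W, tokens)
    per_word = stacked[..., token_span].sum(-1)  # merge sub-word tokens
    sal = per_word.mean(dim=(0, 1))              # average steps and heads
    side = int(sal.numel() ** 0.5)               # assumes a square latent grid
    sal = sal.view(side, side)                   # restore spatial layout
    return sal / (sal.sum() + 1e-8)              # normalize to a distribution
```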
- Attention Aggregation in Vision-Language Models: VLMs use transformer attention to connect text tokens with image patch tokens across multiple heads and layers, forming attention weights $w_{hl}(t, p)$ for text token $t$ and image patch $p$ at head $h$ of layer $l$. To create per-word saliency maps, the paper aggregates attention across all heads and layers and reshapes the result into a grid that restores the spatial layout of the original image. The aggregation itself is done in several ways: simple pooling functions (mean/max), attention flow, and learned aggregation; a minimal pooling sketch follows.
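For the simple pooling variant, a minimal sketch, assuming the stacked attention weights have already been extracted from the model (`grid_hw` is an assumed argument giving the patch-grid shape):

```python
import torch

def aggregate_vlm_attention(attn, grid_hw):
    """Mean-pool VLM text-to-image attention into per-word saliency maps.

    attn:    (layers, heads, text_tokens, image_patches) cross-attention
             weights w_hl(t, p) stacked over layers and heads
    grid_hw: (rows, cols) patch-grid shape of the original image
    """
    pooled = attn.mean(dim=(0, 1))    # aggregate heads and layers
    maps = pooled.view(-1, *grid_hw)  # one 2-D map per text token
    # Normalize each map to [0, 1]; max pooling is the analogous
    # one-liner attn.amax(dim=(0, 1)).
    return maps / (maps.amax(dim=(-2, -1), keepdim=True) + 1e-8)
```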
- Aligner Network: To improve attention alignment between the VLM and the DM, the paper introduces a lightweight Aligner network. This network refines the parallel (or aggregated) attention $A_d$ into a single-channel map, making it directly comparable to the DM's attention $p_{\text{DM}}(a \mid x, y; \theta_D)$.
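The paper describes the Aligner only as lightweight and configurable; a plausible minimal instantiation is a small convolutional head (the architecture below is an assumption for illustration, not the paper's exact design):

```python
import torch.nn as nn

class Aligner(nn.Module):
    """Refine stacked VLM attention maps into a single-channel map that is
    directly comparable to the DM's attention. A minimal sketch; the paper
    leaves the exact architecture configurable.
    """
    def __init__(self, in_channels, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, 1, kernel_size=1),  # collapse to one channel
            nn.Softplus(),                        # keep the map non-negative
        )

    def forward(self, x):   # x: (batch, in_channels, H, W)
        return self.net(x)  # (batch, 1, H, W)
```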
- Implementation: The paper integrates Lavender with three VLMs, the cross-attention models OpenFlamingo and Llama 3.2-11B-Vision-Instruct and the self-attention model MiniCPM-Llama3-V 2.5, using Stable Diffusion v1.4 to provide per-word attention targets.
- Training Strategies and Dataset Preparation: The paper adopts several strategies to stabilize the alignment objective and preserve the VLM's pretrained capabilities: pretraining the Aligner network, attention aggregation and normalization, a configurable Aligner network, parameter-efficient fine-tuning (PEFT), and sampling strategies (see the LoRA sketch below).
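For the PEFT component, a sketch using Hugging Face's `peft` library; the rank and target module names are illustrative and depend on the backbone:

```python
from peft import LoraConfig, get_peft_model  # pip install peft

def wrap_with_lora(vlm):
    """Attach LoRA adapters so only low-rank updates are trained,
    leaving the pretrained weights frozen."""
    config = LoraConfig(
        r=16,              # illustrative rank, not the paper's setting
        lora_alpha=32,
        lora_dropout=0.05,
        # Module names are backbone-specific; these are typical names
        # for Llama-style attention projections.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    vlm = get_peft_model(vlm, config)
    vlm.print_trainable_parameters()
    return vlm
```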
- Experiments and Results: The authors first validate Lavender on a small-scale setup with OpenFlamingo before scaling it to MiniCPMv2.5 and Llama 3.2-11B-Vision Instruct. They evaluate these models on 20 standard VLM benchmarks, comparing them against 23 baseline models. They also conduct data overlap analysis, investigate scaling behavior and training efficiency, and examine out-of-distribution generalization in medical QA benchmarks.
- Empirical Verification: The paper validates the hypothesis that DM attention distributions are more concentrated and more closely approximate an ideal posterior attention for VLMs. The results show that the DM's attention entropy is consistently lower and that Lavender aligns VLM attention with DM attention. Jointly minimizing the task loss $\mathcal{L}_{\text{VLM}}$ and the alignment loss $\mathcal{L}_{\text{att}}$ improves text generation.
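The entropy comparison reduces to computing Shannon entropy over each normalized attention map; a small helper, illustrative rather than the paper's code:

```python
import torch

def attention_entropy(attn_map, eps=1e-8):
    """Shannon entropy of a per-word attention map; lower entropy means
    the attention concentrates on fewer image regions."""
    p = attn_map.flatten()
    p = p / (p.sum() + eps)        # normalize to a probability distribution
    return -(p * (p + eps).log()).sum()

# The verification amounts to checking, per word, that
# attention_entropy(dm_map) < attention_entropy(vlm_map).
```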
- Scaled Results with Lavender: The paper scales experiments using the Llama 3-based MiniCPM-Llama3-V 2.5 and Llama 3.2-11B-Vision-Instruct implementations. Lavender is evaluated across 20 VLM benchmarks and compared against 23 baseline VLMs. The results show that Lavender improves performance on 16 out of 20 tasks by up to 4% on MiniCPM-Llama3-V 2.5, and outperforms autoregressive fine-tuning by up to 30% on 19 out of 20 benchmarks with LoRA on Llama 3.2-11B-Vision-Instruct.
- Data Overlap Analysis: The paper performs an ad-hoc qualitative analysis of the composition of the fine-tuning and benchmark datasets to estimate their potential overlap. The results suggest that Lavender's fine-tuning dataset scores at the lower end of the overlap range, indicating that the reported gains reflect genuine generalization rather than benchmark contamination.
- Scaling Behavior: The findings indicate that Lavender scales favorably with increased fine-tuning data and reduces overfitting.
- Severely Out-of-Distribution Medical Benchmark: The paper evaluates model generalization on WorldMedQA-V, a severely out-of-distribution medical benchmark, showing that Lavender improves Llama 3.2-11B's performance by 68%.
- Qualitative Results with Llama 3.2-11B: The results indicate that the aligned VLM attention maps generally correlate with the semantic regions of corresponding words in a manner similar to the Diffusion Model. Lavender demonstrates improved visual understanding compared to the original Llama-3.2 across various VQA tasks.
- Ablation and Analysis: The authors conducted extensive ablation studies to assess the key components of Lavender and their impact on performance. Key findings include that learned aggregation consistently outperforms manual methods, pretraining the Aligner significantly mitigates catastrophic forgetting, LoRA achieves better short-term results, and aligning all cross-attention layers in Llama-3.2-11B proves most effective.
- Failed Strategies: The paper discusses strategies that were tested and found less effective, including full fine-tuning without Aligner pretraining, frequent switching between training strategies, and mixing in extra data.
- Limitations and Future Work: The paper notes that the study was constrained by limited compute and outlines future directions: exploring higher-resolution diffusion models, extracting more accurate attention maps, and improving the handling of self-attention-only models.
- Broad Impacts: The paper discusses the potential broad impacts of Lavender, including synergy between models, addressing data scarcity and data privacy, aligning VLM attention with other vision foundation models, extending the approach to modalities beyond vision-language, and using attention alignment as vision feedback in reinforcement learning.
- Conclusion: The paper concludes that Lavender effectively aligns VLM attention with Diffusion Models, improving robustness across diverse domains. The authors suggest that this approach demonstrates potential for advancing text-vision alignment in multimodal LLMs and lays the groundwork for future research into efficient fine-tuning and cross-modal alignment techniques.