DimFusion Fusion Mechanism in Diffusion Models
- DimFusion is a fusion mechanism that injects intermediate LLM hidden states into diffusion models to improve prompt alignment.
- It employs channel-wise concatenation and tailored linear projections to merge text embeddings with image latents, maintaining low computational cost even for long captions.
- Empirical results indicate the mechanism significantly reduces step time and improves FID compared to traditional token-level fusion techniques.
The DimFusion fusion mechanism, as described in "Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions" (Gutflaish et al., 10 Nov 2025), enables efficient and expressive integration of long structured captions into transformer-based diffusion models. DimFusion is engineered to maximize textual coverage and fine-grained control by injecting intermediate hidden states from a lightweight LLM into the generative process, while avoiding the computational overhead associated with longer token sequences. This architecture supports input captions of up to approximately 1,800 tokens and is instrumental in the state-of-the-art performance of the FIBO text-to-image model.
1. Architecture and Data Flow
DimFusion operates as a connector between a lightweight LLM (e.g., SmolLM3-3B) and a diffusion-transformer architecture (such as a Stable-Diffusion-style UNet or a DiT). The mechanism proceeds as follows:
- The input caption, with potentially high token count $T$ (up to ~1,800), is encoded by the LLM, generating a stack of hidden states $h^{(1)}, \dots, h^{(L)}$ for layers $1, \dots, L$.
- A running text embedding $e_b \in \mathbb{R}^{T \times (d-c)}$ is maintained and injected into each diffusion block.
- At each block $b$, DimFusion retrieves a particular LLM layer $\ell(b)$, projects it to have $c$ channels, and concatenates it with $e_{b-1}$, resulting in $\tilde{e}_b \in \mathbb{R}^{T \times d}$ (where $d$ is the hidden width of the diffusion transformer block).
- This fused embedding is processed jointly with the current image latents by the diffusion block, which may be either a dual-stream (joint cross-attention) or single-stream (self-attention) transformer block.
- After processing, the $c$ extra channels introduced at this stage are discarded to obtain the updated embedding $e_b$ for the next block.
This design allows DimFusion to leverage both early and late LLM representations without increasing the effective token count, facilitating joint reasoning between textual and visual modalities.
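The per-block data flow above can be sketched in a few lines of NumPy. All dimensions here ($T$, $N$, $d$, $c$, $d_{\text{LLM}}$, $B$) are illustrative assumptions rather than values from the paper, and the `block` function is an identity stand-in for a diffusion transformer block, since only the tensor shapes matter for this sketch:

```python
import numpy as np

# Illustrative (not published) dimensions: T caption tokens, N image tokens,
# diffusion block width d, injected channel count c, LLM width d_llm, B blocks.
T, N, d, c, d_llm, B = 8, 16, 32, 8, 48, 4
rng = np.random.default_rng(0)

# Stand-in LLM hidden states for the selected layers, one per diffusion block.
llm_states = [rng.standard_normal((T, d_llm)) for _ in range(B)]

# Per-block projections to c channels (the Linear_b layers in the text).
W = [rng.standard_normal((d_llm, c)) * 0.02 for _ in range(B)]
b = [np.zeros(c) for _ in range(B)]

def block(text_tokens, image_tokens):
    """Placeholder for a diffusion transformer block (identity here)."""
    return text_tokens, image_tokens

e = rng.standard_normal((T, d - c))   # running text embedding e_0
x = rng.standard_normal((N, d))       # image latents

for i in range(B):
    z = llm_states[i] @ W[i] + b[i]           # (T, c) injected channels
    e_tilde = np.concatenate([e, z], axis=1)  # (T, d): matches block width
    y, x = block(e_tilde, x)                  # joint text/image processing
    e = y[:, : d - c]                         # slice off the c extra channels
```

Note that the token count $T$ never changes across blocks; only the channel dimension is temporarily widened and then restored.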
2. Fusion Formulation and Key Equations
The operational details of DimFusion are precisely defined by the following scheme:
- Initial Embedding Fusion:
$$e_0 = \left[h^{(L-1)} \,\|\, h^{(L)}\right] W_0 + b_0 \;\in\; \mathbb{R}^{T \times (d-c)},$$
with trainable projection parameters $W_0 \in \mathbb{R}^{2 d_{\text{LLM}} \times (d-c)}$, $b_0 \in \mathbb{R}^{d-c}$.
- Per-Block Channel Fusion:
- At block $b$, project the selected LLM output: $z_b = h^{(\ell(b))} W_b + b_b \in \mathbb{R}^{T \times c}$
- Concatenate with the previous embedding: $\tilde{e}_b = \left[e_{b-1} \,\|\, z_b\right] \in \mathbb{R}^{T \times d}$
- Jointly process with image tokens $x_{b-1}$: $(y_b,\, x_b) = \mathrm{Block}_b(\tilde{e}_b,\, x_{b-1})$
- Slice off the last $c$ channels to propagate the next embedding: $e_b = (y_b)_{:,\,1:d-c}$
At every diffusion block, the text representation is dynamically updated by integrating representations from different LLM layers, enabling both shallow ("surface form") and deep ("compositional meaning") semantic information to interact across the denoising trajectory.
3. Learnable Parameters and Model Modification
Relative to a baseline without intermediate fusion, DimFusion introduces minimal additional learnable parameters:
- A small projection $\mathrm{Linear}_0$ for initial embedding formation between the last two LLM layers.
- Separate per-block projection layers $\mathrm{Linear}_b$ for each selected LLM hidden state.
- Each weight matrix is in $\mathbb{R}^{d_{\text{LLM}} \times c}$, with a corresponding bias in $\mathbb{R}^{c}$.
- All other architecture components (diffusion transformer blocks) remain unchanged.
The overhead is $O(B \cdot d_{\text{LLM}} \cdot c)$ parameters in total, which is negligible compared to the overall model parameters in large-scale configurations (8B–20B parameters). This design choice ensures that the model benefits from deeper LLM-derived features while maintaining an efficient parameter footprint.
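A back-of-the-envelope check makes the "negligible overhead" claim concrete. The dimensions below are illustrative assumptions (a 3072-wide LLM, 512 injected channels, 40 diffusion blocks), not figures from the paper:

```python
# Per-block projection Linear_b: weight W_b (d_llm x c) plus bias b_b (c).
d_llm, c, B = 3072, 512, 40   # hypothetical dimensions for illustration

per_block = d_llm * c + c      # parameters per Linear_b
total_overhead = B * per_block # all per-block projections combined

model_params = 8e9             # the smallest large-scale configuration (8B)
fraction = total_overhead / model_params
print(f"{total_overhead:,} extra params, about {fraction:.2%} of an 8B model")
```

Even with these generous assumptions the projections add well under 1% of the total parameter count.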
4. Efficiency with Long Captions
DimFusion is explicitly designed to support long structured captions ($T$ up to ~1,800) without inflating quadratic attention costs. Standard self- and cross-attention scales as $O(T^2)$, where $T$ is the sequence length; typical token-wise fusion strategies such as TokenFusion double the number of text tokens ($T \to 2T$), leading to a computational cost proportional to $4T^2$ when text tokens dominate the sequence.
In contrast, DimFusion concatenates along the channel dimension, keeping $T$ fixed. The temporary increase in dimensionality (from $d-c$ to $d$ per token) does not affect attention scaling, preserving the baseline $O(T^2)$ cost. Empirical results on a 1B-parameter ablation demonstrate that DimFusion reduces per-step forward-and-backward time from 0.8 s (TokenFusion) to 0.5 s (a ~1.6× speedup), with equal or better FID (15.58 vs. 15.90). Thus, avoiding token-count growth is critical for scalability, especially when processing captions where $T$ exceeds 1,000.
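The scaling argument can be checked with simple arithmetic. The token counts below are illustrative assumptions (a long caption plus a hypothetical image-token budget), showing that doubling the text tokens more than doubles the quadratic attention cost while channel-wise fusion leaves it unchanged:

```python
# Joint-attention cost is quadratic in total sequence length (text + image).
T, N = 1800, 1024              # hypothetical text and image token counts

baseline = (T + N) ** 2        # single text stream
token_fusion = (2 * T + N) ** 2  # TokenFusion: text tokens doubled
dim_fusion = (T + N) ** 2      # DimFusion: token count unchanged

print(f"TokenFusion / DimFusion attention-cost ratio: "
      f"{token_fusion / dim_fusion:.2f}x")
```

The ratio grows as captions get longer relative to the image-token budget, which is why the gap is largest precisely in the long-caption regime the paper targets.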
5. Comparative Evaluation
The comparative analysis in (Gutflaish et al., 10 Nov 2025) reports the following performance and resource results:
| Conditioning Method | FID (COCO) | Step Time | Strengths/Weaknesses |
|---|---|---|---|
| No fusion | ~36.5 | Ref. | Only final LLM layer, slow convergence, weak control |
| TokenFusion | ~15.9 | 0.8 s | Access to deep layer features, but step time increased by ~1.6x |
| DimFusion | ~15.6 | 0.5 s | Best prompt alignment and controllability; ~1.6x faster |
DimFusion achieves the best trade-off between prompt alignment (evaluated by PRISM-Bench), disentanglement (under iterative JSON edits), and computational efficiency. By integrating intermediate LLM layers throughout the denoising process, it circumvents both the representational limitations of single-layer text embeddings and the prohibitive cost of token-level fusion. In practice, this enables state-of-the-art performance for models trained on extensive structured captions as demonstrated by FIBO.
6. Context and Significance in Text-to-Image Modeling
The development of DimFusion addresses the long-standing mismatch between the brevity of user prompts and the descriptive richness required for high-fidelity image synthesis. By efficiently incorporating information from all layers of the LLM, DimFusion enhances both expressivity and controllability, crucial for professional applications requiring precise semantic tuning.
In the context of the FIBO model, DimFusion supports training on structured captions with expansive attribute coverage, reflecting a design intended for open-source accessibility and large-scale deployment. The resulting improvements in prompt alignment and latent caption reconstruction, as substantiated by the TaBR protocol, mark a significant advancement in the field of text-to-image generation.
A plausible implication is that channel-wise fusion mechanisms such as DimFusion may generalize to other sequence-to-sequence tasks involving long and highly structured textual inputs, where inference efficiency and attribute disentanglement are critical.