DimFusion: Efficient Text-Image Fusion
- DimFusion is a cross-modal fusion mechanism that injects deep LLM features into each denoising block of diffusion transformers.
- It concatenates projected LLM representations with text embeddings and then slices them to maintain fixed sequence lengths while processing long captions.
- DimFusion improves computational efficiency, achieving roughly a 1.6× per-step speedup over token-level fusion, and enhances prompt adherence with only a small parameter overhead.
DimFusion is a cross-modal fusion mechanism designed to efficiently inject deep, structured text representations into text-to-image diffusion models, particularly those conditioned on very long captions (up to or beyond 1,000 words). Developed in the context of the FIBO model, DimFusion performs layer-wise injection of intermediate LLM features into each denoising block of a transformer-based diffusion model. Its key innovation is fusing these representations by increasing the embedding dimension rather than the sequence length, enabling efficient, expressive conditioning on long-form, attribute-rich textual inputs while keeping quadratic attention costs manageable.
1. DimFusion Architecture
DimFusion mediates the interaction between a lightweight LLM (e.g., SmolLM3-3B) and a diffusion transformer backbone (UNet-style or DiT-style). For each fusion-enabled diffusion block $b$, DimFusion operates as follows:
- The input text, tokenized into $L$ tokens, is represented by embeddings of dimension $D$, where $D$ is the diffusion model's hidden size.
- For each block $b$, the hidden states $H_{\ell(b)}$ from the LLM's $\ell(b)$-th layer are projected to $D$ dimensions via a learnable linear map.
- These projected representations $Z_b$ are concatenated with the prior block's text embeddings $T_{b-1}$ along the embedding dimension, creating a fused text feature of dimension $2D$.
- The fused text feature, together with the corresponding image latents $X_{b-1}$, is passed through the block's cross/self-attention mechanism.
- After the block, the text embeddings are sliced (retaining the first $D$ dimensions), restoring the text representation to its original dimensionality $D$ for the next block.
This per-block scheme is compatible with both dual-stream and single-stream diffusion transformer blocks, standard and advanced alike; a minimal sketch of one such step follows.
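The following is a minimal, self-contained PyTorch sketch of one DimFusion step, written against the notation above. All names here are illustrative assumptions rather than FIBO's actual implementation; in particular, `ToyJointBlock` stands in for a real diffusion block, and the way it reconciles the widened $2D$ text stream with the $D$-wide image stream (separate per-stream input projections feeding a shared single-head attention) is an assumed design, not one confirmed by the source.

```python
import math
import torch
import torch.nn as nn

class ToyJointBlock(nn.Module):
    """Stand-in diffusion block (hypothetical). Per-stream projections map the
    widened text stream (2D) and the image stream (D) to a shared attention
    width, attend jointly over the concatenated sequence, then map each
    stream back to its own width."""

    def __init__(self, d: int):
        super().__init__()
        self.t_qkv = nn.Linear(2 * d, 3 * d)  # text enters the block at width 2D
        self.x_qkv = nn.Linear(d, 3 * d)      # image latents stay at width D
        self.t_out = nn.Linear(d, 2 * d)      # text leaves the block at width 2D
        self.x_out = nn.Linear(d, d)

    def forward(self, t_fuse, x):
        L = t_fuse.shape[1]
        qkv = torch.cat([self.t_qkv(t_fuse), self.x_qkv(x)], dim=1)  # (batch, L+N, 3d)
        q, k, v = qkv.chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        h = attn @ v                                                  # (batch, L+N, d)
        return self.t_out(h[:, :L]), self.x_out(h[:, L:])

class DimFusionStep(nn.Module):
    """One DimFusion step: project the tapped LLM layer to width D, widen the
    text stream by embedding-axis concatenation, run the block, slice back."""

    def __init__(self, d_llm: int, d: int):
        super().__init__()
        self.proj = nn.Linear(d_llm, d)  # W_b and beta_b
        self.block = ToyJointBlock(d)
        self.d = d

    def forward(self, t_prev, x_prev, h_llm):
        z = self.proj(h_llm)                       # (batch, L, D)
        t_fuse = torch.cat([t_prev, z], dim=-1)    # (batch, L, 2D): widen, don't lengthen
        t_out, x_out = self.block(t_fuse, x_prev)  # sequence length stays L + N
        return t_out[..., : self.d], x_out         # slice text back to (batch, L, D)

step = DimFusionStep(d_llm=2048, d=256)            # toy sizes
t, x = step(torch.randn(2, 512, 256), torch.randn(2, 1024, 256),
            torch.randn(2, 512, 2048))
print(t.shape, x.shape)  # torch.Size([2, 512, 256]) torch.Size([2, 1024, 256])
```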
2. Mathematical Foundations and Fusion Operations
The DimFusion layer at block $b$ is mathematically formalized as:

$$
Z_b = H_{\ell(b)} W_b + \beta_b, \qquad
T_b^{\mathrm{fuse}} = \big[\, T_{b-1} \,\|\, Z_b \,\big], \qquad
(T_b', X_b) = \mathrm{Block}_b\big(T_b^{\mathrm{fuse}},\, X_{b-1}\big), \qquad
T_b = T_b'[:, :D],
$$

where $W_b \in \mathbb{R}^{d_{\mathrm{LLM}} \times D}$ (with bias $\beta_b$) is the block-specific projection and $[\cdot \,\|\, \cdot]$ denotes concatenation along the embedding axis.
In implementation, this corresponds to:
```python
Z_b = H_llm[b] @ Wp[b] + bp[b]                  # project tapped LLM layer: (L × D)
T_fuse = concat(T_prev, Z_b, axis=-1)           # fuse along embedding axis: (L × 2D)
T_out, X_out = DiffusionBlock(T_fuse, X_prev)   # sequence length unchanged
T_next = T_out[:, :D]                           # slice back to (L × D)
X_next = X_out
```
This operation is performed for each fusion-enabled block and involves only concatenation and slicing along the embedding axis—distinct from sequence-axis concatenation.
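As a concrete shape check, the following runnable NumPy snippet traces the fuse-then-slice step with illustrative sizes (the dimensions are assumptions for demonstration, not FIBO's configuration):

```python
import numpy as np

L, D, d_llm = 1000, 1536, 2048                   # illustrative sizes
H_b = np.random.randn(L, d_llm)                  # tapped LLM layer for block b
Wp_b, bp_b = np.random.randn(d_llm, D), np.random.randn(D)
T_prev = np.random.randn(L, D)

Z_b = H_b @ Wp_b + bp_b                          # (1000, 1536)
T_fuse = np.concatenate([T_prev, Z_b], axis=-1)
print(T_fuse.shape)                              # (1000, 3072): width doubles, length fixed
T_next = T_fuse[:, :D]                           # stands in for slicing the block's output
print(T_next.shape)                              # (1000, 1536)
```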
3. Parameterization and Learnable Layers
DimFusion introduces minimal overhead in parameter count compared to the base diffusion transformer. For each fusion block $b$:
- There is a unique projection matrix $W_b \in \mathbb{R}^{d_{\mathrm{LLM}} \times D}$ and bias vector $\beta_b \in \mathbb{R}^{D}$.
- Optionally, adapters or gating vectors may modulate the fusion of intermediate and final features, although these are not required.
- No new sequence tokens are introduced, and positional encodings remain untouched.
The total parameter increase is proportional to $B \cdot d_{\mathrm{LLM}} \cdot D$, where $B$ is the number of fusion blocks and $d_{\mathrm{LLM}}$ is the LLM hidden size. In FIBO, approximately half of the diffusion blocks are dual-stream (attending jointly over text and image, as in MM-DiT-style designs such as Stable Diffusion 3), while the remainder are single-stream.
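For a sense of scale, a back-of-the-envelope count of the added parameters (all sizes here are illustrative assumptions, not FIBO's configuration):

```python
d_llm, D, B = 2048, 1536, 24                   # illustrative sizes
per_block = d_llm * D + D                      # one W_b plus one bias beta_b
total = B * per_block
print(f"{total / 1e6:.1f}M added parameters")  # 75.5M
```

Under these assumed sizes the overhead is a modest fraction of a billion-parameter-scale backbone, consistent with the claim of minimal overhead.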
4. Computational Efficiency for Long Captions
DimFusion improves computational efficiency relative to token-level fusion methods. In token fusion, merging the outputs of $k$ intermediate LLM layers into the sequence would expand the text sequence length from $L$ to $kL$, leading to attention costs scaling as $O(k^2 L^2)$ rather than $O(L^2)$. For long captions (large $L$), this results in multiplicative increases in memory and compute.
By maintaining a fixed sequence length and increasing only the embedding dimension locally (from $D$ to $2D$ in the text branch at each block), DimFusion avoids the quadratic cost of sequence extension. The immediate slicing operation after each block restores the embedding size, so the widening does not compound across blocks.
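To make the asymptotics concrete, a rough count of attention score entries under illustrative sizes (token fusion modeled as concatenating $k$ tapped layers along the sequence axis; all numbers are assumptions for demonstration):

```python
L, N, k = 1000, 4096, 4   # text tokens, image tokens, tapped LLM layers

token_fusion_pairs = (k * L + N) ** 2          # text sequence grows to k*L
dim_fusion_pairs = (L + N) ** 2                # sequence length unchanged

print(token_fusion_pairs / dim_fusion_pairs)   # ~2.52x more attention entries
```

DimFusion's extra cost sits instead in the wider per-block text projections, which scale linearly rather than quadratically in $L$.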
Empirical ablation results with a 1B-parameter DiT-style model:
| Fusion Type | Wall-clock per step | FID |
|---|---|---|
| TokenFusion | 0.8 seconds | 15.90 |
| DimFusion | 0.5 seconds | 15.58 |
This demonstrates an approximately 1.6× speedup for comparable or improved generation quality.
5. Comparison to Sequence Token-Level Conditioning
Standard text-to-image diffusion models condition on final-layer LLM text encodings, injected via cross-attention in a limited number of layers. Token-fusion approaches instead concatenate intermediate LLM representations as extra tokens, substantially increasing sequence length and compute. Token fusion, as in HiDream-I1, achieves more granular control but at high computational cost.
DimFusion, by contrast, achieves:
- Deep, block-wise injection of multi-layer LLM features while preserving fixed sequence length.
- Lower quadratic compute cost compared to token-fusion at scale.
- Improved prompt adherence and generation quality (as measured by prompt-alignment metrics and FID) and stronger disentangled control, particularly for long-form, structured captions, as validated on TaBR and PRISM-Bench.
6. Practical Implementation Considerations
To instantiate DimFusion within a diffusion transformer:
- Introduce per-block linear projection layers for projecting LLM hidden states.
- Modify each enabled transformer block to perform concatenation along the embedding dimension, followed by attention, and subsequent dimensionality reduction via slicing.
- Reuse positional embeddings and existing diffusion block architectures without expanding sequence dimension.
- Train the model end-to-end as in the FIBO paradigm, injecting both intermediate and final LLM features as prescribed.
No modification to the image-latent path and no additional text tokens are required; a schematic of the cross-block wiring is sketched below. This design scales to very long, structured captions, which is key for professional-grade text-to-image synthesis scenarios demanding high expressiveness and controllability.
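Putting the list above together, the cross-block wiring reduces to a loop of project, concatenate, attend, slice. A schematic PyTorch sketch under the same assumptions as the Section 1 example (hypothetical names; each block must accept the widened text stream, as `ToyJointBlock` does):

```python
import torch
import torch.nn as nn

class DimFusionStack(nn.Module):
    """Schematic DimFusion wiring across a stack of diffusion blocks.
    `blocks` are existing transformer blocks whose text stream accepts the
    widened 2*D input; only the per-block LLM projections are new parameters.
    The one-tapped-LLM-layer-per-block scheme is an assumption of this sketch."""

    def __init__(self, blocks: nn.ModuleList, d_llm: int, d: int):
        super().__init__()
        self.blocks = blocks
        # One learnable projection (W_b, beta_b) per fusion-enabled block.
        self.projs = nn.ModuleList([nn.Linear(d_llm, d) for _ in blocks])
        self.d = d

    def forward(self, t, x, llm_layers):
        # llm_layers: one (batch, L, d_llm) hidden-state tensor per block
        for block, proj, h in zip(self.blocks, self.projs, llm_layers):
            t_fuse = torch.cat([t, proj(h)], dim=-1)  # (batch, L, 2*D)
            t_out, x = block(t_fuse, x)               # sequence length unchanged
            t = t_out[..., : self.d]                  # slice back to (batch, L, D)
        return x                                      # image latents after the stack
```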
7. Context and Related Work
DimFusion draws from attention-mechanism innovations in text-to-image synthesis, extending dual-stream fusion schemes such as the MM-DiT of Stable Diffusion 3 [esser2024scaling] and cross-attention UNet architectures [Rombach et al.]. It addresses inefficiencies in approaches like HiDream-I1 [Cai et al.], which rely on token fusion along the sequence axis. By efficiently leveraging intermediate LLM representations, DimFusion realizes scalable, high-fidelity alignment between detailed textual inputs and generated imagery, as evidenced by the FIBO model's performance on prompt-alignment and reconstruction metrics.
A plausible implication is that future diffusion-based multi-modal architectures may generalize the underlying "embedding-concatenation-then-slice" schema for other high-dimensional conditioning modalities beyond text.