- The paper presents HiDream-I1, a foundation model with 17 billion parameters that uses a sparse Diffusion Transformer and latent flow matching to balance image quality and speed.
- It integrates a hybrid text encoding system and a multi-stage training strategy with rigorous data pre-processing to achieve strong prompt adherence and high human-preference scores.
- The model extends its capabilities to precise image editing and interactive multimodal applications, driving forward practical advances in automated image generation.
The paper introduces HiDream-I1 (2505.22705), a new open-source image generative foundation model designed to address the trade-off between image generation quality and computational efficiency. With 17 billion parameters, HiDream-I1 aims to produce state-of-the-art image quality within seconds by employing a novel sparse Diffusion Transformer (DiT) architecture.
Key Concepts and Architecture:
HiDream-I1 is built upon the principles of Flow Matching (2210.02747), learning a continuous-time transformation from noise to data by modeling a velocity field. It operates in a latent space using a pre-trained VAE for efficiency, encoding images into latent representations for training. The core architecture consists of a hybrid text encoding module and a unique sparse DiT backbone.
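A minimal sketch of this latent flow-matching objective, assuming a rectified-flow-style linear path between the VAE latent and Gaussian noise with a velocity-prediction target; the `dit` and `vae` interfaces and the exact schedule are placeholders, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def latent_flow_matching_loss(dit, vae, images, text_emb, pooled_emb):
    """Flow-matching loss in VAE latent space (illustrative sketch).

    Assumes a linear path x_t = (1 - t) * x0 + t * noise, so the target
    velocity is (noise - x0); HiDream-I1's exact parameterization may differ.
    """
    with torch.no_grad():
        x0 = vae.encode(images)                    # clean image latents
    noise = torch.randn_like(x0)                   # pure Gaussian noise endpoint
    t = torch.rand(x0.shape[0], device=x0.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * noise              # point on the noise-to-data path
    v_pred = dit(xt, t, text_emb, pooled_emb)      # transformer predicts the velocity
    return F.mse_loss(v_pred, noise - x0)
```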
- Hybrid Text Encoding: To capture rich textual understanding, HiDream-I1 integrates features from multiple text encoders:
- Long-Context CLIP (L/14 and G/14) (2401.08281) for robust global embeddings that provide visual grounding.
- T5-XXL encoder for parsing complex text structures.
- Intermediate layer features from a powerful decoder-only LLM (Llama 3.1 8B Instruct) for deep semantic details.
These embeddings are processed and concatenated to form the primary text conditioning signal (see the text-conditioning sketch after this architecture list).
- Sparse Diffusion Transformer (DiT) Backbone: The DiT backbone processes patchified latent representations.
- Dual-stream DiT Blocks: Initial layers process image tokens and text tokens independently in parallel pathways, similar to MMDiT (2407.18691).
- Single-stream DiT Blocks: After the dual-stream layers, image and text tokens are concatenated, and subsequent layers operate on this combined sequence.
- Sparse Mixture-of-Experts (MoE): Integrated into both dual-stream and single-stream blocks, replacing standard feed-forward networks. A gating network dynamically routes each token to a subset of specialized FFN experts, including a shared expert. Experts use SwiGLU activations (2002.05202), allowing a significant increase in capacity while keeping per-token computation controlled (see the MoE sketch after this list).
- Conditioning: Global conditioning from CLIP features and timestep embeddings is injected via adaptive layer normalization (adaLN). QK-normalization (2407.18691) is used for training stability.
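For the hybrid text encoding above, a minimal sketch of how the encoder outputs might be combined: T5 token features and intermediate-layer Llama features are projected to the model width and concatenated along the sequence dimension, while pooled CLIP embeddings form the global conditioning vector used by adaLN. The projection layers, the mean over Llama layers, the concatenation axis, and all dimensions are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class HybridTextConditioner(nn.Module):
    """Assemble T5, Llama, and CLIP features into conditioning signals (sketch)."""

    def __init__(self, t5_dim=4096, llama_dim=4096, clip_dim=768 + 1280, model_dim=2560):
        super().__init__()
        self.t5_proj = nn.Linear(t5_dim, model_dim)
        self.llama_proj = nn.Linear(llama_dim, model_dim)
        self.pooled_proj = nn.Linear(clip_dim, model_dim)  # global vector, injected via adaLN

    def forward(self, t5_tokens, llama_hidden_states, clip_pooled):
        # t5_tokens:           [B, L_t5, t5_dim] token features from the T5-XXL encoder
        # llama_hidden_states: list of intermediate-layer tensors, each [B, L_llm, llama_dim]
        # clip_pooled:         [B, clip_dim] concatenated pooled CLIP-L/14 and CLIP-G/14 outputs
        llama_feats = torch.stack(llama_hidden_states).mean(dim=0)  # assumed layer aggregation
        text_seq = torch.cat(
            [self.t5_proj(t5_tokens), self.llama_proj(llama_feats)], dim=1
        )                                          # token-level conditioning for the DiT
        pooled = self.pooled_proj(clip_pooled)     # combined with timestep embeddings downstream
        return text_seq, pooled
```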
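And for the sparse MoE blocks, a minimal sketch of token-wise top-k routing over SwiGLU experts plus a shared expert, treated here as always active; the expert count, k, hidden sizes, and the softmax-then-renormalize gating are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Feed-forward expert with a SwiGLU activation (2002.05202)."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class SparseMoEFFN(nn.Module):
    """Route each token to its top-k experts and add a shared expert (sketch)."""

    def __init__(self, dim=2560, hidden=6912, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(dim, hidden) for _ in range(num_experts))
        self.shared_expert = SwiGLUExpert(dim, hidden)
        self.top_k = top_k

    def forward(self, x):                              # x: [B, N, dim] token sequence
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e             # tokens sent to expert e in this slot
                if mask.any():
                    w = weights[..., slot][mask].unsqueeze(-1)
                    routed[mask] = routed[mask] + w * expert(x[mask])
        return self.shared_expert(x) + routed          # shared expert processes every token
```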
Data and Training:
The model's capabilities are significantly influenced by its training data. HiDream-I1 uses a rigorous data pre-processing pipeline:
- Collection: Aggregating web-sourced and internal copyrighted images, focusing on diversity.
- Deduplication: A two-stage process using SSCD (2009.08924) for feature extraction, k-means clustering, and GPU-accelerated Faiss (2401.08281) for intra-cluster near-duplicate removal, eliminating about 20% of the data (see the sketch after this list).
- Filtering: Applying content safety (NSFW), aesthetic quality [laion2024aesthetic], watermark detection [laion2024watermark], and technical quality filters (Top-IQ (2403.17561), bytes-per-pixel).
- Annotation: Using the MiniCPM-V 2.6 VLM [openbmb2024minicpm] for automatic, detailed captioning, conditioned on image content and existing metadata, with specific prompting to control style and length.
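For the deduplication stage above, a sketch of the two-stage logic with Faiss (shown on CPU for clarity; the paper uses GPU-accelerated Faiss). SSCD descriptors are assumed to be pre-extracted and L2-normalized, and the cluster count and similarity threshold are placeholders:

```python
import numpy as np
import faiss

def near_duplicate_ids(features: np.ndarray, n_clusters: int = 1000, sim_thresh: float = 0.9):
    """Flag near-duplicate images from (assumed) normalized SSCD descriptors, shape [N, D]."""
    features = np.ascontiguousarray(features, dtype=np.float32)
    d = features.shape[1]

    # Stage 1: coarse k-means clustering of the whole corpus.
    kmeans = faiss.Kmeans(d, n_clusters, niter=20)
    kmeans.train(features)
    _, assign = kmeans.index.search(features, 1)        # nearest centroid per image

    # Stage 2: exact near-duplicate search within each cluster.
    duplicates = set()
    for c in range(n_clusters):
        ids = np.where(assign[:, 0] == c)[0]
        if len(ids) < 2:
            continue
        index = faiss.IndexFlatIP(d)                     # inner product == cosine (normalized)
        index.add(features[ids])
        lims, _, nbrs = index.range_search(features[ids], sim_thresh)
        for i in range(len(ids)):
            for j in nbrs[lims[i]:lims[i + 1]]:
                if j > i:                                # keep the earliest copy, drop the rest
                    duplicates.add(int(ids[j]))
    return sorted(duplicates)
```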
Training involves a multi-stage strategy:
- Pre-training: Trains the sparse DiT backbone using Latent Flow Matching on pre-computed VAE latents. This involves progressive resolution training, starting at 256x256, then 512x512, and finally 1024x1024. Training utilizes AdamW (1711.05101), FSDP (1912.01703), mixed precision, and gradient checkpointing (a condensed sketch of one training step follows this list).
- Post-Training: Fine-tunes on a high-quality, human-annotated dataset to improve prompt fidelity, aesthetics, and preference alignment.
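A condensed sketch of one pre-training step under this recipe, operating on pre-computed latents with AdamW and bf16 mixed precision; FSDP sharding and gradient checkpointing are omitted, and the model interface and hyperparameters are placeholders carried over from the earlier flow-matching sketch:

```python
import torch
import torch.nn.functional as F

def pretrain_step(dit, optimizer, latents, text_emb, pooled_emb):
    """One latent flow-matching step on pre-computed VAE latents (sketch)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):   # mixed precision
        noise = torch.randn_like(latents)
        t = torch.rand(latents.shape[0], device=latents.device)
        t_ = t.view(-1, 1, 1, 1)
        xt = (1.0 - t_) * latents + t_ * noise
        loss = F.mse_loss(dit(xt, t, text_emb, pooled_emb), noise - latents)
    loss.backward()                        # bf16 autocast needs no gradient scaler
    optimizer.step()
    return loss.detach()

# e.g. optimizer = torch.optim.AdamW(dit.parameters(), lr=1e-4, weight_decay=0.01)
# (learning rate and weight decay here are placeholders, not the paper's values)
```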
Inference Acceleration and Extensions:
To enable faster inference, the paper introduces accelerated variants through GAN-powered Diffusion Model Distillation:
- HiDream-I1-Dev (28 steps) and HiDream-I1-Fast (14 steps) are distilled from the full HiDream-I1-Full model (>50 steps).
- The distillation objective combines the standard Diffusion Model Distillation (DMD) loss (2407.09985) with an adversarial loss (L_adv). A discriminator built on frozen teacher backbone features is trained to distinguish real images from images decoded from the student's latents, encouraging the student to produce perceptually sharp outputs (a sketch of the combined objective follows).
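A sketch of the combined objective: `dmd_loss` is a hypothetical placeholder for the distribution-matching term, `teacher_backbone.extract_features` and `disc_head` stand in for the feature-based discriminator, and the non-saturating GAN losses and `lambda_adv` weighting are illustrative choices rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def generator_step(student, teacher_backbone, disc_head, vae, noise, cond, lambda_adv=0.1):
    """Student (generator) update for GAN-augmented distillation (illustrative sketch)."""
    fake_latents = student(noise, cond)               # few-step student prediction
    loss_dmd = dmd_loss(fake_latents, cond)           # hypothetical helper for the DMD term

    # Teacher parameters are frozen, but gradients still flow through its activations
    # back to the student via the decoded images.
    fake_images = vae.decode(fake_latents)
    feats = teacher_backbone.extract_features(fake_images, cond)   # placeholder interface
    loss_adv = F.softplus(-disc_head(feats)).mean()   # non-saturating generator loss
    return loss_dmd + lambda_adv * loss_adv

def discriminator_step(student, teacher_backbone, disc_head, vae, noise, cond, real_images):
    """Discriminator update: tell real images apart from decoded student samples (sketch)."""
    with torch.no_grad():                             # only disc_head receives gradients here
        fake_images = vae.decode(student(noise, cond))
        real_feats = teacher_backbone.extract_features(real_images, cond)
        fake_feats = teacher_backbone.extract_features(fake_images, cond)
    return (F.softplus(-disc_head(real_feats)).mean()
            + F.softplus(disc_head(fake_feats)).mean())
```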
Beyond text-to-image generation, HiDream-I1 is extended for other applications:
- HiDream-E1 (Instruction-based Image Editing): Fine-tuned from HiDream-I1, this model takes a source image and a text instruction to perform precise edits. It is trained on (source image, instruction, target image) triplets. During training, source and target latents are spatially concatenated, and the model learns to generate the target latent conditioned on the source latent and the instruction. A spatially weighted loss focuses learning on the regions where source and target differ (see the sketch after this list).
- HiDream-A1 (Image Agent): An interactive multimodal system that integrates HiDream-I1 (generation) and HiDream-E1 (editing) within a conversational interface. A Coordinator module manages user input (text/visual), routing tasks to a Planner (for generation/editing) or a Chat module.
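For HiDream-E1's training objective, a sketch of the spatially weighted flow-matching loss: the clean source latent is concatenated with the noised target latent (here along the width axis, an assumption), and squared errors are upweighted where source and target latents differ; the weight value and difference threshold are placeholders:

```python
import torch

def editing_loss(dit, src_latent, tgt_latent, instr_emb, pooled_emb,
                 diff_weight=4.0, eps=0.05):
    """Spatially weighted flow-matching loss for instruction-based editing (sketch)."""
    noise = torch.randn_like(tgt_latent)
    t = torch.rand(tgt_latent.shape[0], device=tgt_latent.device).view(-1, 1, 1, 1)
    noisy_tgt = (1.0 - t) * tgt_latent + t * noise
    # Spatial concatenation: the model sees the clean source next to the noised target.
    x_in = torch.cat([src_latent, noisy_tgt], dim=-1)
    v_pred = dit(x_in, t.flatten(), instr_emb, pooled_emb)
    v_pred_tgt = v_pred[..., src_latent.shape[-1]:]        # keep only the target half
    v_target = noise - tgt_latent
    # Upweight the regions where source and target actually differ (the edited area).
    diff_mask = (src_latent - tgt_latent).abs().mean(dim=1, keepdim=True) > eps
    weight = 1.0 + (diff_weight - 1.0) * diff_mask.float()
    return (weight * (v_pred_tgt - v_target) ** 2).mean()
```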
Evaluation:
HiDream-I1 and HiDream-E1 were evaluated against state-of-the-art models on several benchmarks:
- HiDream-I1 (Text-to-Image):
- Prompt Adherence: Evaluated on DPG-Bench (2403.05135) and GenEval (2309.16304). HiDream-I1 achieved the highest overall scores on both, demonstrating strong semantic understanding and compositional accuracy, particularly excelling in interpreting relationships and detailed instructions (DPG-Bench) and performing single-object, two-object, counting, and color-related tasks (GenEval).
- Human Preference: Evaluated using HPSv2.1 (2306.09341). HiDream-I1 achieved the highest average score and ranked first across all tested styles (Animation, Concept Art, Painting, Photo), indicating a strong alignment with human visual preferences.
- HiDream-E1 (Image Editing): Evaluated on EmuEdit (2404.13234) and ReasonEdit (2403.08349) using an automated GPT-4o metric assessing edit success and preservation of unchanged areas. HiDream-E1 obtained the highest overall average scores on both benchmarks, showcasing its proficiency in handling complex instructions and performing precise edits across various tasks (Global, Text, Color, Style, Remove, Local edits on EmuEdit).
The paper concludes by releasing the code and model weights for HiDream-I1 (all variants) and HiDream-E1, along with a live demo, to foster further research and application development in multimodal AIGC.