Qwen-Image: Multimodal Image Generation & Editing

Updated 5 August 2025
  • Qwen-Image is a multimodal model that integrates high-fidelity image synthesis and precise text rendering for both alphabetic and logographic content.
  • It employs a multi-stage data pipeline, curriculum training, and dual-encoding architecture to optimize image generation and text layout accuracy.
  • The model sets new benchmarks in Chinese text rendering and image editing, providing robust performance for complex, instruction-based tasks.

Qwen-Image is a foundation model for image generation and editing within the Qwen multimodal AI series, focusing on advanced text rendering—including logographic Chinese—and high-fidelity image editing. Its technical innovations span a comprehensive data pipeline, curriculum-based progressive training, a multi-task model architecture, and a dual-encoding strategy that jointly leverages semantic and reconstructive representations. These design choices enable Qwen-Image to set new benchmarks in both naturalistic image synthesis and state-of-the-art text rendering, particularly excelling in challenging scenarios such as paragraph-level Chinese composition and instruction-based image editing (Wu et al., 4 Aug 2025).

1. Data Pipeline Engineering and Curation

Qwen-Image relies on a multi-stage, quality-centric data pipeline to address the dual demands of text rendering and image generation. The pipeline operates across the following stages:

  • Data Collection spans domains such as Nature, Design, People, and Synthetic Data, accumulating billions of image–text pairs.
  • Multistage Filtering includes resolution and corruption checks, deduplication, and NSFW filtering (Stage 1); automatic image enhancement using blur, brightness, and texture normalization (Stage 2); and text-image alignment improvement by creating splits for Raw Caption, Recaption (using a Qwen-VL captioner), and Fused Caption alignment (Stage 3).
  • Text Rendering Data Enhancement (Stage 4) introduces:
    • Pure Rendering (e.g., dynamic font layout on homogeneous backgrounds),
    • Compositional Rendering (text embedded in photorealistic scenes), and
    • Complex Rendering (structured templates mimicking slides or multi-block layouts).
    The pipeline further divides the data by language and presence of text, and supplements it with synthetic text-rich samples critical for rare Chinese characters and complex layout generalization.
  • Annotation and Balancing uses Qwen2.5-VL to produce structured captions and instance-level metadata, and enforces balanced sampling with multi-resolution training up to 1328×1328.

This strategy ensures the resulting dataset is not only massive but also offers the linguistic and visual diversity required to optimize text rendering for both alphabetic and logographic scripts.
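
A minimal sketch of how such a staged filter-and-route pipeline could be organized, assuming simplified stage predicates and bucket names; the thresholds and helpers here are illustrative, not the production pipeline:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

@dataclass
class Sample:
    image_path: str
    caption: str
    width: int
    height: int
    language: str          # e.g. "en" or "zh"
    has_rendered_text: bool

def passes_integrity(s: Sample, min_side: int = 256) -> bool:
    """Stage 1-style checks (resolution, empty caption); thresholds are assumptions."""
    return min(s.width, s.height) >= min_side and bool(s.caption.strip())

def route_by_text(samples: Iterable[Sample]) -> Dict[str, List[Sample]]:
    """Stage 4-style routing by language and presence of rendered text."""
    buckets: Dict[str, List[Sample]] = {"zh_text": [], "other_text": [], "no_text": []}
    for s in samples:
        if not passes_integrity(s):
            continue
        if not s.has_rendered_text:
            buckets["no_text"].append(s)
        elif s.language == "zh":
            buckets["zh_text"].append(s)
        else:
            buckets["other_text"].append(s)
    return buckets

if __name__ == "__main__":
    demo = [
        Sample("a.jpg", "a street sign reading 北京", 1024, 768, "zh", True),
        Sample("b.jpg", "a mountain lake at dawn", 640, 480, "en", False),
        Sample("c.jpg", "", 120, 90, "en", False),  # fails the integrity check
    ]
    print({k: len(v) for k, v in route_by_text(demo).items()})
```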

2. Progressive Curriculum Training for Text Rendering

Qwen-Image applies a curriculum learning approach that incrementally scales the complexity of its training objectives:

  • Initial Phases focus on non-text and simple-text images, allowing the model to master basic generative visual priors.
  • Intermediate Phases introduce synthetic data with short and single-line text. Here, precise layout and glyph accuracy for both English and Chinese are emphasized.
  • Advanced Phases present paragraph-level, multicolumn, and slide-template text compositions, requiring globally consistent and layout-sensitive rendering.

The curriculum is tightly coupled with multi-resolution sampling, beginning from low-resolution images and advancing towards high-resolution regimes to hone small character rendering, crucial for dense scripts like Chinese.
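
A compact sketch of how curriculum phases and training resolution could be coupled to the training step; the phase boundaries, bucket names, and resolutions below are illustrative assumptions rather than the reported recipe:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Phase:
    name: str
    data_buckets: List[str]   # which text-complexity buckets are sampled
    resolution: int           # square training resolution for this phase

# Phase order mirrors the curriculum described above; numbers are assumptions.
CURRICULUM = [
    Phase("visual_priors",  ["no_text"],                            256),
    Phase("simple_text",    ["no_text", "short_text"],              512),
    Phase("complex_text",   ["short_text", "paragraph", "slides"], 1024),
]

def phase_for_step(step: int, steps_per_phase: int = 100_000) -> Phase:
    """Advance through phases as training progresses (simple step-based schedule)."""
    idx = min(step // steps_per_phase, len(CURRICULUM) - 1)
    return CURRICULUM[idx]

if __name__ == "__main__":
    for step in (0, 150_000, 350_000):
        p = phase_for_step(step)
        print(step, p.name, p.resolution, p.data_buckets)
```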

A direct consequence is significantly higher character- and word-level accuracy than prior models, particularly on challenging benchmarks such as ChineseWord (overall accuracy ≈ 58.3%) and LongText-Bench. The curriculum also mitigates omitted, duplicated, or swapped characters, a chronic failure mode of earlier T2I systems for logographic languages (Wu et al., 4 Aug 2025).

3. Multi-task Model Architecture and Training Paradigm

Qwen-Image supports three primary modeling paradigms within a unified architecture:

  • Text-to-Image (T2I) Generation: synthesis conditioned only on the text prompt.
  • Text-Image-to-Image (TI2I) Editing: Instruction-driven modifications of existing images (object insertion/removal, style transfer, in-image text rewriting).
  • Image-to-Image (I2I) Reconstruction: High-fidelity self-reconstruction to align and regularize latent representations.

The multi-task learning pipeline jointly optimizes for all three, with tasks interleaved per batch. The core generator is a Multimodal Diffusion Transformer (MMDiT), operating over a shared latent space to unify editing and synthesis.

The loss for flow-matching training is expressed as:

$$\mathcal{L} = \mathbb{E}_{(x_0, h)\sim\mathcal{D},\; x_1\sim\mathcal{N}(0, I),\; t\sim p(t)} \left\| v_{\theta}(x_t, t, h) - (x_0 - x_1) \right\|^2$$

where $x_t = t\,x_0 + (1-t)\,x_1$ and $v_{\theta}$ is the MMDiT's predicted flow at time $t$.
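
A minimal PyTorch sketch of this objective, assuming `model` stands in for the MMDiT velocity predictor and that the time distribution p(t) is uniform (the actual sampling scheme may differ):

```python
import torch

def flow_matching_loss(model, x0, cond):
    """x0: clean latents of shape (B, C, H, W); cond: text/vision conditioning h."""
    b = x0.shape[0]
    x1 = torch.randn_like(x0)                 # x1 ~ N(0, I)
    t = torch.rand(b, device=x0.device)       # t ~ p(t), here uniform (assumption)
    t_ = t.view(b, 1, 1, 1)
    xt = t_ * x0 + (1.0 - t_) * x1            # x_t = t * x0 + (1 - t) * x1
    v_target = x0 - x1                        # target flow (velocity)
    v_pred = model(xt, t, cond)               # v_theta(x_t, t, h)
    return ((v_pred - v_target) ** 2).mean()
```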

This paradigm enables explicit conditioning on both the textual and visual context, significantly improving editing locality and instruction following.

4. Dual-Encoding Pathway: Semantic and Reconstructive Representations

For editing consistency and semantic grounding, Qwen-Image independently encodes each input image using:

  • Qwen2.5-VL—a multimodal LLM extracting high-level semantics and relational context from both the text prompt and (optionally) image regions. This ensures comprehension of user intent and linguistic nuances, crucial for complex editing operations.
  • VAE Encoder—a variational autoencoder that provides compressive, reconstructive embeddings preserving color, layout, and fine-grained structure (such as text glyphs or spatial motifs).

During image editing, these representations are fused—typically by concatenation or cross-attention within the MMDiT—and aligned using Multimodal Scalable RoPE (MSRoPE) positional encoding. This design enables the editing module to balance semantic consistency (faithfulness to the source intent) and visual fidelity (pixel-level coherence).
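
A schematic sketch of the fusion step, assuming concatenation along the token sequence; all dimensions, the patch size, and the projection layers are illustrative assumptions rather than the released architecture:

```python
import torch
import torch.nn as nn

class DualConditionFusion(nn.Module):
    """Project semantic (multimodal LLM) tokens and VAE latent tokens to a shared
    width and concatenate them as conditioning for the diffusion transformer."""
    def __init__(self, sem_dim=3584, vae_channels=16, model_dim=2048, patch=2):
        super().__init__()
        self.patch = patch
        self.sem_proj = nn.Linear(sem_dim, model_dim)
        self.vae_proj = nn.Linear(vae_channels * patch * patch, model_dim)

    def forward(self, sem_tokens, vae_latent):
        # sem_tokens: (B, N_text, sem_dim) from the multimodal LLM encoder
        # vae_latent: (B, C, H, W) from the VAE encoder of the input image
        b, c, h, w = vae_latent.shape
        p = self.patch
        vae_tokens = (vae_latent
                      .unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
                      .permute(0, 2, 3, 1, 4, 5)
                      .reshape(b, (h // p) * (w // p), c * p * p))
        # Concatenate along the sequence dimension; positional alignment
        # (e.g. MSRoPE) would be applied inside the transformer itself.
        return torch.cat([self.sem_proj(sem_tokens),
                          self.vae_proj(vae_tokens)], dim=1)

if __name__ == "__main__":
    fusion = DualConditionFusion()
    sem = torch.randn(1, 77, 3584)
    lat = torch.randn(1, 16, 64, 64)
    print(fusion(sem, lat).shape)   # torch.Size([1, 77 + 1024, 2048])
```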

This dual pathway supports high-quality inpainting, targeted region editing, and style adaptation, maintaining both local and global image coherence even on challenging, high-resolution content.

5. Performance Across Benchmarks

Qwen-Image is evaluated on a comprehensive range of public and bespoke multimodal benchmarks:

  • DPG (prompt adherence): highest scores for entity, attribute, and relation alignment.
  • CVTG-2K (text rendering): state-of-the-art word accuracy and NED.
  • ChineseWord (Chinese text): top-tier character- and word-level accuracy on logographic script.
  • LongText-Bench (layout text): leading long-form and paragraph-level text rendering.
  • GEdit / ImgEdit (image editing): best-in-class style and content manipulation.
  • OneIG-Bench (generation): state-of-the-art attribute and relation synthesis.

Qwen-Image demonstrates robust semantic alignment (high CLIPScore), strong image and layout fidelity (leading SSIM, PSNR, and LPIPS for reconstruction and editing), and state-of-the-art text rendering accuracy for both alphabetic and logographic scripts. The model's editing module produces consistent, region-localized transformations with minimal semantic drift, aided by the strong joint latent alignment provided by its dual-encoding mechanism.
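
As an illustration of how reconstruction and editing fidelity might be measured, a minimal scikit-image sketch for SSIM and PSNR; this is a generic measurement recipe, not the paper's exact evaluation protocol, and LPIPS would additionally require a learned perceptual metric such as the lpips package:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_fidelity(reference: np.ndarray, output: np.ndarray) -> dict:
    """reference, output: HxWx3 uint8 arrays of the same size."""
    return {
        "psnr": peak_signal_noise_ratio(reference, output, data_range=255),
        "ssim": structural_similarity(reference, output, channel_axis=-1, data_range=255),
    }

if __name__ == "__main__":
    ref = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    out = ref.copy()
    out[100:120, 100:120] = 0   # simulate a small, localized edit
    print(reconstruction_fidelity(ref, out))
```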

6. Text Rendering in Logographic and Multilingual Contexts

A defining capability of Qwen-Image is high-fidelity text rendering, especially in Chinese and other logographic languages, which traditionally present severe challenges due to character set size, frequency imbalance, and layout complexity.

The technical innovations underlying this result include:

  • Extensive long-tail glyph coverage via data balancing and synthetic augmentation.
  • Dynamic font size/layout adjustment algorithms that shrink or reflow text to minimize missing or obscured characters (a minimal sketch follows this list).
  • Structured template-based synthetic data to train on realistic multi-line, irregular, and mixed-language text.
  • A progressive curriculum that mitigates early overfitting to common alphabetic patterns by deferring exposure to complex logographic and structured compositions.
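
A minimal Pillow sketch of the dynamic font-sizing idea referenced above: the font is shrunk until the string fits inside the target box so no glyphs are clipped. The font path, box size, and step size are illustrative assumptions, the real rendering pipeline is substantially richer, and a CJK-capable font would be needed for Chinese text:

```python
from PIL import Image, ImageDraw, ImageFont

def render_fitting_text(text: str, box=(512, 128), font_path="DejaVuSans.ttf",
                        max_size=96, min_size=8) -> Image.Image:
    """Render `text` on a plain background, shrinking the font until it fits."""
    img = Image.new("RGB", box, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, min_size)
    for size in range(max_size, min_size - 1, -2):
        candidate = ImageFont.truetype(font_path, size)
        left, top, right, bottom = draw.textbbox((0, 0), text, font=candidate)
        if right - left <= box[0] and bottom - top <= box[1]:
            font = candidate
            break
    draw.text((4, 4), text, font=font, fill="black")
    return img

if __name__ == "__main__":
    render_fitting_text("Qwen-Image dynamic text fitting").save("sample.png")
```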

In benchmark head-to-heads, Qwen-Image addresses common failure points such as glyph omission, character swaps, and omitted rows, substantially outperforming previous diffusion and autoregressive text-to-image generation baselines.

7. Implications and Future Directions

Qwen-Image establishes a new standard for foundation image generation models with advanced multilingual and logographic text rendering, precise region editing, and consistency across multi-task conditioning. Its contribution is not limited to standard generation/editing but also provides a template for future multimodal models requiring semantically coupled, fine-grained, and high-resolution control across both natural and structured visual content.

A plausible implication is that the curriculum-based, dual-encoding, and multi-task architecture employed here may generalize to other demanding modalities such as complex diagram or document synthesis, video frame editing, or multimodal agent frameworks demanding tightly aligned semantic and low-level visual control.

Qwen-Image’s documented architecture, benchmarking, and supporting public resources enable rigorous reproducibility and downstream adaptation, providing an open foundation for research and industrial applications where text rendering fidelity and instruction-following are paramount (Wu et al., 4 Aug 2025).
