DIM-4.6B-T2I/Edit: Unified T2I & Editing Model

Updated 4 July 2026

The paper introduces chain-of-thought blueprinting to explicitly delegate design tasks, improving editing precision and compositionality.
The model employs a frozen Qwen2.5-VL-3B for reasoning, a trainable SANA1.5-1.6B for painting, and a lightweight two-layer MLP connector for cross-modal projection.
Empirical benchmarks demonstrate competitive FID and GenEval scores while using an order of magnitude fewer trainable parameters than comparable models.

DIM-4.6B-T2I/Edit is a unified, connector-based multimodal model for text-to-image generation and instruction-guided image editing that was introduced in “Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination” (Zeng et al., 2 Sep 2025). Its central claim is that current unified systems are strong at text-to-image generation but weak at precise, instruction-following image editing because the understanding module primarily functions as a translator while the generation module must simultaneously infer the original layout, identify what and where to edit, plan the new layout or content, and perform the rendering. DIM addresses this imbalance by explicitly assigning design responsibility to the understanding module through chain-of-thought imaginations and reserving the generation module for painting, yielding a compact system built from a frozen Qwen2.5-VL-3B, a trainable SANA1.5-1.6B, and a lightweight two-layer MLP connector (Zeng et al., 2 Sep 2025).

1. Conceptual basis and Draw-In-Mind paradigm

DIM-4.6B-T2I/Edit is organized around the Draw-In-Mind paradigm, which argues that existing connector-based unified models such as Step1X-Edit and UniWorld-V1 under-utilize the understanding module and overburden the generator (Zeng et al., 2 Sep 2025). In those systems, the understanding module acts mostly as a translator that converts image-and-instruction input into semantic condition tokens, while the generation module must act as both designer and painter. The DIM formulation treats that division of labor as counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module.

The proposed remedy is explicit design delegation. The understanding module is asked to produce chain-of-thought imaginations: detailed, step-wise textual blueprints of the desired edit. The paper frames this as closer to human practice, in which a mental blueprint is formed before drawing. In DIM, the generator is therefore conditioned on a much richer specification of what should be preserved, what should be modified, and what the edited image should finally depict (Zeng et al., 2 Sep 2025).

This design choice places DIM within a broader shift from one-shot semantic conditioning toward staged planning. A plausible implication is that DIM internalizes, during training, a role separation that appears in modular planning-and-editing frameworks such as “GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis” (Goswami et al., 2024), but does so inside a unified T2I/edit model rather than as a purely training-free orchestration layer.

2. Model architecture and division of responsibilities

DIM-4.6B-T2I/Edit uses the same core backbone for both text-to-image generation and image editing: a frozen Qwen2.5-VL-3B as the understanding and reasoning module, a trainable SANA1.5-1.6B diffusion transformer as the image synthesis module, and a two-layer MLP connector that projects multimodal tokens into the DiT conditioning space (Zeng et al., 2 Sep 2025). The “4.6B” refers to the approximate total number of parameters: 3.0B in the frozen Qwen2.5-VL-3B and 1.6B in the trainable SANA1.5-1.6B, while the connector is comparatively tiny and deliberately lightweight.
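The parameter budget behind the "4.6B" label can be checked with a few lines of arithmetic. This is only a sketch using the rounded counts quoted above; the connector is treated as negligible, which is an assumption because its exact size is not stated here, and the variable names are illustrative.

```python
# Rounded parameter counts in billions, as quoted in the text.
# The connector is ignored (assumption: its size is not given here).
FROZEN_UNDERSTANDING_B = 3.0   # Qwen2.5-VL-3B, frozen
TRAINABLE_GENERATOR_B = 1.6    # SANA1.5-1.6B, trainable

total_b = FROZEN_UNDERSTANDING_B + TRAINABLE_GENERATOR_B
trainable_share = TRAINABLE_GENERATOR_B / total_b

print(f"total: about {total_b:.1f}B parameters")
print(f"trainable share: about {trainable_share:.1%}")
```

Under these rounded figures, roughly a third of the parameters receive gradient updates.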

The architectural distinction is functional rather than merely modular. Qwen2.5-VL-3B is responsible for long-context, multi-step reasoning, vision-language understanding, and producing multimodal tokens that encode both the user instruction and the detailed design blueprint. The two-layer MLP performs the cross-model projection. SANA1.5-1.6B is then conditioned on those projected tokens and is tasked with painting only. In editing mode, the generator receives a concatenation of the source image and noise along the channel dimension, following the InstructPix2Pix-style formulation described in the paper (Zeng et al., 2 Sep 2025).

At inference time, the chain-of-thought blueprint is concatenated with the edit instruction into a single conditioning string,

$T_\text{full} = T_{\text{inst}} \ \|\ \text{GLP} \ \|\ \text{LOP} \ \|\ \text{EAL} \ \|\ \text{EII},$

where $T_{\text{inst}}$ is the edit instruction, GLP is Global Layout Perception, LOP is Local Object Perception, EAL is Edit Area Localization, and EII is Edited Image Imagination. Qwen2.5-VL-3B consumes the source image and $T_\text{full}$, outputs multimodal tokens $\mathbf{h}$, and the connector maps them by

$\mathbf{z} = \text{MLP}(\mathbf{h}).$

The generator then edits from an augmented input

$x_0 = \text{concat}(I_\text{src}, \text{noise}),$

conditioned on $\mathbf{z}$ (Zeng et al., 2 Sep 2025).
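The two tensor-level operations in this data flow, the per-token projection $\mathbf{z} = \text{MLP}(\mathbf{h})$ and the channel-wise concatenation of source image and noise, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the widths and latent shape are tiny placeholders, the weights are random, and ReLU stands in for an activation that the text does not specify. It shows shapes only, not the actual DIM implementation.

```python
import random

def linear(x, weight, bias):
    """Compute W x + b for one token vector, using plain lists of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases; a stand-in for trained parameters."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def connector(tokens, layer1, layer2):
    """z = MLP(h): the same two-layer MLP is applied to every token."""
    projected = []
    for h in tokens:
        hidden = [max(0.0, v) for v in linear(h, *layer1)]  # ReLU is an assumption
        projected.append(linear(hidden, *layer2))
    return projected

def channel_concat(src_latent, noise):
    """Concatenate two [C][H][W] nested lists along the channel axis."""
    same_height = len(src_latent[0]) == len(noise[0])
    same_width = len(src_latent[0][0]) == len(noise[0][0])
    if not (same_height and same_width):
        raise ValueError("spatial sizes must match")
    return src_latent + noise

rng = random.Random(0)
D_LLM, D_HIDDEN, D_DIT = 8, 16, 4   # toy widths, not the real model sizes
C, H, W = 2, 3, 3                   # toy latent shape

h_tokens = [[rng.gauss(0.0, 1.0) for _ in range(D_LLM)] for _ in range(5)]
z = connector(h_tokens, make_layer(D_LLM, D_HIDDEN, rng), make_layer(D_HIDDEN, D_DIT, rng))

src = [[[1.0] * W for _ in range(H)] for _ in range(C)]
noise = [[[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)] for _ in range(C)]
x_in = channel_concat(src, noise)

print(len(z), len(z[0]))   # number of tokens, conditioning width
print(len(x_in))           # channels after concatenation
```

The point of the sketch is that the connector changes only the token width, while the generator input doubles its channel count and keeps the source image intact in the first half of the stack.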

This arrangement differs from methods that attach control or optimization machinery directly to the denoising process. A plausible implication is that DIM is complementary to trajectory-centric editing methods such as “TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing” (Chen et al., 2024), because DIM addresses design specification while TiNO-Edit addresses diffusion trajectory selection.

3. DIM dataset: DIM-T2I and DIM-Edit

The model’s behavior is tightly coupled to the DIM dataset, which has two complementary parts: DIM-T2I and DIM-Edit (Zeng et al., 2 Sep 2025). DIM-T2I contains 14M long-context image-text pairs and is intended to teach the system to handle detailed, multi-aspect descriptions. Its annotations are generated across 21 dimensions: Character Name, Scene Description, Actions and Interactions, Context and Environment, Emotion and Sentiment, Relationships and Spatial Arrangement, Color and Texture, Symbolism or Abstract Interpretation, Lighting and Shadows, Details and Fine Elements, Perspective and Composition, Time and Season, Target Audience, OCR, Person Description, Mathematics, Information Extraction, Planning, Science, Perception, and Metrics. The average prompt length is 146.76 words.

DIM-Edit contains 233K edit instances with chain-of-thought imaginations generated by GPT-4o (Zeng et al., 2 Sep 2025). It is assembled from three sources: UltraEdit-160K-CoT, ShareGPT-4o-Image-CoT, and HumanEdit-CoT. UltraEdit-160K-CoT is a filtered 160K subset of UltraEdit selected using SSIM, DINOv2 similarity, and CLIP similarity; ShareGPT-4o-Image-CoT contributes 46K semantically rich samples; HumanEdit-CoT adds 8K MagicBrush training edits and 19K SEED-Data-Edit-Part3 edits, emphasizing remove operations and high-fidelity manual edits. The average prompt length in DIM-Edit is 252.64 words.
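As a quick consistency check, the per-source counts quoted above can be summed against the stated 233K total. The figures are the rounded thousands given in the text; the dictionary keys are labels chosen here for readability.

```python
# Rounded instance counts in thousands, as quoted in the text.
dim_edit_sources_k = {
    "UltraEdit-160K-CoT": 160,
    "ShareGPT-4o-Image-CoT": 46,
    "HumanEdit-CoT (MagicBrush)": 8,
    "HumanEdit-CoT (SEED-Data-Edit-Part3)": 19,
}

total_k = sum(dim_edit_sources_k.values())
shares = {name: count / total_k for name, count in dim_edit_sources_k.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_k}K")
```

The sum matches the stated total, and the filtered UltraEdit subset accounts for roughly two thirds of DIM-Edit.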

A key part of DIM-Edit is its prompt optimization and alignment pipeline. For each raw source image, target image, and raw prompt triple, GPT-4o first judges whether the prompt is misaligned, partially aligned, or aligned with the actual edit. Misaligned examples are discarded. Partially aligned examples are rewritten so that the prompt covers all actual changes. Aligned prompts are refined to disambiguate underspecified phrases. GPT-4o then produces a four-step chain-of-thought imagination:

Global Layout Perception (GLP): describe all key objects and their positions in the source image.
Local Object Perception (LOP): describe appearance, including shape, color, texture, and state.
Edit Area Localization (EAL): specify which objects or regions will be modified.
Edited Image Imagination (EII): describe what the edited image should look like, focusing on modified and preserved areas.
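The triage step and the four-part blueprint described above can be sketched as a small data structure. This is a hypothetical illustration, not the authors' pipeline: the verdict strings, the `triage` function, the `Blueprint` class, and the newline separator are all assumptions, and `rewrite` and `refine` stand in for calls to the annotating model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

def triage(verdict: str, raw_prompt: str,
           rewrite: Callable[[str], str],
           refine: Callable[[str], str]) -> Optional[str]:
    """Return the cleaned prompt, or None if the example is discarded."""
    if verdict == "misaligned":
        return None                     # discarded
    if verdict == "partially_aligned":
        return rewrite(raw_prompt)      # cover all actual changes
    if verdict == "aligned":
        return refine(raw_prompt)       # disambiguate underspecified phrases
    raise ValueError(f"unknown verdict: {verdict!r}")

@dataclass
class Blueprint:
    glp: str  # Global Layout Perception
    lop: str  # Local Object Perception
    eal: str  # Edit Area Localization
    eii: str  # Edited Image Imagination

    def to_conditioning_text(self, instruction: str) -> str:
        """Join instruction and the four steps in the fixed GLP, LOP, EAL, EII order."""
        return "\n".join([instruction, self.glp, self.lop, self.eal, self.eii])

prompt = triage("aligned", "make the cup red", rewrite=str.upper, refine=str.strip)
plan = Blueprint(
    glp="A white cup sits on a wooden table, left of a laptop.",
    lop="The cup is ceramic, glossy, and empty.",
    eal="Only the cup surface changes.",
    eii="The same scene with a glossy red cup; table and laptop unchanged.",
)
print(plan.to_conditioning_text(prompt))
```

Keeping the four steps as separate fields makes it straightforward to drop one of them, which is how component ablations of the blueprint could be set up.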

The intended complementarity is explicit. DIM-T2I builds the base T2I model that can reason over long, detailed descriptions, while DIM-Edit adds explicit CoT supervision for editing, teaching the model to condition on blueprints and perform minimal, precise changes consistent with those blueprints (Zeng et al., 2 Sep 2025).

4. Training procedure and inference mechanics

DIM-4.6B-T2I is trained on DIM-T2I together with 6.9M additional image-text pairs from MidJourney-V6, COCO, InstructPix2Pix, JourneyDB, HQ-Edit, and Dimba, while explicitly excluding distillation datasets such as BLIP3-o-60K to avoid data leakage on GenEval (Zeng et al., 2 Sep 2025). Training proceeds in two stages: a Stage-0 connector warmup for 1 epoch at learning rate $2 \times 10^{-5}$, followed by 8 epochs of joint training of the connector and SANA1.5-1.6B at the same learning rate with batch size 256. Qwen2.5-VL-3B remains frozen throughout.

Editing is then learned in two more stages. Stage 1 uses the 4M UltraEdit dataset without CoT, with batch size 32, 10 epochs, and learning rate $1 \times 10^{-4}$, again using source-image-plus-noise concatenation as the DiT input. Stage 2 fine-tunes on DIM-Edit with the same architecture but CoT-augmented text, for 50 epochs at learning rate $1 \times 10^{-5}$ (Zeng et al., 2 Sep 2025). During both editing stages, only SANA1.5-1.6B and the two-layer MLP are updated.
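The four training stages can be written down as a compact schedule, which makes the frozen-versus-trainable split easy to audit. The `Stage` class and stage names are hypothetical; hyperparameters are copied from the text, and `None` marks values the text does not state (the warmup and Stage 2 batch sizes) rather than guessing them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    epochs: int
    lr: float
    batch_size: Optional[int]    # None where the text does not state a value
    trainable: Tuple[str, ...]

BOTH = ("connector", "SANA1.5-1.6B")

SCHEDULE = (
    Stage("t2i-stage0-warmup", "DIM-T2I + 6.9M extra pairs", 1, 2e-5, None, ("connector",)),
    Stage("t2i-joint", "DIM-T2I + 6.9M extra pairs", 8, 2e-5, 256, BOTH),
    Stage("edit-stage1", "UltraEdit 4M (no CoT)", 10, 1e-4, 32, BOTH),
    Stage("edit-stage2", "DIM-Edit (CoT-augmented)", 50, 1e-5, None, BOTH),
)

# The understanding module must never appear among the trainable parts.
assert all("Qwen2.5-VL-3B" not in stage.trainable for stage in SCHEDULE)

for stage in SCHEDULE:
    print(f"{stage.name}: {stage.epochs} epochs, lr={stage.lr:g}, trains {stage.trainable}")
```

Note the order-of-magnitude drop in learning rate between the two editing stages, consistent with Stage 2 being described as fine-tuning.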

The diffusion objective is the standard form

$\mathcal{L}_\text{diff} = \mathbb{E}_{x, t, \epsilon}\big[ \| \epsilon_\theta(x_t, t, c) - \epsilon \|_2^2 \big],$

with the CoT imagination used purely as conditioning text rather than as a supervised output. By construction, the model is implicitly trained to interpret GLP as layout constraints, LOP as appearance constraints, EAL as localization cues, and EII as final-state specification (Zeng et al., 2 Sep 2025).
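A Monte Carlo estimate of this noise-prediction objective can be sketched in a few lines. This is a toy illustration under loud assumptions: the noising rule $x_t = \sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = 1 - t$ is a placeholder schedule chosen here, not one taken from the paper, and `eps_model` is any callable standing in for the conditioned generator.

```python
import math
import random

def diffusion_loss(eps_model, samples, rng):
    """Average of ||eps_model(x_t, t, c) - eps||^2 over (x, c) pairs."""
    total = 0.0
    for x, c in samples:
        t = rng.uniform(0.05, 0.95)
        alpha_bar = 1.0 - t                       # toy schedule (assumption)
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        x_t = [math.sqrt(alpha_bar) * xi + math.sqrt(1.0 - alpha_bar) * ei
               for xi, ei in zip(x, eps)]
        eps_pred = eps_model(x_t, t, c)           # c is the conditioning text
        total += sum((p - e) ** 2 for p, e in zip(eps_pred, eps))
    return total / len(samples)

# With all-zero data, x_t = sqrt(t) * eps, so this predictor recovers eps exactly.
def oracle(x_t, t, c):
    return [v / math.sqrt(t) for v in x_t]

def predict_zero(x_t, t, c):
    return [0.0] * len(x_t)

batch = [([0.0] * 4, "instruction plus CoT blueprint") for _ in range(8)]
print(diffusion_loss(oracle, batch, random.Random(0)))        # near zero
print(diffusion_loss(predict_zero, batch, random.Random(0)))  # clearly positive
```

The conditioning string enters only as an argument to the predictor, which mirrors the point that the CoT text is an input and never a prediction target.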

At inference time, an external designer such as GPT-4o, Qwen2.5-VL-3B, or Qwen2.5-VL-7B generates the four-step CoT blueprint without access to a target image. The source image and CoT-augmented instruction are processed by the frozen Qwen2.5-VL-3B, the MLP projects the resulting multimodal tokens, and SANA1.5-1.6B performs the edit. The paper attributes locality and consistency to explicit EAL in the CoT, detailed preserved-versus-modified descriptions in LOP and EII, and the fact that the DiT always sees the original image in the input stack (Zeng et al., 2 Sep 2025).
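The inference-time data flow can be summarized as a small orchestration function. All four callables and the function name `dim_edit` are hypothetical placeholders for the external designer, the frozen understanding module, the connector, and the generator; the stubs below exist only to make the sketch runnable and do no real image processing.

```python
def dim_edit(src_image, instruction, designer, understand, connector, generator):
    """Designer writes the blueprint; the generator only paints."""
    blueprint = designer(src_image, instruction)   # sees no target image
    full_text = f"{instruction}\n{blueprint}"      # separator is an assumption
    h = understand(src_image, full_text)           # frozen understanding module
    z = connector(h)                               # two-layer MLP projection
    return generator(src_image, z)                 # source image is always an input

# Stub components, used only to trace the flow of data.
calls = []

def stub_designer(image, instruction):
    calls.append("designer")
    return "GLP: ... LOP: ... EAL: ... EII: ..."

def stub_understand(image, text):
    calls.append("understand")
    return text.split()

def stub_connector(tokens):
    calls.append("connector")
    return [len(tok) for tok in tokens]

def stub_generator(image, cond):
    calls.append("generator")
    return {"source": image, "num_condition_tokens": len(cond)}

result = dim_edit("cat.png", "make the cat orange",
                  stub_designer, stub_understand, stub_connector, stub_generator)
print(calls)
print(result)
```

Because the designer is just an argument, swapping GPT-4o for a Qwen2.5-VL model changes one callable and leaves the rest of the pipeline untouched.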

5. Empirical performance and ablation findings

DIM-4.6B-T2I is reported as a competitive T2I model in its own right, and DIM-4.6B-Edit is reported as state-of-the-art or competitive on public editing benchmarks (Zeng et al., 2 Sep 2025). The central quantitative results are summarized below.

| Benchmark | DIM result | Comparator context |
| --- | --- | --- |
| GenEval Overall | 0.77 | BAGEL 0.82; Janus-Pro-7B 0.80; MetaQuery-L 0.78 |
| MJHQ-30K FID (lower is better) | 5.50 | MetaQuery-L 6.35; PixArt-α 6.14; SDXL 8.76 |
| ImgEdit Overall | 3.67 | UniWorld-V1 3.26; Step1X-Edit 3.06; GPT-4o-Image 4.20 |
| GEdit-Bench-EN Full set | 6.18 | Step1X-Edit 6.44; OmniGen 5.01; UniWorld-V1 4.85 |

On T2I benchmarks, DIM-4.6B-T2I with an LLM rewriter obtains GenEval Overall 0.77 and MJHQ-30K FID 5.50, which is the best reported FID in the comparison table (Zeng et al., 2 Sep 2025). The paper also notes particularly strong GenEval component scores in Two Objects, Position, and Color Attribute, supporting the claim that long-context training on DIM-T2I improves compositionality.

On ImgEdit, DIM-4.6B-Edit reaches 3.49 with Qwen2.5-VL-3B as external designer, 3.55 with Qwen2.5-VL-7B, and 3.67 with GPT-4o (Zeng et al., 2 Sep 2025). These scores exceed Step1X-Edit at 3.06, BAGEL at 3.20, UniWorld-V1 at 3.26, and Janus-4o at 3.19, while remaining below GPT-4o-Image at 4.20. The paper emphasizes that this is achieved with 3.0B frozen parameters and 1.6B trainable parameters, roughly an order of magnitude fewer trainable parameters than Step1X-Edit and UniWorld-V1.

On GEdit-Bench-EN, DIM-4.6B-Edit scores 6.18 on the full set, close to Step1X-Edit’s 6.44 and above other out-of-domain open-source editors (Zeng et al., 2 Sep 2025). Excluding Text Change, the picture shifts: Semantic Consistency is 7.08 for DIM versus 7.07 for Step1X-Edit, Perceptual Quality is 6.71 versus 6.91, and the overall average is 6.50 for DIM versus 6.35 for Step1X-Edit. The paper explicitly identifies Text Change as the main weakness because DIM-Edit contains little text-change-focused data.

The ablations support the centrality of CoT supervision. On ImgEdit, full CoT reaches 3.67, while removing GLP yields 3.49, removing LOP yields 3.39, removing EAL yields 3.42, and removing EII yields 3.35 (Zeng et al., 2 Sep 2025). The paper concludes that GLP has the smallest impact, while LOP, EAL, and EII each contribute significantly. Data-composition ablations likewise show that CoT versions of datasets are more useful than non-CoT counterparts, and that combining semantically rich CoT with human-quality edits and consistent AI edits gives the best overall result.
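The size of each component's contribution follows directly from the ImgEdit scores quoted above. The sketch below only subtracts the reported numbers and ranks the resulting drops; it adds no new measurements, and the variable names are illustrative.

```python
FULL_COT = 3.67  # ImgEdit overall with the full four-step blueprint

# ImgEdit overall after removing one component, as quoted in the text.
ablated = {"GLP": 3.49, "LOP": 3.39, "EAL": 3.42, "EII": 3.35}

drops = {name: round(FULL_COT - score, 2) for name, score in ablated.items()}
ranked = sorted(drops, key=drops.get, reverse=True)

for name in ranked:
    print(f"removing {name}: -{drops[name]:.2f}")
```

The ranking places EII first and GLP last, with the GLP drop a little over half the EII drop, which matches the paper's reading that GLP matters least.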

6. Relation to neighboring T2I and image-editing research

DIM-4.6B-T2I/Edit belongs to a wider family of methods that try to improve controllability, compositionality, and edit precision without relying only on larger generators. Its most direct conceptual neighbor is planning-based generation and editing. “GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis” decomposes complex T2I into Generate, Plan, and Edit stages, using an MLLM planner to diagnose failures and produce atomic edit plans (Goswami et al., 2024). A plausible implication is that DIM may be read as a training-time internalization of the planning burden that GraPE externalizes at inference time.

It also differs from control-centric approaches. “T2I-Adapter” augments frozen diffusion backbones with lightweight adapters for structure, color, and other external conditions (Mou et al., 2023), while “Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions” modifies token-level CLIP embeddings to produce continuous subject-specific attribute control without changing the diffusion model itself (Baumann et al., 2024). DIM instead relocates high-level design reasoning into the understanding module, using textual blueprints rather than external control maps or semantic directions as the primary mechanism.

In evaluation terms, DIM’s long-context and CoT-based formulation suggests clear interfaces with recent benchmarks. TIIF-Bench evaluates fine-grained instruction following across 5000 short and long prompts, including text rendering and style control (Wei et al., 2 Jun 2025), while T2I-ReasonBench evaluates reasoning-informed generation across idioms, textual image design, entity reasoning, and scientific reasoning (Sun et al., 24 Aug 2025). A plausible implication is that both are natural stress tests for DIM, because DIM-T2I emphasizes long-context prompt understanding and DIM-Edit explicitly injects multi-step reasoning into conditioning.

The same applies to fairness and semantic fidelity audits. DIVBENCH separates under-diversification from over-diversification by conditioning diversity on contextual appropriateness (Friedrich et al., 2 Jul 2025), and OASIS measures stereotypes against real-world reference distributions rather than statistical parity (Dehdashtian et al., 1 Jan 2025). A plausible implication is that a model like DIM would benefit from these audits, especially because precise editing systems can either preserve or inadvertently amplify latent demographic associations when they localize edits.

7. Limitations, open problems, and future directions

The main limitations stated for DIM-4.6B-T2I/Edit are dependence on CoT quality, weakness on Text Change, finite data coverage, and its intentionally compact scale (Zeng et al., 2 Sep 2025). Performance at inference depends on the quality of the external designer’s chain-of-thought; although the model is robust across GPT-4o and Qwen2.5-VL designers, poor CoT can still degrade edit precision. Text Change remains a challenging subtask because DIM-Edit contains little text-change-focused data. The datasets, although large and diverse, are still limited to the domains covered by web images and the specific editing corpora used to construct DIM-Edit.

The scaling question is left open. The paper notes that performance may improve by scaling the generative backbone, scaling the DIM datasets, using more powerful external designers, or integrating CoT generation internally rather than relying on an external designer at inference time (Zeng et al., 2 Sep 2025). It also suggests extending Draw-In-Mind to other modalities, specifically video editing and 3D editing.

A plausible implication is that such extensions would require more than textual blueprinting alone. For temporally consistent video editing, for example, separate temporal modeling mechanisms such as a dedicated temporal UNet and spatial-temporal fusion units, as in “Edit Temporal-Consistent Videos with Image Diffusion Model” (Wang et al., 2023), would likely be necessary. Likewise, for large-scale open fine-tuning, datasets such as Fine-T2I—with over 6 million high-quality text-image pairs and explicit task and style coverage—suggest a data-centric route toward improving instruction adherence before edit specialization (Ma et al., 10 Feb 2026).

Taken together, DIM-4.6B-T2I/Edit represents a specific answer to a recurring systems question in multimodal generation: whether precise editing is primarily a rendering problem or a planning problem. The Draw-In-Mind results support the latter view. In this formulation, the decisive variable is not only generator scale, but also whether the model is taught to think and describe before it paints (Zeng et al., 2 Sep 2025).