Draw-In-Mind (DIM): Unified Multimodal Editing

Updated 4 July 2026

Draw-In-Mind (DIM) is a unified multimodal framework that reassigns explicit chain-of-thought design to the understanding module for precise image editing.
It leverages paired datasets DIM-T2I and DIM-Edit to provide rich annotations and blueprint-driven planning for accurate scene layout and localized modifications.
The architecture integrates a frozen Qwen2.5-VL backbone with a trainable SANA module via a lightweight two-layer MLP, outperforming larger models on key editing benchmarks.

Draw-In-Mind (DIM) is a unified multimodal framework for precise image editing that reassigns the division of labor between understanding and generation. Its central claim is that contemporary unified systems are often imbalanced: the understanding module mainly translates an instruction into semantic conditions, while the generation module must simultaneously infer scene layout, identify the target editing region, preserve unchanged content, and render the edited result. DIM addresses this by making the understanding module responsible for design through chain-of-thought imagination, so that the generation module can focus on painting. The framework comprises a paired data regime (DIM-T2I and DIM-Edit) and a compact model family, DIM-4.6B-T2I/Edit, obtained by connecting a frozen Qwen2.5-VL-3B to a trainable SANA1.5-1.6B with a lightweight two-layer MLP (Zeng et al., 2 Sep 2025).

1. Problem formulation and modeling principle

DIM is motivated by a specific diagnosis of failure in unified multimodal generation/editing systems. In the standard pipeline, an MLLM or semantic encoder converts the user instruction into latent conditions, and a generative model performs the edit. The paper argues that this leaves the generative model with several hard tasks at once: inferring the original scene structure, understanding object relationships, deciding which region to modify, preserving unchanged regions, and synthesizing the new content. The proposed remedy is not merely more parameters, but a redistribution of responsibility: the understanding side should perform explicit reasoning about the intended edit before generation begins (Zeng et al., 2 Sep 2025).

The “draw-in-mind” principle is therefore a blueprint-first formulation of image editing. DIM states that precise editing improves when the model explicitly reasons about the overall scene layout, local object details, which regions need editing, and what the edited result should look like. In this formulation, the understanding module no longer acts only as a translator. It becomes a designer that produces an intermediate textual plan, and the generation module is relieved of having to infer the full edit specification implicitly.

This design choice is conceptually important because the paper treats editing errors as a consequence of misplaced reasoning rather than of insufficient scale alone. A plausible implication is that DIM belongs to a broader class of methods that improve multimodal generation by restructuring intermediate representations, rather than by only enlarging the backbone.

2. DIM-T2I and DIM-Edit

DIM is built on a dataset with two complementary subsets: a long-context text-to-image corpus for instruction comprehension and a chain-of-thought editing corpus for explicit edit planning (Zeng et al., 2 Sep 2025).

Subset	Size	Primary role
DIM-T2I	14 million image-text pairs	Strengthen long-context comprehension
DIM-Edit	233K image-editing examples	Provide explicit chain-of-thought imaginations

DIM-T2I is constructed from real-world web images with resolutions above 512×512 and is annotated using internal models across 21 dimensions. These dimensions are: Character Name, Scene Description, Actions and Interactions, Context and Environment, Emotion and Sentiment, Relationships and Spatial Arrangement, Color and Texture, Symbolism or Abstract Interpretation, Lighting and Shadows, Details and Fine Elements, Perspective and Composition, Time and Season, Target Audience, OCR, Person Description, Mathematics, Information Extraction, Planning, Science, Perception, and Metrics. The resulting long-context prompts have an average length of 146.76 words. The paper emphasizes that the dataset is collected directly from the real world without aesthetic post-filtering, and argues that rich textual annotation can compensate for the absence of heavy visual curation.

DIM-Edit is assembled from three sources. UltraEdit-160K-CoT contributes 160K highly consistent edit pairs selected using joint filtering with SSIM, DINOv2 similarity, and CLIP similarity. ShareGPT-4o-Image-CoT contributes 46K semantically rich editing samples. HumanEdit-CoT contributes 8K MagicBrush training examples and 19K SEED-Data-Edit-Part3 examples, with particular value for remove operations that are underrepresented in raw UltraEdit.
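The joint filtering step can be pictured as requiring every candidate pair to clear all three similarity checks at once. The following is a minimal sketch under stated assumptions: the scores are taken as precomputed, higher is treated as more consistent, and the `EditPair` fields and threshold values are hypothetical rather than taken from the paper. The last lines simply check that the reported subset sizes add up to the 233K total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditPair:
    """One candidate edit pair with precomputed scores (hypothetical fields)."""
    pair_id: str
    ssim: float  # structural similarity score
    dino: float  # DINOv2 feature similarity
    clip: float  # CLIP similarity


# Illustrative thresholds only; the actual cut-offs are not given in the text.
MIN_SSIM, MIN_DINO, MIN_CLIP = 0.5, 0.7, 0.8


def passes_joint_filter(p: EditPair) -> bool:
    """Keep a pair only if it clears all three thresholds simultaneously."""
    return p.ssim >= MIN_SSIM and p.dino >= MIN_DINO and p.clip >= MIN_CLIP


candidates = [
    EditPair("a", ssim=0.82, dino=0.91, clip=0.88),
    EditPair("b", ssim=0.30, dino=0.95, clip=0.90),  # fails the SSIM check
    EditPair("c", ssim=0.75, dino=0.60, clip=0.85),  # fails the DINOv2 check
]
kept = [p.pair_id for p in candidates if passes_joint_filter(p)]

# Subset sizes in thousands, as reported in the text.
subset_sizes_k = {
    "UltraEdit-160K-CoT": 160,
    "ShareGPT-4o-Image-CoT": 46,
    "HumanEdit-CoT (MagicBrush)": 8,
    "HumanEdit-CoT (SEED-Data-Edit-Part3)": 19,
}
total_k = sum(subset_sizes_k.values())
```

A conjunctive filter of this kind trades recall for consistency: a pair that scores well on two metrics but poorly on the third is still dropped.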

The construction pipeline for DIM-Edit is centered on GPT-4o. Starting from a raw edit pair and raw prompt, GPT-4o judges prompt–edit alignment and classifies the sample as Misaligned, Partially aligned, or Aligned. Misaligned samples are discarded; partially aligned prompts are refined by adding missing details; aligned prompts are further refined to remove ambiguity and increase precision. The optimized prompt and source image are then given to GPT-4o, which produces a four-step chain-of-thought imagination:

Global Layout Perception (GLP): identify and describe the main objects and their positions.
Local Object Perception (LOP): describe the appearance of each object or background region in detail.
Edit Area Localization (EAL): state which regions or objects will be modified.
Edited Image Imagination (EII): describe what the edited image should look like after modification.

These four stages are the explicit design blueprints used by DIM. They specify what should be preserved, what should be altered, and what the target result should look like, thereby moving the design burden away from the generator.
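The triage rule and the four-stage blueprint described above can be sketched as plain data structures. This is an illustrative sketch only: the class names, the `render` format, and the example texts are assumptions, and the actual judging and refinement are performed by GPT-4o rather than by code like this.

```python
from dataclasses import dataclass
from enum import Enum


class Alignment(Enum):
    MISALIGNED = "misaligned"
    PARTIAL = "partially aligned"
    ALIGNED = "aligned"


class Action(Enum):
    DISCARD = "discard sample"
    ADD_MISSING_DETAILS = "refine prompt by adding missing details"
    REMOVE_AMBIGUITY = "refine prompt to remove ambiguity"


# Triage rule: what happens to a sample after the alignment verdict.
NEXT_STEP = {
    Alignment.MISALIGNED: Action.DISCARD,
    Alignment.PARTIAL: Action.ADD_MISSING_DETAILS,
    Alignment.ALIGNED: Action.REMOVE_AMBIGUITY,
}


@dataclass(frozen=True)
class Blueprint:
    """Four-stage chain-of-thought imagination for one edit."""
    glp: str  # Global Layout Perception
    lop: str  # Local Object Perception
    eal: str  # Edit Area Localization
    eii: str  # Edited Image Imagination

    def render(self) -> str:
        """Serialize the stages in order (hypothetical text format)."""
        parts = [
            ("Global Layout Perception", self.glp),
            ("Local Object Perception", self.lop),
            ("Edit Area Localization", self.eal),
            ("Edited Image Imagination", self.eii),
        ]
        return "\n".join(f"{name}: {text}" for name, text in parts)


bp = Blueprint(
    glp="A red mug sits on a wooden table, to the left of a laptop.",
    lop="The mug is glossy ceramic; the table shows a visible grain.",
    eal="Only the mug will be modified.",
    eii="The same scene with a blue mug; table and laptop unchanged.",
)
rendered = bp.render()
```

Keeping the stages as separate fields makes it explicit which text describes preserved content (GLP, LOP) and which describes the change (EAL, EII).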

3. Architecture and training regime

The DIM-4.6B model family combines a frozen multimodal understanding backbone with a trainable diffusion generator. The understanding module is Qwen2.5-VL-3B, the generation module is SANA1.5-1.6B, and the connector is a lightweight two-layer MLP that directly maps multimodal tokens from the understanding module into the generation space (Zeng et al., 2 Sep 2025).
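To make the connector concrete, the sketch below implements a two-layer MLP that maps each understanding-side token vector to a generator-side condition vector. It is a toy in pure Python under stated assumptions: the GELU activation, the initialization, the class name `TwoLayerConnector`, and the tiny dimensions are all illustrative choices not taken from the paper, and a real implementation would use a tensor library.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU; the choice of activation is an assumption."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(n_in: int, n_out: int, rng: random.Random):
    """Small random weights and zero biases for one linear layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Compute y = W x + b for a single token vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class TwoLayerConnector:
    """Linear -> GELU -> Linear, applied independently to every token."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int, seed: int = 0):
        rng = random.Random(seed)
        self.l1 = init_linear(d_in, d_hidden, rng)
        self.l2 = init_linear(d_hidden, d_out, rng)

    def __call__(self, tokens):
        out = []
        for tok in tokens:
            hidden = [gelu(v) for v in linear(tok, *self.l1)]
            out.append(linear(hidden, *self.l2))
        return out


# Toy sizes; the hidden widths of the real models are much larger.
connector = TwoLayerConnector(d_in=8, d_hidden=16, d_out=6)
tokens = [[0.1 * i for i in range(8)] for _ in range(3)]  # 3 tokens, width 8
conditions = connector(tokens)
```

Because the map is applied token by token, the sequence length is preserved and only the feature width changes, which is what lets the generator consume the output as conditioning.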

The paper presents this connector choice as intentionally simple. It contrasts the design with MetaQuery-like systems that use a larger connector, arguing instead for preserving the pretrained reasoning capacity of the frozen MLLM while keeping the bridge compact. The stated advantages are that the design preserves pretrained understanding ability and avoids conflicts associated with tighter coupling. The paper also takes it as a demonstration that improved data and responsibility assignment can outperform brute-force scaling.

The training strategy is staged.

For DIM-4.6B-T2I, the connector is first warmed up for 1 epoch with learning rate $2\times10^{-5}$. Then the connector and SANA are jointly trained for 8 epochs with batch size 256 and the same learning rate. The training set includes DIM-T2I plus an additional 6.9M image-text pairs from MidJourney-V6, COCO, InstructP2P, JourneyDB, HQ-Edit, and Dimba. Qwen2.5-VL-3B remains frozen throughout.

For DIM-4.6B-Edit, training also proceeds in two stages. Stage 1 trains on UltraEdit for 10 epochs with learning rate $1\times10^{-4}$ to establish basic editing ability. Stage 2 fine-tunes on DIM-Edit for 50 epochs with learning rate $1\times10^{-5}$ to learn blueprint-conditioned precise editing.
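The staged schedule above can be summarized as a small configuration table. This is a sketch for orientation, not the authors' training code: the `Stage` class and its field names are hypothetical, values the text does not give are left as `None`, and the steps-per-epoch figure uses the rounded dataset sizes and assumes one optimizer step per batch.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    model: str
    name: str
    epochs: int
    lr: float
    trainable: Optional[Tuple[str, ...]] = None  # None: not stated in the text
    batch_size: Optional[int] = None             # None: not stated in the text


SCHEDULE = [
    Stage("DIM-4.6B-T2I", "connector warm-up", epochs=1, lr=2e-5,
          trainable=("connector",)),
    Stage("DIM-4.6B-T2I", "joint training", epochs=8, lr=2e-5,
          trainable=("connector", "SANA"), batch_size=256),
    Stage("DIM-4.6B-Edit", "stage 1 (UltraEdit)", epochs=10, lr=1e-4),
    Stage("DIM-4.6B-Edit", "stage 2 (DIM-Edit)", epochs=50, lr=1e-5),
]

# Approximate T2I training set: 14M DIM-T2I pairs plus 6.9M additional pairs.
approx_pairs = 14_000_000 + 6_900_000
joint = SCHEDULE[1]
steps_per_epoch = math.ceil(approx_pairs / joint.batch_size)

for stage in SCHEDULE:
    print(f"{stage.model:14s} {stage.name:22s} epochs={stage.epochs:<3d} lr={stage.lr:.0e}")
```

With these rounded sizes the joint stage comes to about 81.6K steps per epoch, which gives a sense of scale rather than an exact count.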

The paper does not introduce a complex new loss formula. Its operational objective is standard conditional image generation/editing: generate the target image from the source image and the instruction/blueprint. At inference time, GPT-4o serves as the external designer that produces the blueprint format, but does so without seeing the target image. This dependence on an external designer is a defining property of the framework rather than an incidental implementation detail.
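The inference-time flow described here (external designer, frozen understanding module, connector, generator) can be sketched as a function that wires four callables together. All names and signatures below are hypothetical stand-ins, and the way the instruction and blueprint are combined into one text input is an assumption; the stubs exist only to show the order of operations.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def edit_image(
    source_image: object,
    instruction: str,
    designer: Callable[[object, str], str],
    understand: Callable[[object, str], Sequence[Vector]],
    connect: Callable[[Sequence[Vector]], Sequence[Vector]],
    paint: Callable[[object, Sequence[Vector]], object],
) -> object:
    """Blueprint-first editing: design in text first, then render."""
    # The designer sees only the source image and instruction, never the target.
    blueprint = designer(source_image, instruction)
    # Frozen understanding module encodes the image plus instruction and blueprint.
    tokens = understand(source_image, f"{instruction}\n{blueprint}")
    # Lightweight connector maps tokens into the generator's condition space.
    conditions = connect(tokens)
    # Trainable generator renders the edited image from source and conditions.
    return paint(source_image, conditions)


calls: List[str] = []


def stub_designer(image, instruction):
    calls.append("design")
    return "EAL: the mug. EII: the same scene with a blue mug."


def stub_understand(image, text):
    calls.append("understand")
    return [[float(len(text))]]


def stub_connect(tokens):
    calls.append("connect")
    return [[2.0 * v for v in tok] for tok in tokens]


def stub_paint(image, conditions):
    calls.append("paint")
    return {"source": image, "conditions": conditions}


result = edit_image("mug.png", "make the mug blue",
                    stub_designer, stub_understand, stub_connect, stub_paint)
```

Passing the components as callables keeps the designer swappable, which matters because the framework's behavior depends on whichever external model fills that role.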

4. Empirical performance

DIM is evaluated in both text-to-image generation and image editing. The reported results position the model as strong at text-to-image generation and especially effective at image editing relative to larger open-source systems (Zeng et al., 2 Sep 2025).

Evaluation setting	Reported result	Comparison highlighted in the paper
GenEval	0.77 overall	Strong T2I performance
MJHQ-30K	5.50 FID	Strong T2I performance
ImgEdit	3.67 overall	Better than Step1X-Edit, UniWorld-V1, BAGEL, Janus-4o
GEdit-Bench-EN	6.18 overall	Competitive with Step1X-Edit and stronger than most open-source models

On ImgEdit, DIM-4.6B-Edit achieves 3.67 overall. The paper reports that this is higher than Step1X-Edit at 3.06, UniWorld-V1 at 3.26, BAGEL at 3.20, and Janus-4o at 3.19. It also notes that DIM-4.6B-Edit substantially narrows the gap to GPT-4o-Image, which scores 4.20 overall. This comparison is a central empirical claim because DIM uses only 1.6B trainable generative parameters, while several comparator systems rely on much larger backbones.
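For a quick view of the margins implied by these ImgEdit numbers, the short calculation below subtracts each comparator's overall score from DIM-4.6B-Edit's. It uses only the scores quoted above; the variable names are illustrative.

```python
# ImgEdit overall scores as quoted in the text.
imgedit_overall = {
    "DIM-4.6B-Edit": 3.67,
    "Step1X-Edit": 3.06,
    "UniWorld-V1": 3.26,
    "BAGEL": 3.20,
    "Janus-4o": 3.19,
    "GPT-4o-Image": 4.20,
}

dim_score = imgedit_overall["DIM-4.6B-Edit"]

# Positive margin: DIM scores higher; negative: the comparator scores higher.
margins = {
    name: round(dim_score - score, 2)
    for name, score in imgedit_overall.items()
    if name != "DIM-4.6B-Edit"
}

for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {margin:+.2f}")
```

The margins over the open-source comparators range from 0.41 to 0.61 points, while the remaining gap to GPT-4o-Image is 0.53.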

On GEdit-Bench-EN, the full-set overall score is 6.18 for DIM-4.6B-Edit, compared with 6.44 for Step1X-Edit, 4.85 for UniWorld-V1, and 6.98 for GPT-4o. The paper states an important qualification: Step1X-Edit has a notable advantage on Text Change, a task absent from DIM-Edit. When that task is excluded, DIM-4.6B-Edit surpasses Step1X-Edit. This is presented as evidence that the CoT editing data is effective, while also showing that dataset coverage remains consequential.

The text-to-image results are used to argue that the long-context DIM-T2I corpus is more than a support dataset for editing. A GenEval overall score of 0.77 and an MJHQ-30K FID of 5.50 are cited as evidence that DIM-T2I forms a strong foundation for unified generation and editing.

5. Ablations, qualitative evidence, and interpretation

The ablation studies are designed to test the paper’s core thesis that explicit blueprints improve editing precision by moving reasoning into the understanding module (Zeng et al., 2 Sep 2025).

The data composition ablation shows distinct roles for the constituent corpora. Training only on ShareGPT-4o-Image yields good semantic alignment but can distort layout. Training only on UltraEdit preserves edit consistency better but is less semantically rich. Combining them improves results, and using the CoT versions improves performance further. The full DIM-Edit mixture yields the best overall outcome. This supports the view that semantic richness, edit consistency, and human-edited examples contribute different capabilities, and that the explicit blueprint formulation integrates them most effectively.

The CoT component ablation removes the individual blueprint stages GLP, LOP, EAL, and EII. All removals reduce performance. The paper identifies LOP, EAL, and EII as especially important, whereas GLP is less critical because the generator can handle global layout more easily. The interpretation offered is direct: the more specific reasoning is moved into the blueprint, the better the generator performs.
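One way to picture the component ablation is as a leave-one-out construction over the four blueprint stages. The sketch below builds the ablated variants from a blueprint stored as a dictionary; it illustrates the idea only and is not the paper's ablation protocol, whose details are not given here. The function name and example texts are hypothetical.

```python
from typing import Dict

STAGES = ("GLP", "LOP", "EAL", "EII")


def leave_one_out(blueprint: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Return one ablated blueprint per stage, each missing exactly that stage."""
    return {
        dropped: {k: v for k, v in blueprint.items() if k != dropped}
        for dropped in STAGES
    }


full = {
    "GLP": "A red mug sits on a wooden table, to the left of a laptop.",
    "LOP": "The mug is glossy ceramic; the table shows a visible grain.",
    "EAL": "Only the mug will be modified.",
    "EII": "The same scene with a blue mug; table and laptop unchanged.",
}
variants = leave_one_out(full)

for dropped, ablated in variants.items():
    print(f"without {dropped}: keeps {', '.join(ablated)}")
```

Training or evaluating on each variant in turn isolates how much a single stage contributes, which is the comparison the ablation reports.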

The qualitative comparisons reinforce the quantitative results. The paper states that Janus-4o can distort images badly even when trained on GPT-4o-like data, Step1X-Edit can produce less natural edits and struggle with complex multi-object instructions, and DIM-4.6B-Edit produces more natural, instruction-faithful edits. A plausible implication is that DIM’s advantage derives less from architectural novelty in the diffusion backbone than from the explicit structuring of the intermediate representation.

The paper also uses these findings to argue against a size-only explanation of editing quality. Its smaller system outperforms substantially larger ones, which it interprets as evidence that where reasoning occurs can matter as much as how much model capacity is available.

6. Scope, limitations, and disambiguation

DIM has several stated or implied limitations. It relies on GPT-4o to generate high-quality blueprints during both training and inference, so its strongest behavior depends on access to an external designer or an equivalent instruction-refinement mechanism. The task distribution in DIM-Edit is uneven: Text Change is absent from the dataset, and performance on that category suffers accordingly. The model also uses a relatively simple connector and backbone combination, leaving open the possibility that better blueprint generation or broader task coverage could further improve performance (Zeng et al., 2 Sep 2025).

The name DIM also requires disambiguation in the literature. In the image-generation domain, “Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models” introduces DrUM, not DIM, for personalized text-to-image generation without fine-tuning the diffusion model itself (Kim et al., 5 Aug 2025). The earlier “DRAW: A Recurrent Neural Network For Image Generation” defines DRAW as Deep Recurrent Attentive Writer and does not define a method called DIM (Gregor et al., 2015). A separate 2025 paper, “Visual Theory of Mind Enables the Invention of Proto-Writing,” uses Draw-In-Mind to denote a mechanism of visual theory of mind within a Signification Game rather than an image-editing architecture (Spiegel et al., 3 Feb 2025). Elsewhere, DIM denotes Domain-Informed Monotonicity in tabular deep learning (Salim et al., 25 Sep 2025) and also refers to the Ding–Iohara–Miki algebra in mathematical physics (Ghoneim et al., 2020). In current multimodal image-editing usage, however, Draw-In-Mind most precisely denotes the blueprint-based editing framework built around DIM-T2I, DIM-Edit, and DIM-4.6B-T2I/Edit (Zeng et al., 2 Sep 2025).