
DreamOmni3: Scribble-Based Image Editing

Updated 31 December 2025
  • DreamOmni3 is a transformer-based diffusion framework that integrates freehand scribble control with multimodal inputs for fine-grained image editing and generation.
  • It employs a joint input encoding scheme to harmonize spatial and semantic features from images, text, and reference cues, enhancing localization and overall performance.
  • Supported by a large synthetic dataset with hand-drawn overlays, DreamOmni3 significantly advances benchmark accuracy in both scribble-based editing and generation tasks.

DreamOmni3 is a transformer-based diffusion framework for image editing and generation that introduces scribble-based user interaction to enable fine-grained, location-specific control beyond text-only prompts. By allowing freehand specification of editing regions and blending user scribbles, text, and reference images, DreamOmni3 expands the capabilities of unified multimodal models to cover a broader spectrum of creative and practical image manipulation tasks. The model is supported by a large synthetic dataset constructed with overlays of hand-drawn symbols and doodles, and features a novel joint input encoding paradigm to harmonize spatial and semantic localization across modalities (Xia et al., 27 Dec 2025).

1. Task Taxonomy: Scribble-Based Editing and Generation

DreamOmni3 formalizes two primary categories of image tasks—scribble-based editing and scribble-based generation—each subdivided by input modalities and operational goals.

  • Scribble-Based Editing
    • Scribble + Multimodal-Instruction Editing: Source image overlaid with a scribble marking a region, optional reference image with marked target object, and textual instruction. The model modifies the indicated region to match reference semantics/context as specified (e.g., replace a circled car with a sports car from the reference).
    • Scribble + Instruction-Only Editing: Source image with a region marked by scribble, plus a text instruction. Example use: “Make the circled window larger.”
    • Image Fusion: Source image (scribble or automatic mask localizes insertion), reference image (object to extract), and text. The system pastes an extracted object into the marked location with harmonized appearance.
    • Doodle Editing: User draws an abstract doodle over a region in the source image; text instruction guides the conversion of the doodle into a photorealistic object inserted into the scene.
  • Scribble-Based Generation
    • Scribble + Multimodal-Instruction Generation: On a blank canvas, the user marks positions with scribbles, supplies an optional reference image with target object marked, and provides a text prompt. The object is synthesized into the designated location, guided by style/context as needed.
    • Scribble + Instruction-Only Generation: Canvas marked with a scribble and accompanied by a descriptive prompt specifying an object to generate (e.g., “Generate a bicycle at the circled spot”).
    • Doodle Generation: Converts a user’s abstract line drawing and description into a detailed, contextually appropriate scene or object.

This expanded grammar of input and instruction modalities allows nuanced spatial and semantic edits with explicit region selection, advancing beyond the constraints of traditional language-based editing.
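
To make the taxonomy concrete, the sketch below expresses these task variants as a simple request schema; the enum values, field names, and file names are illustrative assumptions rather than part of DreamOmni3’s released interface.

```python
# Hypothetical request schema mirroring the task taxonomy above.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScribbleTask(Enum):
    EDIT_MULTIMODAL = "scribble_multimodal_edit"        # scribble + reference + text
    EDIT_INSTRUCTION = "scribble_instruction_edit"      # scribble + text only
    IMAGE_FUSION = "image_fusion"                       # paste a reference object at the scribble
    DOODLE_EDIT = "doodle_edit"                         # doodle drawn over the source image
    GEN_MULTIMODAL = "scribble_multimodal_generation"   # blank canvas + reference + text
    GEN_INSTRUCTION = "scribble_instruction_generation" # blank canvas + text only
    DOODLE_GENERATION = "doodle_generation"             # doodle + description on a blank canvas


@dataclass
class ScribbleRequest:
    task: ScribbleTask
    instruction: str                       # e.g. "Make the circled window larger."
    source_image: Optional[str] = None     # None for blank-canvas generation tasks
    scribbled_image: Optional[str] = None  # the same view with the user's scribble/doodle overlaid
    reference_image: Optional[str] = None  # optional reference with the target object marked


# Example: Scribble + Instruction-Only Editing
request = ScribbleRequest(
    task=ScribbleTask.EDIT_INSTRUCTION,
    instruction="Make the circled window larger.",
    source_image="room.png",
    scribbled_image="room_scribbled.png",
)
```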

2. Data Synthesis and Dataset Construction

The DreamOmni3 dataset is derived from DreamOmni2’s multimodal corpus, augmented via a procedural pipeline designed to simulate authentic user input and diverse editing scenarios.

  • Editable-Region Extraction: Object masks and bounding boxes are programmatically extracted from each instance using Referseg, yielding precise spatial regions for cropping or overlay.
  • Hand-Drawn Symbol Overlays: Hand-drawn box and circle templates with naturalistic imperfections and a limited palette (red, green, blue) are randomly sampled from a prebuilt bank of 30 and resized to fit the identified regions. These overlays indicate regions of interest, serving as training scribbles.
  • Image Fusion Procedure: Objects are excised from reference images (via mask), resized, and composited into source images at the marked location; optional scribble overlays further emphasize the edit area and simulate user intent (see the compositing sketch at the end of this section).
  • Doodle Generation/Editing: Employs GPT-Image-1 to convert object crops into abstract doodles, deliberately eschewing simple edge detection to avoid uninformative, noisy sketches. These doodles serve either as starting points for image generation on a blank canvas or as edit instructions over an existing scene.
  • Statistics:
Task Category        Subtask                  Number of Samples
-------------------  -----------------------  -----------------
Scribble-Editing     Multimodal + scribble               32,000
                     Text + scribble                     14,000
                     Image fusion                        16,000
                     Doodle editing                       8,000
Scribble-Generation  Multimodal + scribble               29,000
                     Text + scribble                     10,000
                     Doodle generation                    8,000

This pipeline generates a comprehensive, large-scale resource for training and benchmarking scribble-mediated vision models.
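
The sketch below illustrates the scribble-overlay and image-fusion steps described above, assuming precomputed object masks (mode "L") and a small bank of hand-drawn scribble templates stored as transparent RGBA images; the paths, helper names, and sampling choices are illustrative, not the paper’s actual pipeline code.

```python
# Illustrative overlay and fusion steps of the data-synthesis pipeline.
import random
from PIL import Image

# Hypothetical template bank of 30 hand-drawn boxes/circles (transparent PNGs).
SCRIBBLE_TEMPLATES = [f"templates/scribble_{i:02d}.png" for i in range(30)]


def overlay_scribble(image: Image.Image, bbox: tuple[int, int, int, int]) -> Image.Image:
    """Paste a randomly sampled hand-drawn template over the editable region."""
    x0, y0, x1, y1 = bbox
    template = Image.open(random.choice(SCRIBBLE_TEMPLATES)).convert("RGBA")
    template = template.resize((x1 - x0, y1 - y0))
    out = image.convert("RGBA")
    out.alpha_composite(template, dest=(x0, y0))  # only the stroke is kept, thanks to alpha
    return out.convert("RGB")


def fuse_object(source: Image.Image, reference: Image.Image,
                ref_mask: Image.Image, dest_bbox: tuple[int, int, int, int]) -> Image.Image:
    """Cut the masked object out of the reference and composite it at dest_bbox in the source."""
    x0, y0, x1, y1 = dest_bbox
    region = ref_mask.getbbox()                          # tight box around the object mask
    obj = reference.crop(region).resize((x1 - x0, y1 - y0))
    alpha = ref_mask.crop(region).resize((x1 - x0, y1 - y0))
    out = source.copy()
    out.paste(obj, (x0, y0), alpha)                      # mask-limited paste of the object
    return out
```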

3. Model Architecture and Joint-Input Encoding

DreamOmni3 leverages the MM-DIT (multimodal diffusion transformer) architecture, conditioning on both image and text, and introduces a joint-input scheme for spatially precise region control.

  • Joint Input Scheme: Instead of a single binary mask, the model accepts both the original image ($I_0$) and a scribbled counterpart ($I_s$). Both versions are encoded in parallel, preserving unmarked pixel fidelity in $I_0$ and emphasizing user-specified regions in $I_s$.
  • Reference Handling: Reference images, if present, are fed as single inputs rather than being duplicated into joint pairs, avoiding redundant encoding and saving compute.
  • Index and Position Encoding: For each image input $k$ and patch $(u, v)$,

$$F_k(u,v) = E_{\text{patch}}(I_k)_{(u,v)} + E_{\text{pos}}(u, v) + E_{\text{idx}}(\text{role}(k))$$

with shared index/position embeddings for $I_0$ and $I_s$, enabling the transformer to spatially correlate features and distinctions across image modalities.

  • Color-Coding: Distinct scribbles (differing in color or stroke shape) implicitly encode different regions and can be referenced textually in instructions (e.g., “the red scribble”). Such cross-modal references are readily tokenized and resolved by the VLM’s attention mechanism.

The joint-input strategy, with unified embedding of original and marked images, yields improved localization and editing accuracy over traditional mask-processing methods.
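
A minimal sketch of this joint-input embedding is shown below, assuming a ViT-style patch embedding; the module names, dimensions, and role indexing are illustrative stand-ins rather than the released MM-DIT implementation.

```python
# Sketch of F_k(u, v) = E_patch(I_k)_(u,v) + E_pos(u, v) + E_idx(role(k)).
# I_0 and I_s share the same role index and position embedding so the
# transformer can align marked and unmarked pixels patch-by-patch.
import torch
import torch.nn as nn


class JointInputEncoder(nn.Module):
    def __init__(self, dim=1024, patch=16, grid=64, num_roles=2):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # E_patch
        self.pos_embed = nn.Parameter(torch.zeros(1, grid * grid, dim))        # E_pos (shared)
        self.idx_embed = nn.Embedding(num_roles, dim)                          # E_idx(role)

    def encode(self, img, role_id):
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        n = tokens.shape[1]
        return tokens + self.pos_embed[:, :n] + self.idx_embed.weight[role_id]

    def forward(self, i0, i_s, ref=None):
        feats = [self.encode(i0, role_id=0),       # original image I_0
                 self.encode(i_s, role_id=0)]      # scribbled counterpart I_s, same role/position
        if ref is not None:
            feats.append(self.encode(ref, role_id=1))  # reference image gets its own role index
        return torch.cat(feats, dim=1)             # token sequence fed to the MM-DIT blocks
```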

4. Training Objectives and Optimization

DreamOmni3 is trained end-to-end as a conditional diffusion model, applying the score-matching denoising objective typical of diffusion-based generative architectures:

$$L = \mathbb{E}_{t \sim \text{Uniform}(1..T),\, x_0,\, \epsilon \sim \mathcal{N}(0,I)} \left[ \|\epsilon - \epsilon_\theta(x_t, t \mid Z)\|^2 \right]$$

where $Z$ denotes the multimodal joint embedding computed from the concatenation of all image and text representations.

  • Training utilizes LoRA adapters (rank 256) atop a frozen MM-DIT / Qwen2.5-VL 7B backbone, allowing DreamOmni3 to inherit prior visual-linguistic capabilities while learning the new scribble-based controls without catastrophic forgetting.
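
The snippet below sketches this denoising objective in PyTorch, assuming a standard DDPM-style noise schedule; the signature of the conditional predictor eps_theta and the schedule handling are simplified assumptions rather than the actual MM-DIT training code, and the LoRA wiring is omitted.

```python
# Sketch of the denoising objective L above (epsilon-prediction form).
# `eps_theta` stands in for the conditional backbone; `Z` is the multimodal
# joint embedding of all image and text inputs.
import torch


def diffusion_loss(eps_theta, x0, Z, alphas_cumprod, T=1000):
    """Conditional score-matching loss: || eps - eps_theta(x_t, t | Z) ||^2."""
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)   # t ~ Uniform(1..T)
    eps = torch.randn_like(x0)                             # eps ~ N(0, I)
    a_bar = alphas_cumprod[t - 1].view(b, 1, 1, 1)         # cumulative noise schedule
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps     # forward noising of x_0
    return ((eps - eps_theta(x_t, t, Z)) ** 2).mean()
```

In training, only the rank-256 LoRA parameters would receive gradients from this loss while the backbone weights stay frozen.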

5. Benchmarking, Evaluation Metrics, and Results

The DreamOmni3 benchmark suite encompasses four scribble-editing and three scribble-generation tasks evaluated on real images.

  • Automatic Metrics: Vision-language models (VLMs), Gemini 2.5 for object/scene correctness and Doubao 1.6 for abstract attributes, estimate pass rates by validating object presence, fidelity, and compliance with instructions.
  • Human Evaluation: For each task, an edit counts as successful only if at least three out of five human raters (engineers) independently confirm it (see the pass-rate sketch at the end of this section).
Task                 DreamOmni3   GPT-4o   Nano Banana   DreamOmni2
-------------------  ----------   ------   -----------   ----------
Scribble Editing          57.5%    58.7%         41.3%        17.5%
Scribble Generation       53.5%    39.5%         23.3%         4.7%
  • Ablative Findings:
    • Employing the joint input (original plus scribbled image) rather than a single scribbled image raises the editing pass rate from 35.0% to 45.0%.
    • Comparing shared with separate index/position encodings confirms that the unified scheme yields better editing accuracy (45.0% vs. 37.5%).

These results demonstrate that DreamOmni3 achieves competitive or superior accuracy to contemporary large-scale commercial systems in human and VLM-assessed benchmarks, and markedly advances over prior open frameworks.
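
As a concrete reading of the evaluation protocol, the short sketch below computes a benchmark pass rate under the three-of-five rater rule; the record format is hypothetical.

```python
# Pass-rate computation for the human evaluation: a sample passes only if at
# least 3 of its 5 rater votes confirm the edit. Record format is illustrative.
def human_pass(votes: list[bool], threshold: int = 3) -> bool:
    """True if at least `threshold` raters independently confirmed the edit."""
    return sum(votes) >= threshold


def pass_rate(records: list[dict]) -> float:
    """Fraction of benchmark samples judged successful."""
    return sum(human_pass(r["votes"]) for r in records) / len(records)
```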

6. Limitations and Prospective Research

Several areas constrain DreamOmni3’s present capabilities and suggest directions for further study:

  • Gains from joint input are muted in pure generation tasks, indicating ongoing challenges in optimal scribble-to-image grounding.
  • The Referseg-based segmentation pipeline may introduce errors; mislabeled masks can impair scribble localization.
  • The dataset’s doodle inputs are limited to GPT-Image-1–generated abstractions; real user scribbles exhibit greater variability, motivating future collection of human-drawn scribble corpora.
  • Extending the paradigm to video editing and establishing interactive, iterative scribble-feedback loops remain open directions.

This suggests that increasing dataset diversity and refining semantic alignment between scribble regions and generated content may further improve performance and user control.

Summary

DreamOmni3 expands the landscape of unified editing/generation by defining new classes of tasks and user interfaces—incorporating freehand scribble input, synthetic data augmentation, and a transformer-based joint input encoding system. Benchmark analysis confirms DreamOmni3’s significant improvements in both editing accuracy and creative generation, positioning it as a notable advancement in controllable, multimodal AI image systems (Xia et al., 27 Dec 2025).
