DreamOmni3: A Multimodal Image Editing Framework
- DreamOmni3 is a multimodal framework that integrates text, images, and freehand scribbles for interactive image editing and generation.
- It builds on DreamOmni2 with dedicated LoRA adapters and advanced encoding alignment to achieve fine-grained control over multiple edit regions.
- The framework introduces a novel joint input scheme and benchmark suite that set new standards for scribble-based multimodal image manipulation.
DreamOmni3 is a multimodal framework designed for interactive image editing and generation tasks that utilize user-provided text, images, and freehand scribbles as joint instructions. The system introduces a joint input scheme that directly combines original and user-modified (scribbled) images, distinguishing edit or generation regions based on distinct brush colors in the scribbled input. DreamOmni3 is built atop the DreamOmni2 MM-DiT (multimodal diffusion-transformer) architecture, integrates the Qwen2.5-VL vision-language backbone, and employs a FLUX Kontext joint training strategy with LoRA adapters for editable and generative tasks. It synthesizes expansive and diverse datasets by leveraging automated region selection (“Referseg”) and compositing pipelines, and establishes new benchmarks and metrics for scribble-based multimodal image manipulation (Xia et al., 27 Dec 2025).
1. Architectural Design and Input Modality
DreamOmni3 extends the DreamOmni2 MM-DiT model with a unified multimodal diffusion-transformer backbone. The system uses Qwen2.5-VL as the vision-language foundation and augments it with the FLUX Kontext joint training scheme. It deploys two dedicated LoRA adapters (rank 256 each), one for editing and one for generation; because the new scribble-conditioned behavior is confined to these adapters, the base model retains its capability on unmodified (non-scribble) inputs.
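Conceptually, each adapter injects a low-rank update into the frozen backbone's projections. The following sketch (an illustrative assumption, not the released DreamOmni3 code) shows how a rank-256 LoRA adapter might wrap a frozen linear layer, with separate editing and generation adapters sharing one backbone; the layer width and scaling factor are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank update: y = Wx + (alpha / r) * B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 256, alpha: float = 256.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # the MM-DiT backbone stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Two task-specific adapters (editing vs. generation) wrapping the same frozen projection;
# the active adapter would be selected per task, and dropping both recovers base behavior.
proj = nn.Linear(3072, 3072)                  # hidden width is a placeholder
edit_adapter = LoRALinear(proj, rank=256)
gen_adapter = LoRALinear(proj, rank=256)
```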
Unlike systems relying on single-channel binary masks for region selection, DreamOmni3’s joint input scheme ingests two RGB images in parallel:
- $I_{\mathrm{src}}$: the original source image;
- $I_{\mathrm{scr}}$: the user-modified image containing colored scribbles that explicitly mark intended regions for editing or generation.
Brush colors in $I_{\mathrm{scr}}$ (e.g., red, blue, green) specify multiple, non-overlapping semantic regions within the input. This approach surpasses the scalability limitations of binary mask-based pipelines, allowing for richer and more nuanced control over simultaneous edit locations. When relevant, an external reference image $I_{\mathrm{ref}}$ is incorporated as a separate, unpaired RGB input, since pixel-perfect alignment outside the edited region is generally unnecessary.
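A minimal sketch, assuming a fixed palette of near-pure brush colors (both assumptions, not the paper's specification), of how per-color region masks could be recovered from the $(I_{\mathrm{src}}, I_{\mathrm{scr}})$ pair by examining only the pixels the user actually changed:

```python
import numpy as np

# Illustrative brush palette; DreamOmni3's actual supported colors are not assumed here.
BRUSH_COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def scribble_masks(src: np.ndarray, scr: np.ndarray, tol: int = 40) -> dict:
    """Return one boolean HxW mask per brush color, restricted to pixels the user changed."""
    changed = np.any(src.astype(np.int16) != scr.astype(np.int16), axis=-1)
    masks = {}
    for name, rgb in BRUSH_COLORS.items():
        dist = np.abs(scr.astype(np.int16) - np.array(rgb, dtype=np.int16)).sum(axis=-1)
        masks[name] = changed & (dist <= tol)
    return masks

# Usage: src and scr are HxWx3 uint8 arrays; masks["red"] marks the region scribbled in red.
```

Note that DreamOmni3 itself consumes the raw image pair rather than explicit masks; this decomposition only illustrates how the colored strokes can carry multi-region information.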
2. Data Synthesis Pipeline
Construction of the DreamOmni3 training corpus begins by synthesizing examples from the multimodal DreamOmni2 dataset, using both Referseg for object localization and direct compositing. The pipeline is stratified into two main branches: scribble-based editing and scribble-based generation, each comprising several canonical subtasks.
Scribble-Based Editing Subtasks
- Scribble + Multimodal Instruction Editing: Automatically localizes target objects in both source and target images (via Referseg), overlays a randomly selected hand-drawn symbol (circle/square), and creates paired text and image-based instructions.
- Scribble + Instruction-Only Editing: As above, but omits the reference image, embedding additional object detail into the textual prompt.
- Image Fusion: Crops the object from the reference image, resizes and pastes it into the source at a region delineated by a scribble, cataloging both transformations and user-stroke overlays (a compositing sketch follows this list).
- Doodle Editing: Converts extracted targets into abstract sketches (“doodles”) using GPT-Image-1, then reinserts both the sketch and its overlay.
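The Image Fusion subtask above reduces to a crop-resize-paste composite. The sketch below illustrates that step with Pillow, using hypothetical bounding boxes (`ref_box`, `target_box`) in place of the actual Referseg localization and scribble-derived region:

```python
from PIL import Image

def compose_fusion_example(src_path: str, ref_path: str,
                           ref_box: tuple, target_box: tuple) -> Image.Image:
    """Paste the object cropped from the reference image into the source at the scribbled region.

    ref_box / target_box are (left, top, right, bottom) pixel boxes; in the real pipeline they
    would come from Referseg localization and the user's stroke, respectively (assumed here).
    """
    src = Image.open(src_path).convert("RGB")
    ref = Image.open(ref_path).convert("RGB")
    obj = ref.crop(ref_box)
    width = target_box[2] - target_box[0]
    height = target_box[3] - target_box[1]
    composite = src.copy()
    composite.paste(obj.resize((width, height)), (target_box[0], target_box[1]))
    return composite
```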
Scribble-Based Generation Subtasks
- Scribble + Multimodal Instruction Generation: Populates a blank canvas with scribble symbols at each target location; an auxiliary reference image and a multimodal instruction direct the generation.
- Scribble + Instruction-Only Generation: Omits reference images, translating task details solely through textual input.
- Doodle Generation: Inserts a doodle sketch into the blank canvas at the user’s intended location, with textual instructions guiding complete object/context generation.
Dataset statistics:
| Subset | Multimodal | Text-Only | Fusion | Doodle |
|---|---|---|---|---|
| Editing | 32K | 14K | 16K | 8K |
| Generation | 29K | 10K | — | 8K |
3. Objective Functions and Training Protocols
DreamOmni3 formulates each task as a mapping
$$f_\theta:\ \big(I_{\mathrm{src}},\ I_{\mathrm{scr}},\ I_{\mathrm{ref}},\ T\big)\ \longmapsto\ \hat{I},$$
where $T$ is the textual instruction and $I_{\mathrm{ref}}$ is optional. The output $\hat{I}$ is supervised against the ground-truth targets produced by the data synthesis pipeline described above.
Letting $\theta$ denote the model parameters augmented by the LoRA adapters, the primary objective is a diffusion reconstruction loss:
$$\mathcal{L}_{\mathrm{diff}} \;=\; \mathbb{E}_{z_0,\, c,\, t,\, \epsilon}\Big[\big\lVert \epsilon_\theta(z_t, t, c) - \epsilon \big\rVert_2^2\Big],$$
with $z_t$ the intermediate noisy latent at time $t$, $\epsilon$ the injected noise, and $c$ the encoded multimodal instruction.
Optionally, an adversarial loss is added to induce image sharpness and detail:
$$\mathcal{L}_{\mathrm{adv}} \;=\; \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{\hat{x}}\big[\log\big(1 - D(\hat{x})\big)\big],$$
where $x$ is a real image and $\hat{x}$ is a model output.
The full training loss is:
$$\mathcal{L} \;=\; \lambda_{\mathrm{diff}}\,\mathcal{L}_{\mathrm{diff}} \;+\; \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}},$$
with the weighting hyperparameters $\lambda_{\mathrm{diff}}$ and $\lambda_{\mathrm{adv}}$ set to the values reported in (Xia et al., 27 Dec 2025).
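As a schematic rendering of this objective (not the paper's training code), the sketch below combines a noise-prediction loss with the optional adversarial term; the toy noise schedule, the `lambda_adv` default, and the `model`/`discriminator` signatures are assumptions:

```python
import torch
import torch.nn.functional as F

def training_loss(model, discriminator, z0, cond, lambda_adv=0.1):
    """Diffusion reconstruction loss plus an adversarial term (lambda_adv is a placeholder)."""
    t = torch.rand(z0.size(0), device=z0.device)        # sampled diffusion time in [0, 1)
    noise = torch.randn_like(z0)
    alpha = (1.0 - t).view(-1, 1, 1, 1)                 # toy linear noise schedule
    z_t = alpha * z0 + (1.0 - alpha) * noise            # intermediate noisy latent z_t
    pred = model(z_t, t, cond)                          # predicted noise eps_theta(z_t, t, c)
    loss_diff = F.mse_loss(pred, noise)

    # Crude denoised estimate fed to the discriminator for the adversarial term.
    x_hat = (z_t - (1.0 - alpha) * pred) / alpha.clamp(min=1e-3)
    logits = discriminator(x_hat)
    loss_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return loss_diff + lambda_adv * loss_adv
```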
4. Encoding Schemes
DreamOmni3 relies on two core encoding mechanisms for input harmonization within the transformer backbone:
A. Positional Encoding
Leveraging the sinusoidal scheme proposed in “Scalable Diffusion Models with Transformers” (DiT), each token’s 2D pixel coordinate $(x, y)$ is encoded with fixed sine-cosine frequencies:
$$\mathrm{PE}(x, y) = \big[\sin(\omega_k x),\ \cos(\omega_k x),\ \sin(\omega_k y),\ \cos(\omega_k y)\big]_{k=0}^{d/4-1}, \qquad \omega_k = 10000^{-4k/d},$$
where $d$ is the token feature dimension. Identical positional encodings are applied independently to $I_{\mathrm{src}}$ and $I_{\mathrm{scr}}$ at corresponding pixel locations, producing exact alignment in the attention mechanism.
B. Index Encoding
Each input image is assigned a learned index embedding $e_{\mathrm{idx}}(i)$ that acts as a modality identifier within multi-image attention:
$$h \;\leftarrow\; h + \mathrm{PE}(x, y) + e_{\mathrm{idx}}(i).$$
These index embeddings, matching the token feature dimension $d$, are added to each token, ensuring that the transformer layers can distinguish and condition operations on input type.
Applying the same position and index encodings to $I_{\mathrm{src}}$ and $I_{\mathrm{scr}}$ proves critical for pixel-perfect interpretation and alignment of user strokes, as confirmed by ablation gains exceeding 2% in editing accuracy (Xia et al., 27 Dec 2025).
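To make the recipe concrete, the sketch below adds a shared 2D sin-cos positional table (in the spirit of DiT) and a learned index embedding to flattened image tokens; dimensions, slot assignments, and the frequency layout are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

def sincos_2d(h: int, w: int, dim: int) -> torch.Tensor:
    """2D sin-cos positional table of shape (h*w, dim); half the channels encode y, half x."""
    def axis(pos: torch.Tensor, d: int) -> torch.Tensor:
        omega = 1.0 / (10000 ** (torch.arange(d // 2, dtype=torch.float32) / (d // 2)))
        angles = pos.float()[:, None] * omega[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
    enc_y = axis(torch.arange(h), dim // 2).repeat_interleave(w, dim=0)   # (h*w, dim/2)
    enc_x = axis(torch.arange(w), dim // 2).repeat(h, 1)                  # (h*w, dim/2)
    return torch.cat([enc_y, enc_x], dim=1)

class TokenEncoder(nn.Module):
    """Add a positional table plus a learned per-slot index embedding to flattened tokens."""
    def __init__(self, dim: int, num_slots: int = 3):
        super().__init__()
        self.index_emb = nn.Embedding(num_slots, dim)

    def forward(self, tokens: torch.Tensor, slot: int, h: int, w: int) -> torch.Tensor:
        pe = sincos_2d(h, w, tokens.size(-1)).to(tokens.device)
        return tokens + pe + self.index_emb.weight[slot]

# Synchronized encoding: I_src and I_scr share BOTH the positional table and the index slot,
# so their tokens stay pixel-aligned in attention; a reference image would use a distinct slot.
enc = TokenEncoder(dim=64)
src_enc = enc(torch.randn(16 * 16, 64), slot=0, h=16, w=16)
scr_enc = enc(torch.randn(16 * 16, 64), slot=0, h=16, w=16)   # same slot as the source
ref_enc = enc(torch.randn(16 * 16, 64), slot=1, h=16, w=16)   # reference gets its own index
```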
5. Benchmark Construction and Evaluation Metrics
DreamOmni3 introduces a comprehensive benchmark suite spanning real-world image editing and generation for both concrete objects and abstract attributes, across diverse scene types (people, animals, products, design styles). Each of the four editing and three generation subtasks is systematically evaluated using both automated and human-centric protocols.
Key evaluation metrics:
- VLM Success Rate: Automated binary pass/fail detection (instruction compliance) using Gemini 2.5 and Doubao 1.6 VLMs, for both concrete and abstract edits.
- Human Preference Score: Outcome is successful if at least 3 out of 5 annotators approve the output.
- FID: Fréchet Inception Distance for generative realism.
- IoU: Overlap (intersection-over-union) between the user’s indicated scribble region and the model’s resulting edited region, measuring spatial precision.
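For spatial precision, a minimal sketch of how the IoU metric might be computed, assuming the scribble region is given as a boolean mask and approximating the “edited region” by per-pixel RGB change above a small threshold (both assumptions, not the benchmark's exact protocol):

```python
import numpy as np

def region_iou(scribble_mask: np.ndarray, before: np.ndarray, after: np.ndarray,
               tol: int = 10) -> float:
    """IoU between the user-scribbled region and the pixels the model actually changed."""
    edited = np.abs(after.astype(np.int16) - before.astype(np.int16)).sum(axis=-1) > tol
    inter = np.logical_and(scribble_mask, edited).sum()
    union = np.logical_or(scribble_mask, edited).sum()
    return float(inter) / float(union) if union else 0.0
```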
Benchmark results (human evaluation, pass rates):
| Task | DreamOmni3 | GPT-4o | Nano Banana |
|---|---|---|---|
| Scribble Editing | 57.5% | 58.8% | 41.3% |
| Scribble Generation | 53.5% | 39.5% | 23.3% |
Ablation experiments show a +10% gain for editing with joint (paired) input over single-image schemes, and further improvements with synchronized encoding: +2.5% editing, +12% generation (Xia et al., 27 Dec 2025).
6. Distinctive Features and Contributions
DreamOmni3 introduces several advancements over prior multimodal editing and generation frameworks:
- Beyond Binary Masks: Colored scribbles in RGB space offer scalable, fine-grained user control, avoiding the complexity and inflexibility of binary inpainting masks, particularly for multi-region edits. The system’s joint input pipeline maintains pixelwise consistency in unedited regions.
- Unified Multimodal Integration: Enables seamless incorporation of textual, image, and scribble-based instructions within a single system, maintaining the underlying diffusion-transformer structure established in DreamOmni2.
- Encoding Alignment: Identical application of positional and index encodings to both $I_{\mathrm{src}}$ and $I_{\mathrm{scr}}$ is shown to be critical for accurate region localization and consistent editing fidelity.
- Benchmark Establishment: Establishes the first comprehensive, real-image evaluation suite for scribble-based editing and generation, equipped with multimodal and automated evaluation strategies.
Collectively, these features position DreamOmni3 as an extensible, interactive framework for high-fidelity image creation and editing, combining the expressive affordances of freehand sketching with robust diffusion-based image generation and manipulation (Xia et al., 27 Dec 2025).