
DreamOmni2: Multimodal Editing & Generation

Updated 12 October 2025
  • DreamOmni2 is a unified multimodal framework that supports both text and image-based instructions to perform fine-grained editing and manipulate abstract attributes.
  • Its data synthesis pipeline uses dual-branch feature mixing, and its architecture adds index encoding and a position encoding shift to enhance visual fidelity and overcome limitations of traditional generation methods.
  • Its comprehensive data synthesis pipeline and joint training with vision-language models demonstrate robust performance across diverse creative applications and benchmarks.

DreamOmni2 is a unified multimodal framework for instruction-based editing and generation, designed to overcome the limitations of existing image editing and subject-driven generation models by fully supporting both textual and image-based instructions and extending generation capability from concrete object-centric cases to abstract concept manipulations. The model framework, data synthesis pipeline, and new benchmarks are specifically architected to address fine-grained, real-world user needs in creative domains and general artificial intelligence research (Xia et al., 8 Oct 2025).

1. Motivation and Scope

DreamOmni2 addresses two long-standing challenges in controllable image generation: (1) ambiguity in language-only instruction-based editing, which fails to capture nuanced visual details, and (2) the narrow focus of subject-driven generation on concrete objects and people, neglecting the flexible manipulation of abstract attributes such as texture, style, or mood. By accepting multimodal instructions that combine text with one or more reference images, DreamOmni2 enables detailed, precise control over both direct object modifications and the transfer or creation of abstract attributes. The expanded task formulation closely aligns with practical workflows in design, advertising, and content creation, where complex combinations of literal and stylistic cues are often required.

2. Core Tasks: Multimodal Instruction-Based Editing and Generation

DreamOmni2 proposes two novel unified tasks:

  • Multimodal Instruction-Based Editing: Accepts both a source image and multimodal instructions (textual and/or reference images) to output a modified version of the source, supporting edits of both concrete elements (e.g., altering an object’s pattern using a reference) and abstract attributes (e.g., transferring material, mood, or hairstyle).
  • Multimodal Instruction-Based Generation: Allows the generation of a new image from a multi-image instruction set and textual description, leveraging reference images that encode not just literal content but also abstract conceptual or stylistic guidance. The model is trained to associate these references with nuanced changes in the generated image, bypassing the main limitations of previous subject-driven generation approaches.

This dual formulation broadens the scope from hard-constrained, object-centric edits to genuinely creative, concept-driven synthesis, which is essential for applications in fields such as fashion, industrial design, and digital art.
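
As an illustration of how the two tasks differ only in whether a source image is present, here is a minimal interface sketch; the types and function names are hypothetical and do not reflect the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class MultimodalInstruction:
    """Free-form text plus reference images carrying concrete or abstract cues."""
    text: str
    reference_images: List[Any] = field(default_factory=list)

def edit(source_image: Any, instruction: MultimodalInstruction) -> Any:
    """Multimodal instruction-based editing: return a modified version of source_image."""
    raise NotImplementedError  # placeholder; the DreamOmni2 backbone would run here

def generate(instruction: MultimodalInstruction) -> Any:
    """Multimodal instruction-based generation: synthesize a new image from references and text."""
    raise NotImplementedError  # placeholder; the DreamOmni2 backbone would run here
```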

3. Data Synthesis Pipeline

The data pipeline for training DreamOmni2 is constructed in three sequential stages to provide large-scale, high-quality examples that pair source images, reference images, multimodal instructions, and target output images.

Stage 1: Dual-Branch Feature Mixing for Extraction Data

A two-branch feature mixing mechanism creates paired (source, target) images that share defined features (object or abstract attribute). The mixing is realized at the attention layer, with query/key/value tensors constructed as:

$$\text{Attn}_{\text{tar}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$

where $Q = [Q_{\text{tar}}^{n}; Q_{\text{tar}}^{t}]$ and $K = [K_{\text{tar}}^{n}; K_{\text{tar}}^{t}; K_{\text{src}}^{n}]$, using noise ($n$) and text ($t$) features from the target branch together with noise features from the source branch. This attention construction enables high-resolution feature mixing and overcomes blending defects seen in diptych-based methods.
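
The following is a minimal PyTorch sketch of this mixed attention, assuming pre-projected query/key/value tensors of shape (batch, tokens, dim); tensor names are illustrative, and V is assumed to mirror K's token layout, which the formulation above does not spell out.

```python
import torch

def mixed_attention(q_tar_noise, q_tar_text,
                    k_tar_noise, k_tar_text, k_src_noise,
                    v_tar_noise, v_tar_text, v_src_noise):
    """Dual-branch feature mixing at a single attention layer (sketch)."""
    d = q_tar_noise.shape[-1]
    # Q = [Q_tar^n; Q_tar^t]: the target branch's noise and text queries.
    q = torch.cat([q_tar_noise, q_tar_text], dim=1)
    # K = [K_tar^n; K_tar^t; K_src^n]: target keys plus the source branch's
    # noise keys, which lets shared features flow from source to target.
    k = torch.cat([k_tar_noise, k_tar_text, k_src_noise], dim=1)
    # Assumption: V follows K's token layout.
    v = torch.cat([v_tar_noise, v_tar_text, v_src_noise], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v
```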

Stage 2: Multimodal Editing Data Formation

  • Target images are generated via text-to-image models (prompts from an LLM) or by extracting keywords from real images (using a VLM).
  • An extraction model (trained on Stage 1 data) identifies reference image segments corresponding to the target attributes or objects.
  • An instruction-based editing model modifies the target image (per extracted keyword), yielding the source image.
  • An LLM composes the instruction, producing a (source, instruction, reference, target) tuple for supervised training, as sketched below.
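
A sketch of how these steps compose into one supervised tuple follows; the model handles (extraction_model, editing_model, llm) and their method names are hypothetical stand-ins for the Stage 1 extraction model, an instruction-based editor, and an LLM.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class EditingTuple:
    source: Any            # target image with the keyword's attribute removed/altered
    instruction: str       # multimodal instruction composed by an LLM
    references: List[Any]  # reference segments found by the extraction model
    target: Any            # original image, used as the ground-truth output

def build_stage2_tuple(target_image: Any, keyword: str,
                       extraction_model: Any, editing_model: Any, llm: Any) -> EditingTuple:
    # 1) Locate the reference segment that carries the keyword's object/attribute.
    reference = extraction_model.extract(target_image, keyword)
    # 2) Edit that object/attribute away to obtain the source image.
    source = editing_model.edit(target_image, f"alter the {keyword}")
    # 3) Have an LLM phrase the instruction mapping (source, reference) -> target.
    instruction = llm.compose_instruction(keyword)
    return EditingTuple(source, instruction, [reference], target_image)
```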

Stage 3: Multimodal Generation Data

Extraction is performed again on Stage 2 source images to create further reference images, generating diversified multimodal tuples for the generation task.

This pipeline yields training data that encompasses both local (object-level) and global (abstract attribute) guidance and accommodates varying numbers of reference images, ensuring robustness across complex multimodal conditions.

4. Model Architecture: Index Encoding and Position Encoding Shift

DreamOmni2’s backbone extends standard sequence-based unified generation/editing models to support multi-image, multimodal input:

  • Index Encoding: Each reference image is indexed explicitly using dedicated tokens or input channels. This allows the model to maintain the correspondence between references named in the instruction (e.g., "Image 3") and the associated image content during fusion and generation.
  • Position Encoding Shift: When multiple images are concatenated in the input, naive position encoding leads to spatial ambiguity and pixel confusion. DreamOmni2 introduces a shifting scheme in which position encodings are offset for subsequent images, proportionate to the size of preceding images. This design preserves spatial separability and reduces copy-paste or overfitting artifacts common in earlier concatenation frameworks.

The combination of index encoding and position shift is empirically demonstrated to be crucial for visual fidelity and correct detail transfer, especially in cases with closely similar or highly abstract reference examples.
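
Below is one way the position encoding shift could be realized, assuming 2D positions over token grids; the diagonal offset and the explicit per-image index channel are assumptions consistent with, but not specified by, the description above.

```python
import torch

def indexed_shifted_positions(image_sizes):
    """Assign [image_index, y, x] coordinates to concatenated image token grids,
    offsetting each image by the extent of all preceding images (sketch)."""
    coords, offset_h, offset_w = [], 0, 0
    for idx, (h, w) in enumerate(image_sizes):
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pos = torch.stack([ys + offset_h, xs + offset_w], dim=-1).reshape(-1, 2)
        index = torch.full((pos.shape[0], 1), idx)  # index encoding ("Image idx+1")
        coords.append(torch.cat([index, pos], dim=-1))
        offset_h += h  # shift so the next image never reuses these coordinates
        offset_w += w
    return torch.cat(coords, dim=0)  # (total_tokens, 3)

# Example: a 32x32 target latent followed by two 16x16 reference latents.
# positions = indexed_shifted_positions([(32, 32), (16, 16), (16, 16)])
```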

5. Joint Training with Vision-Language Models

One practical challenge addressed is the gap between well-structured training data and the irregular, potentially logically inconsistent instructions encountered in public deployment. DreamOmni2 employs a tightly coupled joint training approach:

  • A vision-language model (VLM; e.g., Qwen2.5-VL 7B) is fine-tuned to normalize raw user instructions into structured, model-consumable descriptions.
  • The generation/editing backbone is trained using these refined outputs (via parameter-efficient LoRA updates on Flux Kontext).

This joint paradigm allows DreamOmni2 to maintain high performance even in challenging out-of-distribution or complex instruction scenarios, as validated by benchmark metrics and expert reviews.
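
A minimal sketch of the resulting two-stage inference flow is shown below, with hypothetical method names (rewrite, edit) standing in for the fine-tuned VLM and the LoRA-tuned backbone.

```python
def edit_with_normalized_instruction(vlm, backbone, source_image,
                                     reference_images, raw_instruction):
    """Two-stage inference: the VLM normalizes the free-form user instruction
    into the structured form seen during training, then the editing backbone
    consumes it together with the source and reference images (sketch)."""
    structured = vlm.rewrite(raw_instruction,
                             images=[source_image, *reference_images])
    return backbone.edit(source_image,
                         references=reference_images,
                         instruction=structured)
```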

6. Benchmarks and Empirical Performance

A bespoke benchmark for DreamOmni2 is introduced, covering:

  • Real-world images with both concrete and abstract transformation goals.
  • Scenarios with 1–5 reference images, testing the model’s robustness to increasing multimodal input complexity.

Quantitative assessments (with Gemini, Doubao, and human experts) demonstrate that DreamOmni2 outperforms reference methods such as GPT-4o, Nano Banana, and UNO in both instruction-based editing and generation. Key strengths include improved success rate, consistency between edits and instructions, and faithful transfer of both object and abstract visual features.

Visual inspection confirms DreamOmni2 produces outputs with accurate attribute blending and significantly reduced artifact rates compared to prior state-of-the-art.

7. Implications and Future Directions

DreamOmni2 demonstrates the feasibility and advantages of multimodal instruction processing for unified generation and editing. Its framework supports complex creative tasks, automated content creation, and applications requiring nuanced, semantically consistent visual control. By setting a new benchmark for multimodal–multireference workflows, the work will likely drive advancement in robust content editing and the intersection of vision–language–image understanding models. A plausible implication is that future systems may further expand to support additional modalities (e.g., video, 3D), as the architecture is inherently extensible to a broad class of world modeling and AGI tasks.

DreamOmni2 also establishes a paradigm for training with synthetic data pipelines and for bridging the structure gap between real-world instructions and model-ready input, which may influence data curation practices and the design of advanced instruction-following models in vision applications.
