OminiGen: Unified Multimodal Generation

Updated 23 December 2025
  • OminiGen is a unified generative model family enabling tasks such as text-to-image synthesis, advanced editing, and sensor simulation for autonomous driving.
  • It employs innovative diffusion techniques, including rectified flow and Omni-RoPE encoding, alongside a unified Transformer for seamless multimodal processing.
  • Experimental benchmarks reveal robust zero-shot task transfer and high-quality, consistent outputs across diverse modalities and applications.

OminiGen refers to a class of unified generative models designed to handle complex multimodal tasks within a single framework. This family initially targeted unified image synthesis and editing, and was later extended to encompass advanced multimodal generation and autonomous driving sensor simulation. Central to OminiGen systems is architectural simplicity coupled with the ability to span diverse tasks—ranging from text-to-image, image editing, and visual reasoning, to strictly aligned LiDAR and camera sensor emulation for autonomous systems. Notable milestones include "OmniGen: Unified Image Generation" (Xiao et al., 17 Sep 2024), "OmniGen2: Exploration to Advanced Multimodal Generation" (Wu et al., 23 Jun 2025), and "OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving" (Tang et al., 16 Dec 2025).

1. Unified Diffusion Framework for Image Generation

The original OminiGen (Xiao et al., 17 Sep 2024) introduced the first unified, end-to-end diffusion model for general-purpose image generation. The system combines a frozen Variational Autoencoder (VAE) from SDXL, which maps pixel images into a latent space, with a single large Transformer (initialized from Phi-3) that jointly processes text and image tokens. Crucially, no specialized plug-ins or task-specific branches are required; arbitrary interleaved instructions (e.g., text, image references, segmentation masks) are tokenized and processed as a single unified sequence.

Input images are encoded into non-overlapping patches (patch size 2) and embedded as "visual tokens," delimited by special <img> markers; text is tokenized by the Phi-3 tokenizer. A modified attention mask allows bidirectional attention within each image span while enforcing strictly causal ordering across the rest of the sequence, which supports stable high-resolution synthesis and effective conditional generation.
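
A minimal sketch of such a mask, assuming tokens are tagged with a per-image span id (the helper below is illustrative, not the released implementation):

```python
import torch

def build_omnigen_style_mask(span_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of the modified attention mask: bidirectional inside each image
    span, strictly causal everywhere else. span_ids is (L,) with 0 for text
    tokens and a distinct positive id per image. Returns (L, L) booleans where
    True means "token i may attend to token j"."""
    L = span_ids.shape[0]
    idx = torch.arange(L)
    causal = idx[:, None] >= idx[None, :]                                     # causal ordering
    same_image = (span_ids[:, None] == span_ids[None, :]) & (span_ids[:, None] > 0)
    return causal | same_image                                                # image tokens see their whole span

# Toy sequence: two text tokens, a 3-token image span (id 1), one more text token.
mask = build_omnigen_style_mask(torch.tensor([0, 0, 1, 1, 1, 0]))
```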

OminiGen replaces the standard DDPM forward process with rectified-flow (linear velocity field) diffusion. For a clean latent $x$ and noise $\epsilon \sim \mathcal{N}(0, I)$, the noised latent at timestep $t$ is $x_t = t\,x + (1 - t)\,\epsilon$. The model predicts the velocity $x - \epsilon$, trained with the mean squared loss:

$$\mathcal{L} = \mathbb{E}_{x,\epsilon,t,c}\left[ \| (x - \epsilon) - v_{\theta}(x_t, t, c) \|^2 \right].$$

For editing, a spatial weighting mask upweights loss where the output must differ from the input, efficiently suppressing shortcut copying in region editing scenarios.
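
A compact training-step sketch of this objective, including the optional spatial weighting; the model's call signature is an assumption:

```python
import torch

def rectified_flow_loss(model, x, cond, edit_weight=None):
    """Sketch of the training objective above. x: clean VAE latents (B, C, H, W);
    cond: conditioning tokens; edit_weight: optional (B, 1, H, W) mask that
    upweights regions where the output must differ from the input."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)   # t ~ U(0, 1)
    eps = torch.randn_like(x)
    x_t = t * x + (1.0 - t) * eps                          # x_t = t*x + (1 - t)*eps
    v_pred = model(x_t, t.flatten(), cond)                 # v_theta(x_t, t, c); signature assumed
    err = ((x - eps) - v_pred) ** 2                        # squared error against velocity x - eps
    if edit_weight is not None:
        err = err * edit_weight                            # upweight regions that must change
    return err.mean()
```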

OminiGen was trained on the X2I corpus (0.1B images, covering text-to-image, editing, control tasks, and classic vision problems) with staged resolution up to $2240^2$. This led to generalization abilities including zero-shot task transfer, task composition, and in-context domain adaptation.

2. Extension to Advanced Multimodal Generation: OmniGen2

OmniGen2 (Wu et al., 23 Jun 2025) advances the original model by introducing a dual-path architecture with strict parameter decoupling between the image and text modalities. The multimodal backbone, initialized from Qwen2.5-VL-3B, processes interleaved text and ViT-encoded image tokens, with an image-specific positional encoding (Omni-RoPE) capturing both sequence position and 2D spatial coordinates.
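
A sketch of how Omni-RoPE-style position ids could be assembled, assuming the decomposition into a sequence id plus 2D spatial coordinates described above (the released formulation may differ in detail):

```python
import torch

def omni_rope_position_ids(token_spans):
    """Assemble (sequence_id, h, w) triples for an interleaved sequence (sketch).
    token_spans: list of ("text", n_tokens) or ("image", height, width).
    Text tokens advance the sequence id and carry zero spatial coordinates;
    all tokens of one image share a sequence id but keep their own (h, w)."""
    ids, seq_id = [], 0
    for span in token_spans:
        if span[0] == "text":
            for _ in range(span[1]):
                ids.append((seq_id, 0, 0))
                seq_id += 1
        else:
            _, height, width = span
            for h in range(height):
                for w in range(width):
                    ids.append((seq_id, h, w))   # shared sequence id, local 2D coords
            seq_id += 1
    return torch.tensor(ids)                     # shape (L, 3)

# Example: a 5-token prompt followed by a 4x4 grid of image tokens.
positions = omni_rope_position_ids([("text", 5), ("image", 4, 4)])
```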

Text generation employs an autoregressive head with the standard cross-entropy loss, while images are produced by a dedicated diffusion transformer (32 layers, hidden size 2520, roughly 4B parameters) operating exclusively on VAE latents and hidden states extracted from the multimodal backbone. No parameters are shared between the text and image branches; ablations validate that this choice preserves high image quality and the backbone's original vision-language capabilities.

Training data includes 140M open-source image-caption pairs for text-to-image, refined image editing pairs (inpaint- and video-derived), and new in-context generation and editing pipelines leveraging subject appearance consistency filtering and instruction synthesis (Qwen2.5-VL-72B, DINO, GroundingDINO, SAM2).
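
A sketch of the appearance-consistency filtering idea described above, assuming a generic image encoder (e.g., DINO features) and a hypothetical similarity threshold; the actual pipeline's models and cutoffs are not reproduced here:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def keep_consistent_subject_pairs(encoder, crops_a, crops_b, threshold=0.8):
    """Filter candidate subject pairs by embedding similarity (sketch).
    encoder: any image encoder returning (N, D) features (e.g., DINO);
    crops_a / crops_b: (N, 3, H, W) subject crops from source and target frames;
    threshold: hypothetical cutoff, not a value reported for the actual pipeline."""
    feats_a = F.normalize(encoder(crops_a), dim=-1)
    feats_b = F.normalize(encoder(crops_b), dim=-1)
    similarity = (feats_a * feats_b).sum(dim=-1)   # per-pair cosine similarity
    return similarity >= threshold                  # boolean keep-mask
```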

A key innovation is the reflection mechanism: after generating an image, a strong multimodal LLM critiques failures and proposes corrections, recursively producing a rich reflection dataset. Fine-tuning on this data improves edit consistency, compositionality, and robustness to model errors.
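
A schematic of the reflection loop, with `generate`, `critique`, and `revise` standing in as hypothetical helpers rather than the released pipeline:

```python
def reflect_and_regenerate(prompt, generate, critique, revise, max_rounds=3):
    """Schematic reflection loop. generate(prompt) -> image; critique(prompt, image)
    -> (ok, feedback) from a multimodal LLM; revise(prompt, feedback) -> corrected
    instruction. The collected trace can serve as reflection training data."""
    trace = []
    image = generate(prompt)
    for _ in range(max_rounds):
        ok, feedback = critique(prompt, image)
        trace.append((prompt, image, feedback))
        if ok:
            break
        prompt = revise(prompt, feedback)   # fold the critique into the instruction
        image = generate(prompt)            # retry with the corrected prompt
    return image, trace
```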

3. Experimental Results and Benchmarks

OminiGen and its descendants demonstrate competitive or state-of-the-art performance across canonical benchmarks:

| Task Domain | Benchmark / Metric | OmniGen / OmniGen2 | Open-source SOTA |
|---|---|---|---|
| Text-to-Image | GenEval (overall) | 0.70 (OmniGen); 0.80 / 0.86 (OmniGen2) | SD3: 0.68; BAGEL: 0.88 |
| Image Editing | EMU-Edit (CLIP-T / CLIP-I) | 0.231 / 0.829 | EMU-Edit: 0.231 / 0.859 |
| In-Context Generation | OmniContext | 7.18 (OmniGen2) | BAGEL: 5.73 |
| Subject-Driven Generation | DreamBench (CLIP-T / CLIP-I) | 0.315 / 0.801 | DB: 0.305 / 0.803 |
| Visual Control | Segmentation mIoU / F1 / SSIM / RMSE | 40.06 / 38.96 / 0.8332 / 31.71 | ControlNet++: 43.64 |
| Classic Vision (qualitative) | Deblurring, deraining, etc. | Plausible results | n/a |

Reflection-based fine-tuning in OmniGen2 yields further gains in consistency and prompt fidelity, particularly in in-context and long-prompt settings. For example, on DPG-Bench, OmniGen2 achieves 83.57% correctness, near parity with SD3-medium (84.08%) and BAGEL (85.07%) (Wu et al., 23 Jun 2025).

4. Unified Multimodal Sensor Generation in Autonomous Driving

OminiGen (Tang et al., 16 Dec 2025) extends the unified generation paradigm to concurrent LiDAR and camera sensor simulation for autonomous driving, addressing limitations of prior single-modality approaches. Central to this system is a unified Bird’s Eye View (BEV) latent space. Multi-view camera images are projected via Lift-Splat-Shoot into a 3D voxel grid, fused with voxelized LiDAR point clouds (via sparse 3D CNN), then collapsed into a BEV feature map where each 2D pixel encodes fused vertical structure.
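
A sketch of the final collapse step, assuming a fused voxel feature grid of shape (B, C, Z, H, W); folding the vertical axis into the channel dimension is one common way to form a BEV map, and the released model may use a different reduction:

```python
import torch
import torch.nn as nn

class VoxelToBEV(nn.Module):
    """Collapse a fused 3D voxel grid (B, C, Z, H, W) into a BEV map (sketch)."""

    def __init__(self, in_channels: int, z_bins: int, bev_channels: int):
        super().__init__()
        # Fold the vertical (Z) axis into channels, then mix with a 1x1 conv.
        self.proj = nn.Conv2d(in_channels * z_bins, bev_channels, kernel_size=1)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        b, c, z, h, w = voxels.shape
        flat = voxels.reshape(b, c * z, h, w)    # stack vertical structure per BEV pixel
        return self.proj(flat)                   # (B, bev_channels, H, W)

bev = VoxelToBEV(in_channels=32, z_bins=16, bev_channels=256)(torch.randn(1, 32, 16, 128, 128))
```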

The Unified Autoencoder for Environment (UAE) encodes sensor datastreams into discrete BEV tokens; decoding leverages a NeRF-style volume rendering process, reconstructing camera images and LiDAR with accurate spatial alignment. For each modality, rays are cast through the latent BEV, aggregating features and employing a signed distance field predicted by a lightweight MLP. Vector quantization ensures a diffusion-friendly latent representation.
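
The vector-quantization step can be illustrated with a minimal nearest-codebook lookup; the code below is a generic VQ sketch, not the UAE's exact implementation:

```python
import torch

def vector_quantize(bev_feats: torch.Tensor, codebook: torch.Tensor):
    """Nearest-codebook quantization of BEV features (sketch of the VQ step).
    bev_feats: (N, D) continuous features at flattened BEV positions;
    codebook: (K, D) learned code vectors. Returns (quantized, indices)."""
    # Squared Euclidean distance between every feature and every code.
    dists = (bev_feats.pow(2).sum(-1, keepdim=True)
             - 2.0 * bev_feats @ codebook.t()
             + codebook.pow(2).sum(-1))
    indices = dists.argmin(dim=-1)
    quantized = codebook[indices]
    # Straight-through estimator so encoder gradients survive the discretization.
    quantized = bev_feats + (quantized - bev_feats).detach()
    return quantized, indices
```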

Latent diffusion is performed in BEV token space by a Diffusion Transformer (DiT) with an optional ControlNet branch, conditioned on BEV sketches, 3D bounding boxes, and text (T5 embeddings). Classifier-free guidance enables fine-grained control over input modalities.
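
At sampling time, classifier-free guidance combines conditional and unconditional predictions; the interface and guidance scale below are assumptions for illustration:

```python
def cfg_prediction(model, x_t, t, cond, null_cond, guidance_scale=4.0):
    """Classifier-free guidance in BEV token space (sketch, interface assumed).
    cond carries the BEV sketch, 3D boxes, and T5 text embeddings; null_cond
    drops them, so guidance_scale controls how strongly they steer sampling."""
    pred_cond = model(x_t, t, cond)
    pred_uncond = model(x_t, t, null_cond)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```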

Evaluations show state-of-the-art unified multi-sensor generation, with camera metrics of PSNR 30.21 dB, SSIM 0.909, and LPIPS 0.033, and LiDAR metrics of Chamfer distance 0.793 and F-score 0.742. Generated samples exhibit strong multi-view consistency and semantic alignment between modalities, outperforming single-modality baselines in both fidelity and utility for downstream 3D detection and planning augmentation, as measured by mAP, NDS, L2 error, and collision rate (Tang et al., 16 Dec 2025).

5. Applications, Generalization, and Limitations

OminiGen models have enabled several practical advances:

  • Instruction-driven, multimodal generation: Open-form multimodal instructions—e.g., inpainting, stylistic transformation, object insertion—are handled with a single, user-facing interface without task-specific customization.
  • Embodied agent pipelines: Unified visual reasoning and synthesis permit agents to perform perception, scene understanding, and generation/modification in a closed loop.
  • Creative and process-level tools: Chain-of-thought generation (e.g., simulating artistic workflows) is feasible but current implementations show lower image fidelity for stepwise painting tasks compared to one-shot generation, motivating continued research into process-aware supervision (Xiao et al., 17 Sep 2024).
  • Autonomous driving simulation: Reliable, geometrically consistent LiDAR and camera synthesis supports rare event generation and data augmentation, confirmed by quantitative gains in downstream tasks such as 3D detection and motion planning (Tang et al., 16 Dec 2025).

Known limitations include subpar rendering of textual content in images, persistent errors in fine details (particularly hands and small objects), unpredictable behavior for untrained modalities (e.g., surface normal maps), and ultimate fidelity constraints due to model and data scale. In sensor synthesis, the computational cost remains moderate, with OminiGen inference times (∼5.2 s/frame) close to specialized pipelines.

6. Implementation, Data, and Open-Source Release

All major OminiGen variants are open-sourced. OmniGen and OmniGen2 release model definitions (PyTorch/HuggingFace), comprehensive training scripts (including base and reflection fine-tuning phases), pipelines for curating large-scale supervised and video-derived multimodal datasets, and evaluation suites spanning all reported benchmarks (Xiao et al., 17 Sep 2024, Wu et al., 23 Jun 2025). Dependencies and reproducibility artifacts include Docker/Conda environments, dataset manifests, and evaluation notebooks.
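
A hypothetical usage sketch following the general pattern of the released repositories; the import path, class name, checkpoint id, and argument names below are assumptions and should be checked against the official README:

```python
# Hypothetical usage sketch; names below are assumptions, not verified against
# the released code.
from OmniGen import OmniGenPipeline  # assumed import path

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")   # assumed checkpoint id
images = pipe(
    prompt="A red vintage car parked by the sea at sunset",
    height=1024,
    width=1024,
    guidance_scale=2.5,   # assumed sensible value
)
images[0].save("result.png")
```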

For the autonomous driving extension, pre- and post-processed datasets (e.g., nuScenes), codebooks, rendering pipelines, and end-to-end model weights are provided, supporting both research and applied development in unified sensor simulation (Tang et al., 16 Dec 2025).

7. Ablation Findings and Design Decisions

Empirical ablations demonstrate:

  • Strict parameter decoupling of image and text generation improves image quality (OmniGen2).
  • Decoupling VAE features from the multimodal backbone resolves architectural complexity and preserves downstream abilities.
  • MoE-based parameter sharing and query-token compression approaches underperform or degrade fine detail.
  • The Omni-RoPE positional encoding, integrating global sequence id and local spatial coords, yields substantially improved region-preserving edits.
  • In autonomous driving, omission of the BEV sketch or 3D bounding-box conditioning significantly reduces generation fidelity, as measured by FID and mAP.

These results validate the architectural minimalism and modular decoding adopted throughout the OminiGen series, and highlight unified latent-space modeling as a robust platform for future multimodal foundation models.

References

  • Xiao et al. (17 Sep 2024). OmniGen: Unified Image Generation.
  • Wu et al. (23 Jun 2025). OmniGen2: Exploration to Advanced Multimodal Generation.
  • Tang et al. (16 Dec 2025). OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving.
