Omni-Diffusion: Unified Multimodal Framework

Updated 4 July 2026

Omni-Diffusion is a unified diffusion-based framework that integrates multiple modalities—such as text, speech, and images—within a single modeling scheme.
It employs shared token spaces, modality-specialized diffusion, and adaptive role assignment to support a wide range of applications from video synthesis to medical segmentation.
The framework enhances performance by enabling iterative refinement, bidirectional context, and robust control over multimodal outputs.

Omni-Diffusion is a recent diffusion-centered research direction in which a single framework is designed to cover multiple modalities, task roles, control signals, or preference views rather than solving each subproblem with a separate model. In the narrowest and most literal sense, "Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion" defines it as a unified mask-based discrete diffusion model that directly captures the joint distribution over discrete multimodal tokens for text, speech, and images (Li et al., 6 Mar 2026). In broader arXiv usage, however, the term also names systems that unify omni-motion control in video, omni-view synthesis in 3D/4D, omni-preference alignment in video generation, omni-dimensional compression for latent video diffusion, or full-spectrum annotation viewpoints in medical segmentation (Wei et al., 12 Mar 2026, Fan et al., 11 Dec 2025, Liu et al., 2024, Chen et al., 2024, Zhang et al., 17 Jul 2025). This suggests that Omni-Diffusion is best read not as a single canonical architecture but as a family of unification strategies built around diffusion-based generative modeling.

1. Terminological range

Recent work uses the word omni in several bounded senses rather than one universal meaning. In some papers, it means any-to-any multimodal modeling across text, image, and speech; in others, it means unified control across several video conditions; in others still, it denotes completeness over viewpoints or preferences within a domain rather than modality completeness (Li et al., 6 Mar 2026, Xi et al., 15 Apr 2025, Zhang et al., 17 Jul 2025, Liu et al., 2024).

Work	Scope	Meaning of “omni”
"Omni-Diffusion" (Li et al., 6 Mar 2026)	Text, speech, images	Any-to-any multimodal understanding and generation
"DreamVideo-Omni" (Wei et al., 12 Mar 2026)	Video customization	Multi-subject identity with global, local, and camera motion control
"OmniView" (Fan et al., 11 Dec 2025)	3D/4D view synthesis	Unified camera- and time-conditioned generation tasks
"DiffOSeg" (Zhang et al., 17 Jul 2025)	Medical segmentation	Joint modeling of consensus-driven and preference-driven annotation views
"VideoDPO" (Liu et al., 2024)	Video preference alignment	Joint optimization of visual quality and semantic alignment
"Ambient Diffusion Omni" (Daras et al., 10 Jun 2025)	Diffusion training	Using low-quality, synthetic, and out-of-distribution images

A persistent misconception is that omni always means universal multimodality. The literature explicitly rejects that simplification. DiffOSeg states that its “omni” quality is not about multimodality across text, audio, and video, but about preserving both population consensus and expert-specific preference in medical image segmentation (Zhang et al., 17 Jul 2025). VideoDPO uses “omni” to describe a composite preference signal spanning visual quality and semantic alignment rather than an any-to-any model interface (Liu et al., 2024). Ambient Diffusion Omni uses the term for a training framework that can extract signal from all available images, including low-quality and out-of-distribution ones, rather than for multimodal generation (Daras et al., 10 Jun 2025).

2. Recurrent architectural patterns

Despite this semantic spread, several recurrent design patterns appear. One pattern is direct joint modeling over a shared token space. Omni-Diffusion uses a unified mask-based discrete diffusion model over text, speech, and image tokens (Li et al., 6 Mar 2026). Dynin-Omni similarly formulates omnimodal modeling as masked diffusion over a shared discrete token space spanning text, image, speech, and video understanding, with one embedding matrix and one bidirectional Transformer (Kim et al., 9 Mar 2026).

A second pattern is modality-specialized diffusion under a shared backbone. LLaDA-o adopts a Mixture of Diffusion framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation while coupling them through a shared attention backbone (You et al., 1 Mar 2026). This avoids forcing text and image generation into the same corruption process while preserving a unified multimodal model.

A third pattern is adaptive role assignment for conditions and targets. OmniVDiff learns a joint distribution over RGB, depth, semantic segmentation, and canny-edge videos in one latent diffusion process, while an adaptive modality control strategy dynamically changes whether each modality is treated as a generation modality or a conditioning modality (Xi et al., 15 Apr 2025). Related video systems use structured condition binding rather than pure joint tokenization: DreamVideo-Omni defines a control-unit abstraction pairing subject appearance with global and local motion cues, while OmniView explicitly factorizes space, time, and view conditions instead of entangling them (Wei et al., 12 Mar 2026, Fan et al., 11 Dec 2025).
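The role-switching idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than OmniVDiff's actual implementation: the function names, the `p_condition` probability, the additive-noise corruption, and the per-modality loss mask are all hypothetical. The sketch only shows the core mechanic: modalities marked as conditions pass through clean and are excluded from the loss, while modalities marked for generation are noised and supervised.

```python
import random

MODALITIES = ["rgb", "depth", "segmentation", "canny"]

def assign_roles(rng, p_condition=0.5):
    """Randomly mark each modality as 'condition' or 'generate'.

    At least one modality is always left to generate, so the loss is never empty.
    p_condition is a hypothetical hyperparameter, not a value from the paper.
    """
    roles = {m: ("condition" if rng.random() < p_condition else "generate")
             for m in MODALITIES}
    if all(role == "condition" for role in roles.values()):
        roles[rng.choice(MODALITIES)] = "generate"
    return roles

def build_model_input(latents, noise, roles, sigma):
    """Noise only the modalities being generated; keep conditioning modalities clean."""
    inputs, loss_mask = {}, {}
    for m in MODALITIES:
        if roles[m] == "generate":
            inputs[m] = [x + sigma * e for x, e in zip(latents[m], noise[m])]
            loss_mask[m] = 1.0  # supervised by the denoising loss
        else:
            inputs[m] = list(latents[m])
            loss_mask[m] = 0.0  # used as context only
    return inputs, loss_mask

rng = random.Random(0)
latents = {m: [0.0, 1.0, 2.0] for m in MODALITIES}  # toy stand-ins for video latents
noise = {m: [rng.gauss(0.0, 1.0) for _ in range(3)] for m in MODALITIES}
roles = assign_roles(rng)
inputs, loss_mask = build_model_input(latents, noise, roles, sigma=0.5)
```

Under this scheme, fixing the roles at inference time selects the task: conditioning on RGB alone corresponds to video understanding, while conditioning on depth or edges corresponds to controlled generation.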

A fourth pattern is decoupled understanding and synthesis. Omni-Video 2 and Tele-Omni both place a pretrained multimodal LLM in front of a video diffusion model: the MLLM parses instructions and visual context, while a video DiT performs synthesis conditioned on projected multimodal features and reference latents (Yang et al., 9 Feb 2026, Liu et al., 10 Feb 2026). In these systems, omni behavior is achieved through a unified conditioning interface rather than through a single homogeneous diffusion law.

3. Omnimodal token-space diffusion models

The most literal omni-diffusion line is the masked discrete diffusion family. Omni-Diffusion defines an any-to-any multimodal LLM built entirely on mask-based discrete diffusion models. Text, image, and speech are represented as a unified sequence with modality delimiters; training masks tokens and minimizes masked-token cross-entropy over the corrupted positions (Li et al., 6 Mar 2026). The model uses MAGVIT-v2 for image tokenization, SenseVoiceSmall for speech understanding, and the GLM-4-Voice decoder for speech synthesis. On LibriSpeech and LibriTTS it reports WER of 7.05 and 3.07 respectively, while on visual tasks it reports MME-Perception 1216.7, Seed-2-Plus 34.5, CLIP-T 0.235, and CLIP-I 0.667 (Li et al., 6 Mar 2026).
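The training objective described above, masking tokens and scoring only the corrupted positions, can be sketched in a few lines. This is a toy illustration under stated assumptions: the delimiter strings, the token names, the fixed mask ratio, and the uniform stand-in for the model are all hypothetical, and any timestep-dependent loss weighting that a real masked diffusion model may use is omitted.

```python
import math
import random

MASK = "<mask>"

def corrupt(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and the sorted list of masked positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK
    return corrupted, positions

def masked_cross_entropy(probs, targets, positions):
    """Mean negative log-likelihood of the true token over masked positions only.

    probs[i] is a dict mapping token -> predicted probability at position i.
    """
    losses = [-math.log(probs[i][targets[i]]) for i in positions]
    return sum(losses) / len(losses)

# Toy unified sequence; the delimiters and token names are hypothetical.
sequence = ["<text>", "a", "cat", "</text>", "<img>", "img_17", "img_42", "</img>"]
vocab = sorted(set(sequence))
rng = random.Random(0)
corrupted, positions = corrupt(sequence, mask_ratio=0.5, rng=rng)

# Stand-in for the model: a uniform distribution over the vocabulary everywhere.
uniform = {tok: 1.0 / len(vocab) for tok in vocab}
probs = [uniform for _ in sequence]
loss = masked_cross_entropy(probs, sequence, positions)
```

With a uniform predictor the loss equals the log of the vocabulary size, which is the baseline a trained model has to beat. Unmasked positions contribute nothing to the loss, regardless of which modality they belong to.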

Dynin-Omni extends this paradigm into what it calls an omnimodal unified large diffusion LLM. It represents text, images, speech, and videos as discrete tokens in one vocabulary and trains a single masked diffusion Transformer with an absorbing-state masking process (Kim et al., 9 Mar 2026). Its three-stage pipeline combines modality adaptation, model-merging-based modality expansion, and omnimodal alignment. Reported results include 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean (Kim et al., 9 Mar 2026). Relative to autoregressive unified models, the paper argues that masked diffusion provides bidirectional context and iterative refinement rather than a fixed left-to-right serialization.
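The contrast with left-to-right decoding can be made concrete with a generic confidence-based unmasking loop. This is a sketch, not Dynin-Omni's sampler: the reveal schedule, the confidence rule, and the toy predictor (which simply knows the target sentence) are assumptions chosen to keep the example self-contained.

```python
MASK = None  # absorbing state: a position stays MASK until it is committed

def iterative_unmask(length, predict, steps):
    """Start fully masked and commit the most confident predictions at each step."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        guesses = predict(seq)  # {position: (token, confidence)} for masked positions
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = guesses[i][0]
    return seq

TARGET = ["the", "cat", "sat", "on", "the", "mat"]

def toy_predict(seq):
    """Hypothetical predictor: confidence grows with the number of revealed neighbours."""
    out = {}
    for i, tok in enumerate(seq):
        if tok is MASK:
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(seq)]
            revealed = sum(seq[j] is not MASK for j in neighbours)
            out[i] = (TARGET[i], revealed)
    return out

decoded = iterative_unmask(len(TARGET), toy_predict, steps=3)
```

Because every prediction sees the whole partially revealed sequence, positions can be filled in any order, and the number of steps trades compute against how much context each commitment can use.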

LLaDA-o occupies an intermediate position. It remains diffusion-native, but rejects a single diffusion parameterization for all modalities. Its understanding expert uses discrete masked diffusion over text and projected visual semantic tokens, while its generation expert uses continuous diffusion in latent space for image synthesis (You et al., 1 Mar 2026). A data-centric adaptive length augmentation procedure enables flexible-length multimodal response generation without architectural changes. The reported headline generation result is 87.04 on DPG-Bench, while multimodal understanding results include 66.1 on MathVista, 87.9 on ChartQA, and 91.5 on DocVQA; its intra-modality bidirectional attention yields a reported 5.9× speedup relative to LLaDA-V on MathVista with comparable performance (You et al., 1 Mar 2026).

Taken together, these works define a clear subfield in which diffusion is not merely attached to a multimodal stack but serves as the backbone for multimodal understanding and generation itself. A plausible implication is that the central design choice is no longer simply diffusion versus autoregression, but which parts of a multimodal system should share a diffusion process and which should remain modality-specific.

4. Unified controllable video, audio, and 4D generation

A second major branch uses omni-diffusion to unify multiple forms of conditioning and control in generation-heavy systems. OmniVDiff is a unified controllable video diffusion framework that learns a joint distribution over RGB, depth, semantic segmentation, and canny-edge videos in one latent space (Xi et al., 15 Apr 2025). It supports text-conditioned multi-modal video generation, video understanding by predicting depth/segmentation/canny from RGB, and $X$-conditioned video generation. On UCF101 it reports FVD 527.12 and KVD 60.79 for text-conditioned RGB video generation, compared with CogVideoX at 584.74 and 70.20 (Xi et al., 15 Apr 2025).

DreamVideo-Omni extends the idea of omni from modalities to motion granularity. Its objective is harmonious multi-subject customization with simultaneous control over subject appearance, global object motion, local object motion, and camera movement (Wei et al., 12 Mar 2026). The framework combines condition-aware 3D rotary positional embedding, hierarchical motion injection, group and role embeddings, and a second-stage latent identity reward feedback learning procedure. On DreamOmni Bench it improves R-CLIP from 0.731 to 0.739, R-DINO from 0.429 to 0.499, Face-S from 0.157 to 0.301, mIoU from 0.212 to 0.558, and lowers EPE from 24.05 to 9.31 relative to DreamVideo-2 (Wei et al., 12 Mar 2026).

AudioGen-Omni makes the output space narrower (audio only) but expands the conditioning and task space. It is a unified multimodal diffusion transformer for video-synchronized audio, speech, and song generation, conditioned on arbitrary subsets of video, text, lyrics, and transcription (Wang et al., 1 Aug 2025). It uses conditional flow matching in a latent audio space, joint attention across modalities, AdaLN conditioning, and phase-aligned anisotropic positional infusion. The paper reports 1.91 seconds to generate 8 seconds of audio, and on VGGSound reports $\mathrm{FD}_{\text{PaSST}} = 58.766$, $\mathrm{IS} = 21.521$, and $\mathrm{DeSync} = 0.450$ (Wang et al., 1 Aug 2025).
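For readers unfamiliar with conditional flow matching, a common form of the objective is given here as a sketch. It assumes the widely used linear interpolation path between Gaussian noise and data; the source does not state which path or parameterization AudioGen-Omni adopts, so the symbols $x_0$, $x_1$, $c$, and $v_\theta$ are illustrative.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\,x_0,\,x_1,\,c}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2
```

Here $x_1$ is a latent audio sample, $c$ is whatever subset of conditions is available, and $v_\theta$ is the velocity field predicted by the transformer. Sampling integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$.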

OmniView generalizes the omni notion to 4D consistency tasks. It unifies static and dynamic novel view synthesis, text-to-video with camera control, image-to-video with camera control, and video-to-video camera redirection by separating space, time, and view conditions (Fan et al., 11 Dec 2025). Its core technical claim is that camera tokens should receive 2D RoPE while video tokens receive 3D RoPE, with fusion by channel-wise concatenation rather than additive entanglement. The paper reports up to 33% better image-quality scores on multiview NVS LLFF, 60% on dynamic NVS in Neural 3D Video, 20% on static camera control on RE-10K, and a roughly 4× reduction in camera trajectory errors for text-conditioned video generation (Fan et al., 11 Dec 2025).

Omni-Video 2 and Tele-Omni represent a related but distinct tendency: rather than making diffusion itself solve multimodal understanding, they use a frozen MLLM to parse instructions and references, then inject those semantics into a pretrained text-to-video diffusion model (Yang et al., 9 Feb 2026, Liu et al., 10 Feb 2026). Omni-Video 2 reports 73.53 FiVE-Acc on FiVE-Bench and 84.69 total score on VBench, while Tele-Omni emphasizes flexible multimodal control across text-to-video, image-to-video, first-last-frame generation, in-context generation, and in-context editing within a single model (Yang et al., 9 Feb 2026, Liu et al., 10 Feb 2026).

5. Domain-specific interpretations

Outside general multimodal generation, omni-diffusion has been adapted to highly specialized domains where omni denotes completeness over task viewpoints rather than over modalities. DiffOSeg is exemplary: it is presented as an “omni medical image segmentation” framework because it jointly models a consensus-driven population view and a preference-driven expert view (Zhang et al., 17 Jul 2025). Stage I learns a probabilistic consensus over expert annotations using a categorical diffusion model; Stage II adapts that prior to specific annotators through adaptive prompts. On LIDC-IDRI, DiffOSeg Stage I achieves $GED_{30}=0.0773$ and $D^s_{30}=92.20$, while Stage II reports $D_{mean}=90.99$; on NPC-170 it improves $GED_{30}$ to 0.1822 relative to D-Persona’s 0.2385 (Zhang et al., 17 Jul 2025).
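The GED figures are easier to interpret with the definition in view. The form below is the generalized energy distance as it is commonly used in the multi-annotator segmentation literature. It is given on the assumption that DiffOSeg follows this convention and that the subscript 30 denotes the number of sampled segmentations; neither is confirmed by the text here.

```latex
\mathrm{GED}^2(P_{\mathrm{gt}}, P_{\mathrm{out}})
  = 2\,\mathbb{E}\big[d(S, Y)\big]
  - \mathbb{E}\big[d(S, S')\big]
  - \mathbb{E}\big[d(Y, Y')\big],
\qquad d(A, B) = 1 - \mathrm{IoU}(A, B)
```

Here $S, S'$ are independent samples from the model and $Y, Y'$ are independent draws from the expert annotations. Lower values are better: the first term rewards agreement with the annotators, while the second term penalizes a model whose samples lack the diversity the annotators show.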

OmniDiT is another domain-specific interpretation. It is an omni virtual try-on framework based on a diffusion transformer that unifies model-based VTON, model-free VTON, and virtual try-off within one latent flow-matching model (Zeng et al., 20 Mar 2026). The model uses token concatenation, adaptive position encoding for multiple reference images, shifted window attention, multiple timestep prediction, and an alignment loss on garment regions. It is supported by the Omni-TryOn dataset with over 380k garment-model-tryon image pairs. On VITON-HD model-based VTON it reports FID 6.4564 and SSIM 0.8838; on DressCode model-free VTON it reports FID 9.6294 and SSIM 0.7758; on DressCode try-off it reports FID 7.5019 and CLIP-I 0.9365 (Zeng et al., 20 Mar 2026).

VideoDPO shifts omni-diffusion from model architecture to alignment. Its “omni-preference” signal combines imaging quality, aesthetic quality, subject consistency, temporal flickering, motion smoothness, dynamic degree, and text-video semantic alignment into OmniScore (Liu et al., 2024). Using automatically generated best-versus-worst preference pairs and pair re-weighting, it improves VideoCrafter2 from 80.44 to 81.93 on VBench total, from 82.20 to 83.07 on VBench quality, and from 73.42 to 77.38 on VBench semantics (Liu et al., 2024).
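The pair-construction step can be sketched as follows. The dimension list comes from the description above, but the equal weighting, the assumption that every dimension is already normalized so that higher is better, and the function names are illustrative, not VideoDPO's actual OmniScore formula. Pair re-weighting is not modeled.

```python
DIMENSIONS = [
    "imaging_quality", "aesthetic_quality", "subject_consistency",
    "temporal_flickering", "motion_smoothness", "dynamic_degree",
    "semantic_alignment",
]

def omni_score(scores, weights=None):
    """Weighted mean over all dimensions (equal weights are an assumption)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def best_worst_pair(candidates):
    """Among videos sampled for one prompt, return (preferred, rejected)."""
    ranked = sorted(candidates, key=lambda c: omni_score(c["scores"]))
    return ranked[-1], ranked[0]

# Three hypothetical samples for the same prompt, with made-up scores.
candidates = [
    {"id": "v0", "scores": {d: 0.2 for d in DIMENSIONS}},
    {"id": "v1", "scores": {d: 0.9 for d in DIMENSIONS}},
    {"id": "v2", "scores": {d: 0.5 for d in DIMENSIONS}},
]
best, worst = best_worst_pair(candidates)
```

Taking the extremes rather than arbitrary pairs maximizes the score gap within each prompt, which is the point of the best-versus-worst construction.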

Ambient Diffusion Omni uses omni in yet another sense: training diffusion models with “all available images,” including low-quality, synthetic, and out-of-distribution examples, by deciding when each sample is usable along the diffusion trajectory (Daras et al., 10 Jun 2025). Its theory exploits Gaussian smoothing, spectral power law decay, and locality. On ImageNet-512, Ambient-o-XXL+crops reports test FID 2.53 with classifier-free guidance, compared with EDM2-XXL at 2.73; on zero-shot COCO text-to-image generation it improves FID-30K from 12.37 to 10.61 (Daras et al., 10 Jun 2025). Here omni-diffusion refers to data utilization across heterogeneous image sources rather than to multimodal I/O.

A closely related component-level contribution is OD-VAE, which introduces omni-dimensional video compression by jointly compressing time and space before latent video diffusion (Chen et al., 2024). Its selected setting uses temporal compression $4\times$ and spatial compression $8\times$; when plugged into Latte, it reduces denoiser memory from about 74GB to about 30GB and improves training speed to 1.80 it/s, compared with 0.87 it/s for SD-VAE and SVD-VAE (Chen et al., 2024). Although OD-VAE is not itself an omni-diffusion model, it shows that the omni idea can also enter at the latent interface.
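A rough sense of what these factors mean for the denoiser's workload can be had from simple shape arithmetic. The sketch below assumes the spatial factor applies to both height and width and that all dimensions divide evenly. How OD-VAE treats the first frame, padding, or channel counts is not covered here, and the function names are hypothetical.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Latent (frames, height, width) after temporal and spatial downsampling."""
    return (frames // t_factor, height // s_factor, width // s_factor)

def position_reduction(frames, height, width, t_factor=4, s_factor=8):
    """Ratio of input positions to latent positions (channels ignored)."""
    lf, lh, lw = latent_shape(frames, height, width, t_factor, s_factor)
    return (frames * height * width) / (lf * lh * lw)

shape = latent_shape(16, 256, 256)
reduction = position_reduction(16, 256, 256)

# Spatial-only compression (t_factor=1) for comparison.
spatial_only = position_reduction(16, 256, 256, t_factor=1)
```

For a 16-frame 256×256 clip this gives a 4×32×32 latent grid, 256 times fewer positions than the input, versus 64 times fewer when only space is compressed. That factor-of-four difference in sequence length is consistent in direction with the memory and speed gains reported above, though the exact numbers depend on the denoiser.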

6. Data, evaluation, systems support, and open problems

Omni-diffusion research is unusually dependent on data curation and benchmark design. DreamVideo-Omni builds a curated 2.12M-video dataset and introduces DreamOmni Bench with 1,027 held-out real-world videos for evaluating identity preservation and omni-motion control (Wei et al., 12 Mar 2026). OmniVDiff trains on 400K pseudo-paired multi-modal videos with depth from Video Depth Anything, segmentation from SemanticSAM plus SAM2, and canny edges from OpenCV (Xi et al., 15 Apr 2025). OmniDiT constructs Omni-TryOn through a self-evolving data pipeline with VLM filtering and synthetic pair generation (Zeng et al., 20 Mar 2026). Omni-Diffusion adds the SDVI dataset for spoken visual QA and speech-to-image generation (Li et al., 6 Mar 2026). These choices show that unification typically requires not only one model but also one task-normalized data schema.

Evaluation is correspondingly heterogeneous. Unified video systems report FVD, KVD, CLIP-T, region-based identity metrics, mIoU, EPE, and FiVE-Acc (Xi et al., 15 Apr 2025, Wei et al., 12 Mar 2026, Yang et al., 9 Feb 2026). Omnimodal LLMs report MME-P, VideoMME, GenEval, DPG-Bench, WER, and task-specific reasoning benchmarks such as GSM8K and MATH (Kim et al., 9 Mar 2026, You et al., 1 Mar 2026, Li et al., 6 Mar 2026). Domain-specific models rely on Dice, GED, LPIPS, CLIP-I, and garment-specific metrics (Zhang et al., 17 Jul 2025, Zeng et al., 20 Mar 2026). This diversity suggests that omni-diffusion is still evaluated as a federation of task families rather than through a single universal benchmark.

Serving and systems work have also begun to adapt. vLLM-Omni treats any-to-any multimodal models as stage graphs composed of autoregressive LLMs, diffusion transformers, vocoders, and encoders, with fully disaggregated serving across stages (Yin et al., 2 Feb 2026). It reports job completion time reductions of up to 91.4% and exposes a broader infrastructure point: once omni models become multi-stage programs rather than single decode loops, serving abstractions become part of the research problem (Yin et al., 2 Feb 2026).
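The stage-graph view can be illustrated with a small dependency-ordered executor. This is not vLLM-Omni's API: the stage names, the `run_stage_graph` helper, and the string-returning stand-ins for models are all hypothetical, and the sketch runs everything in one process, so it shows only the graph abstraction, not disaggregated serving.

```python
from graphlib import TopologicalSorter

def run_stage_graph(stages, deps, request):
    """Run each stage after its upstream stages have produced output.

    stages: stage name -> callable taking a dict of upstream outputs
    deps:   stage name -> set of upstream names ("request" is the graph input)
    """
    outputs = {"request": request}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        if name not in stages:  # the "request" source node has no callable
            continue
        upstream = {d: outputs[d] for d in deps.get(name, ())}
        outputs[name] = stages[name](upstream)
    return outputs, order

# Hypothetical any-to-any pipeline: encoder -> LLM -> {diffusion transformer, vocoder}.
stages = {
    "encoder": lambda up: f"enc({up['request']})",
    "llm": lambda up: f"llm({up['encoder']})",
    "dit": lambda up: f"dit({up['llm']})",
    "vocoder": lambda up: f"voc({up['llm']})",
}
deps = {
    "encoder": {"request"},
    "llm": {"encoder"},
    "dit": {"llm"},
    "vocoder": {"llm"},
}
outputs, order = run_stage_graph(stages, deps, "prompt")
```

In this toy graph the `dit` and `vocoder` stages depend only on `llm`, so a real system could schedule them on separate workers and batch each independently, which is the kind of opportunity a single decode loop does not expose.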

The literature also makes its limitations explicit. Reward optimization in DreamVideo-Omni is fragile and can cause reward hacking if the LIReFL weight is too large (Wei et al., 12 Mar 2026). OmniVDiff depends on pseudo labels and remains limited to four visual modalities (Xi et al., 15 Apr 2025). DiffOSeg assumes known expert identities and leaves unseen-expert generalization for future work (Zhang et al., 17 Jul 2025). Dynin-Omni supports video understanding but not video generation, and its reasoning quality depends on high diffusion step counts (Kim et al., 9 Mar 2026). LLaDA-o remains weaker than strong autoregressive unified models on some understanding tasks and incurs high training cost (You et al., 1 Mar 2026). Tele-Omni’s reported evidence is largely qualitative, and Omni-Video 2 requires very large-scale training infrastructure (Liu et al., 10 Feb 2026, Yang et al., 9 Feb 2026).

A plausible implication is that the field has converged on the usefulness of diffusion as a unifying principle, but not yet on a single recipe for unification. Shared discrete token spaces, decoupled diffusion experts, adaptive condition roles, MLLM-to-diffusion bridges, omni-preference objectives, and data-centric ambient training all instantiate different answers to the same question: how much heterogeneity can one diffusion framework absorb before specialization becomes unavoidable.