
UniMo: Unified Multimodal Models

Updated 10 December 2025
  • UniMo is a unified framework that integrates vision, language, audio, motion, and biomedical modalities into a shared embedding space for cross-modal representation.
  • Innovative architectures like multi-branch transformers and token pruning enable efficient modality alignment and handling of incomplete modality inputs.
  • Robust pre-training strategies and loss functions, including contrastive and modality-consistency losses, improve performance and mitigate modality imbalance.

UniMo refers to a family of frameworks and models that aim to unify disparate data modalities—encompassing vision, language, audio, motion, and biomedical domains—within shared architectures. The term appears in multiple influential works addressing cross-modal representation, generation, correction, and molecular design. Models under the UniMo umbrella are characterized by innovations in shared embedding spaces, modality completion, multimodal sequence modeling, and generalization across heterogeneous input types.

1. Unified Multi-Modal Representation and Embedding

UniMo models fundamentally target the creation of shared spaces for information across multiple modalities. Early incarnations, such as UNIMO (Li et al., 2020), introduced a unified transformer over text-only, image-only, and paired image-text inputs, optimizing both modality-specific objectives (masked language modeling, masked image modeling, Seq2Seq) and a cross-modal contrastive InfoNCE objective within a single architecture. This paradigm was further enriched by grounded dictionary mechanisms and multi-branch transformers in subsequent iterations (e.g., UNIMO-2 (Li et al., 2022)), enabling joint alignment and representation via a shared grounded space and comprehensive contrastive learning.
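As a concrete reference point, the cross-modal contrastive term in such objectives is typically a symmetric InfoNCE loss over batch-paired embeddings. The following is a minimal PyTorch sketch of that term only; the tensor names and the temperature value are illustrative rather than taken from any specific UniMo implementation.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a positive
    pair, and all other rows in the batch serve as in-batch negatives.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (image-to-text and text-to-image) and average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```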

More recent approaches, notably UniMoCo (Qin et al., 17 May 2025), address the challenge of incomplete modality combinations—cases where queries or targets lack one or more modalities—by incorporating a modality-completion module. This module synthesizes missing modalities (e.g., generating visual features from textual descriptions), ensuring that all training examples can be processed in complete or completed form. Embedding-space consistency is enforced via dual losses:

  • Classic contrastive InfoNCE pulls matching pairs together and pushes negatives apart.
  • An auxiliary modality-alignment loss explicitly aligns embeddings between completed and fully multi-modal instances.

This design threads together all modality combinations into a unified $\mathbb{R}^d$-valued embedding space, robust to both missing modalities and imbalanced training distributions.
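To make the data flow concrete, the sketch below shows one way a modality-completion module and the auxiliary alignment term could fit together: a text-derived stand-in replaces a missing visual input, and a distance term pulls completed embeddings toward their fully multimodal counterparts. All module names are hypothetical, and the mean-squared distance here merely stands in for the alignment loss actually used in UniMoCo.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class CompletionEncoder(nn.Module):
    """Data-flow sketch of modality completion feeding a shared embedding space.

    All modules are stand-ins: `text_proj` / `image_proj` for real encoder
    towers, `completer` for the module that synthesizes missing visual
    features from text, `fuse` for the joint projector.
    """

    def __init__(self, text_dim: int, image_dim: int, dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.image_proj = nn.Linear(image_dim, dim)
        self.completer = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feats: torch.Tensor,
                image_feats: Optional[torch.Tensor] = None) -> torch.Tensor:
        t = self.text_proj(text_feats)
        # Complete the visual side from text when the image modality is missing.
        v = self.image_proj(image_feats) if image_feats is not None else self.completer(t)
        return F.normalize(self.fuse(torch.cat([t, v], dim=-1)), dim=-1)

def modality_alignment_loss(completed_emb: torch.Tensor,
                            complete_emb: torch.Tensor) -> torch.Tensor:
    """Pull embeddings of completed samples toward their fully multimodal
    counterparts; a simple mean-squared distance stands in for the loss
    actually used in the paper."""
    return F.mse_loss(completed_emb, complete_emb)
```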

2. Model Architectures and Modal Bridging Mechanisms

Multiple UniMo variants propose distinct architectures while sharing principles of modality unification:

  • Transformer-Driven Architectures: UNIMO-2 and UNIMO-3 (2305.13697) employ multi-branch transformers, cross-layer gating, and attention-based fusion to enable adaptive, fine-grained mixing of representations from text and vision encoders, moving beyond fixed single-layer attention.
  • Autoregressive Token Unification: UniMo for video and 3D human motion (Pang et al., 3 Dec 2025) tokenizes both 2D video (using the Cosmos tokenizer) and 3D SMPL-X motion (quantized via VQ-VAE), then interleaves these tokens for joint autoregressive modeling with separate embedding tables and position encoding schemes (absolute and rotary). Task tokens and mode-switch markers manage I2VM and V2M regimes.
  • Token Pruning for Efficiency: UniMoD (Mao et al., 10 Feb 2025) introduces task-aware Mixture-of-Depths (MoD) blocks into unified transformers, using per-task routers to prune redundant tokens in multimodal streams, substantially reducing training FLOPs while preserving (or even improving) benchmark accuracy.

A key element in several of these systems is the combination of modality-specific embedding layers, shared projector or fusion modules, and specialized routing mechanisms that both align and disentangle modality-specific distributions.
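As an illustration of the routing idea, the sketch below implements a generic Mixture-of-Depths-style, task-aware token router in PyTorch: a lightweight scorer ranks tokens, only a task-dependent fraction passes through an expensive block, and the rest skip it unchanged. The class interface, the capacity table, and the omission of score-weighted outputs are simplifications, not UniMoD's exact design.

```python
from typing import Dict

import torch
import torch.nn as nn

class TaskAwareTokenRouter(nn.Module):
    """Mixture-of-Depths-style, task-aware token router (illustrative sketch)."""

    def __init__(self, dim: int, capacities: Dict[str, float]):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # per-token routing score
        self.capacities = capacities      # e.g. {"understanding": 0.5, "generation": 0.9}

    def forward(self, tokens: torch.Tensor, block: nn.Module, task: str) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        num_tokens = tokens.size(1)
        keep = max(1, int(self.capacities[task] * num_tokens))

        scores = self.scorer(tokens).squeeze(-1)                  # (batch, num_tokens)
        top_idx = scores.topk(keep, dim=1).indices                # tokens routed into the block
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))

        selected = torch.gather(tokens, 1, gather_idx)            # (batch, keep, dim)
        processed = block(selected)                               # heavy computation on the subset only
        # Real MoD routers also weight `processed` by the scores so the scorer
        # receives gradients; that detail is omitted here for brevity.

        out = tokens.clone()                                      # pruned tokens pass through unchanged
        out.scatter_(1, gather_idx, processed)
        return out
```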

3. Pre-Training Strategies, Loss Functions, and Training Objectives

UniMo models operationalize unified learning through a combination of in-modal and cross-modal pre-training losses:

  • Contrastive Alignment: InfoNCE-style losses on pairs (and hard negatives), as in UNIMO and UniMoCo, pull the modalities into close semantic correspondence.
  • Masked/Corrupted Modality Modeling: Masked language or image modeling losses promote modality-specific representation.
  • Auxiliary Modality-Consistency: UniMoCo enforces embedding invariance between real and completed modalities via explicit cross-entropy alignment of embeddings.
  • Grounded Dictionary/Attention: Grounded learning forces both visual and textual features to attend to a set of salient dictionary tokens, ensuring bridging across unpaired data.
  • Instruction Tuning & Visual-Enhanced Losses: Variants like UNIMO-G (Li et al., 24 Jan 2024) use instruction-tuned diffusion U-Nets, with additional spatial attention losses aligning cross-attention maps with entity masks for subject-driven image synthesis.

Curriculum strategies, batch composition, and negative sampling are tuned to integrate large unimodal and paired corpora, maximizing the synergy from diverse data.
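Putting these pieces together, a unified pre-training step typically applies each objective only when the current batch carries the data it needs and combines the results as a weighted sum. The function below is a hedged sketch of that dispatch logic; the batch keys, term names, and weights are placeholders rather than any paper's published recipe.

```python
from typing import Callable, Dict

import torch

def unified_pretraining_step(batch: Dict[str, torch.Tensor],
                             loss_fns: Dict[str, Callable[..., torch.Tensor]],
                             weights: Dict[str, float]) -> torch.Tensor:
    """Apply each in-modal / cross-modal objective only when the batch carries
    the data it needs, then combine the terms as a weighted sum.

    The callables in `loss_fns` (masked-modeling, contrastive, and alignment
    losses) are supplied by the caller; key names here are illustrative.
    """
    terms = []
    if "text_tokens" in batch:        # unimodal text corpus: masked language modeling
        terms.append(weights["mlm"] * loss_fns["mlm"](batch["text_tokens"]))
    if "image_patches" in batch:      # unimodal image corpus: masked image modeling
        terms.append(weights["mim"] * loss_fns["mim"](batch["image_patches"]))
    if "image_emb" in batch and "text_emb" in batch:           # paired data: contrastive alignment
        terms.append(weights["contrastive"] * loss_fns["contrastive"](batch["image_emb"], batch["text_emb"]))
    if "completed_emb" in batch and "complete_emb" in batch:   # completed vs. fully multimodal samples
        terms.append(weights["align"] * loss_fns["align"](batch["completed_emb"], batch["complete_emb"]))
    return torch.stack(terms).sum()   # assumes each term is a scalar tensor
```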

4. Task Domains and Empirical Performance

UniMo frameworks have been deployed across:

  • Vision-Language: VQA, classification, retrieval, captioning, visual grounding (e.g., UniMoCo achieves state-of-the-art Precision@1 on MMEB (Qin et al., 17 May 2025); UNIMO-3 leads in VQAv2 and textual GLUE tasks (2305.13697)).
  • Video and 3D Motion: Joint autoregressive generation and understanding of synchronized video and 3D human motion, with strong improvements in MPJPE, PA-MPJPE, VBench and Motion FID compared to baselines (Pang et al., 3 Dec 2025).
  • Medical Imaging: Universal motion correction for 3D imaging (e.g., fetal fMRI, CT)—one-time training without per-modality retraining—with state-of-the-art error rates across four imaging types (Wang et al., 21 Sep 2024).
  • Molecular Design: Cross-domain 3D binder generation (peptide, antibody, small molecule) via block-graph representations, equivariant diffusion, and multi-domain training yielding superior accuracy, recovery, and energy metrics (Kong et al., 25 Mar 2025).
  • Unified Multimodal Generation/Understanding: Via Mixture-of-Experts, token pruning, and rotary positional encodings, as in Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025) and UniMoD, supporting omnimodal inputs and efficient scaling.

Results consistently show that unified modalities improve performance, robustness to missing/incomplete modalities, and generalization across tasks, with concrete improvements over split or unimodal architectures in both single- and multimodal settings.

| Model | Domain(s) | Core Innovation | Key Metrics/Benchmarks |
|---|---|---|---|
| UNIMO | Vision-language | Cross-modal contrastive | VQA, Flickr, GLUE |
| UNIMO-2/3 | Vision-language | Grounded space, gating | Retrieval, VQA, GLUE |
| UniMoCo | Multi-modal embedding | Modality completion, L₂ aux | MMEB (36 tasks), ablation |
| UNIMO-G | Text/subject→image generation | Multimodal prompts, VELoss | MSCOCO FID, DreamBench |
| UniMo (2512) | Video & 3D motion | Joint AR, VQ-VAE, expansion | Human4DiT, VBench, MPJPE |
| UniMo (2409) | Medical motion correction | Joint shape/image, equivariant CNN | Fetal MRI/CT, multi-modality |
| UniMoD | General multimodal | Task-aware MoD pruning | FLOPs, GenEval, GQA |
| UniMoMo | 3D molecule design | Block-graph, E(3) diffusion | PepBench, CBGBench, RAbD |
| Uni-MoE-2.0 | Language-centric omnimodal | Advanced MoE, 3D RoPE | 85 benchmarks, video, ASR |

5. Bias, Robustness, and Modality Imbalance

Modality imbalance—where certain modality combinations dominate training data—can induce significant inference bias in conventional models. UniMoCo directly quantifies and mitigates this by constructing balanced complete/completed samples and enforcing matching in embedding space (variation under 2–3 points versus ∼20-point swings for baselines) (Qin et al., 17 May 2025). Grounding mechanisms and unified sequence structures (in both embedding and generative models) further suppress degenerate solutions and catastrophic forgetting of underrepresented modalities.
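One simple way to realize the balanced-sample construction described above is to bucket training examples by their modality combination and draw roughly equally from each bucket when forming batches. The generator below is an illustrative sketch under that assumption; the `modalities` field and the uniform per-combination quota are hypothetical, not UniMoCo's actual sampler.

```python
import random
from collections import defaultdict

def balanced_modality_batches(examples, batch_size, seed=0):
    """Yield batches drawn (roughly) equally from every modality combination,
    e.g. text-only, image-only, and text+image, so that dominant combinations
    do not overwhelm training."""
    rng = random.Random(seed)

    # Bucket examples by the (sorted) set of modalities they contain.
    buckets = defaultdict(list)
    for example in examples:
        buckets[tuple(sorted(example["modalities"]))].append(example)

    combos = list(buckets)
    per_combo = max(1, batch_size // len(combos))
    while True:
        batch = []
        for combo in combos:
            batch.extend(rng.choices(buckets[combo], k=per_combo))
        rng.shuffle(batch)
        yield batch[:batch_size]
```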

Robustness analyses show that performance across modality combinations equilibrates when explicit modality-completion and cross-modal alignment losses are present. Additionally, methods like UniMo (medical imaging) demonstrate robustness under diverse deformations due to the inclusion of shape-based augmentation and equivariant filter banks, while token pruning (UniMoD) preserves accuracy on generation and understanding tasks even under aggressive reductions in compute.

6. Limitations and Extensions

Common technical limitations include restriction to specific entity types (e.g., UniMo for video/motion lacks explicit hand and facial VQ-VAEs (Pang et al., 3 Dec 2025)), requirement for auxiliary supervision (UniMo medical needs either segmentations or distance transforms (Wang et al., 21 Sep 2024)), and single-person or single-domain focus. Extending to multi-person settings, integrating more granular or continuous latent representations (e.g., for hand/face or finer mesh detail), and further exploring tri-modal and quad-modal synthesis (text-audio-video-motion) are active future directions across the UniMo landscape.

A recurring theme is the trade-off between unified parameterization (maximal cross-modal transfer) and modality-specific innovations (e.g., expert decoders, per-modality tokenization, dedicated routers). Empirical ablations consistently indicate that hybrid architectures that combine modality-adaptive modules with robust global alignment achieve the strongest and most general cross-domain performance.
