UniMoCo: Unified Multi-modal & Contrastive Learning
- UniMoCo is a unified framework that integrates modality completion to generate robust vision–language embeddings by compensating for missing modalities.
- The architecture combines a vision encoder, language model, and a lightweight text-to-image module to ensure consistent embedding alignment across diverse modality pairings.
- A separate, earlier framework of the same name extends momentum contrast to supervised, semi-supervised, and self-supervised regimes, with linear-probe accuracy that matches supervised cross-entropy baselines when all labels are available.
UniMoCo refers to two distinct but thematically related model frameworks addressing (1) robust multi-modal embeddings in vision–language tasks (Qin et al., 17 May 2025), and (2) unified contrastive visual representation learning across supervised, semi-supervised, and self-supervised regimes (Dai et al., 2021). Both lines share a unification principle and address challenges in balancing diverse data annotation scenarios for contrastive representation learning.
1. Multi-Modal UniMoCo: Unified Modality Completion for Vision–Language Embeddings
Definition and Motivation
The UniMoCo framework introduced by Qin et al. (2025) targets robust multi-modal embeddings in scenarios where queries and targets may contain diverse combinations of modalities (text, image, or text+image) across tasks such as retrieval, classification, and visual grounding. Conventional vision–language models (VLMs) struggle with heterogeneous modality pairings, and performance often drops on minority or out-of-distribution (OOD) modality combinations at inference. UniMoCo addresses this by enforcing modality completeness and embedding consistency regardless of the original query–target pairing (Qin et al., 17 May 2025).
2. Model Architecture and Modality-Completion
UniMoCo builds on a backbone large vision–language model (LVLM), such as Phi-3.5V or Qwen2-VL-7B, which handles both image patch tokens and text tokens. The key architectural innovation is the modality-completion module:
- Vision Encoder: Transforms the input image into patch embeddings.
- LLM: Processes the concatenated image and text tokens and produces an output embedding.
- Modality-Completion Module: Activated when image data is missing. A lightweight text-to-image (T2I) model (e.g., Phi-1.5B or Qwen2-1.5B) maps the text into a pseudo-visual token sequence, which an auxiliary encoder then refines.
- Projection Head: Maps the LLM output embedding into the final embedding space.
For missing images, the text is padded and passed through the T2I model and the auxiliary encoder, and the resulting tokens are fed to the LVLM in place of real visual tokens.
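The routing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `visual_tokens` function, and the three toy stand-in components are hypothetical names introduced here, and the text-padding step is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Tokens = List[List[float]]  # a sequence of token vectors


@dataclass
class Sample:
    """Hypothetical container: text is always present, the image may be missing."""
    text: str
    image: Optional[Sequence[float]] = None


def visual_tokens(
    sample: Sample,
    vision_encoder: Callable[[Sequence[float]], Tokens],
    t2i_model: Callable[[str], Tokens],
    aux_encoder: Callable[[Tokens], Tokens],
) -> Tokens:
    """Return real patch tokens if an image exists, else pseudo-visual tokens."""
    if sample.image is not None:
        return vision_encoder(sample.image)
    # Modality completion: text -> pseudo-visual tokens -> refined tokens.
    return aux_encoder(t2i_model(sample.text))


# Toy stand-ins (assumptions) so the sketch runs without any model weights.
def toy_vision_encoder(image: Sequence[float]) -> Tokens:
    # Split a flat pixel list into 2-dimensional "patch" vectors.
    return [list(image[i:i + 2]) for i in range(0, len(image), 2)]


def toy_t2i(text: str) -> Tokens:
    # One pseudo-visual token per word: [word length, vowel count].
    return [[float(len(w)), float(sum(c in "aeiou" for c in w))] for w in text.split()]


def toy_aux(tokens: Tokens) -> Tokens:
    # "Refine" each token by rescaling it.
    return [[0.5 * x for x in tok] for tok in tokens]


with_image = visual_tokens(
    Sample("a red car", image=[1.0, 2.0, 3.0, 4.0]), toy_vision_encoder, toy_t2i, toy_aux
)
text_only = visual_tokens(Sample("a red car"), toy_vision_encoder, toy_t2i, toy_aux)
```

In both cases the downstream LVLM receives a visual token sequence, which is what lets a single embedding path serve every modality combination.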
3. Training Regimen and Loss Functions
UniMoCo employs both a contrastive and an auxiliary loss to enforce embedding alignment among all modality pairings:
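- Contrastive InfoNCE Loss ($\mathcal{L}_{\mathrm{InfoNCE}}$): Standard cross-batch contrast between query and positive/negative candidate embeddings, using cosine similarity with temperature scaling.
- Auxiliary Completion Alignment Loss ($\mathcal{L}_{\mathrm{aux}}$): Symmetric cross-entropy alignment between embeddings of original multi-modal samples and their modality-completed (pseudo-visual) variants.
- Combined Objective: A weighted sum of the two losses, $\mathcal{L} = \mathcal{L}_{\mathrm{InfoNCE}} + \lambda\,\mathcal{L}_{\mathrm{aux}}$, with the auxiliary weight $\lambda$ set to the value that performed best in experiments.
All embeddings are $\ell_2$-normalized prior to similarity computation, and the temperature $\tau$ controls the softness of the InfoNCE distribution.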
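The two losses and their combination can be sketched in plain Python as below. This is an illustrative reading of the description above, not the released training code: the function names are hypothetical, other items in the batch are assumed to serve as negatives, and the temperature and auxiliary weight used in the example are placeholder values rather than the paper's settings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(a: Sequence[Vector], b: Sequence[Vector], temperature: float) -> List[List[float]]:
    """Temperature-scaled cosine similarity between every row of a and every row of b."""
    a_n = [l2_normalize(v) for v in a]
    b_n = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b_n] for ai in a_n]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(queries: Sequence[Vector], targets: Sequence[Vector], temperature: float) -> float:
    return diagonal_cross_entropy(cosine_logits(queries, targets, temperature))


def symmetric_alignment(original: Sequence[Vector], completed: Sequence[Vector], temperature: float) -> float:
    """Average of the original->completed and completed->original cross-entropies."""
    logits = cosine_logits(original, completed, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def combined_loss(queries, targets, original, completed, aux_weight: float, temperature: float) -> float:
    return info_nce(queries, targets, temperature) + aux_weight * symmetric_alignment(
        original, completed, temperature
    )


# Placeholder hyperparameters (assumptions for illustration only).
TAU, AUX_WEIGHT = 0.1, 0.5
queries = [[1.0, 0.0], [0.0, 2.0]]
matched = [[3.0, 0.0], [0.0, 1.0]]     # target i points the same way as query i
mismatched = [[0.0, 1.0], [3.0, 0.0]]  # targets swapped
loss_matched = info_nce(queries, matched, TAU)
loss_mismatched = info_nce(queries, mismatched, TAU)
total = combined_loss(queries, matched, queries, matched, AUX_WEIGHT, TAU)
```

Because the vectors are normalized before the dot product, only their directions matter, so the matched batch yields a much smaller loss than the mismatched one.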
4. Bias Quantification and Robustness
UniMoCo explicitly identifies and mitigates the modality combination imbalance problem: modality pairs that dominate training cause subpar performance on underrepresented combinations. Measured by the “∆Score” (the performance gap between dominant and rare modality arrangements), legacy models such as VLM2VEC showed values up to 43. In contrast, UniMoCo maintains near-uniform performance (variation of only 2–5 points) across all modality combinations under the same data skew, a result attributed to the modality completion and auxiliary alignment design.
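One plausible way to compute such a gap is sketched below. The exact definition used in the paper is not reproduced here, so treating ∆Score as the dominant combination's score minus the weakest other combination's score is an assumption, and the numbers are hypothetical values chosen only to exercise the function.

```python
from typing import Dict


def delta_score(scores: Dict[str, float], dominant: str) -> float:
    """Gap between the dominant combination and the weakest remaining one (assumed definition)."""
    others = [score for name, score in scores.items() if name != dominant]
    return scores[dominant] - min(others)


# Hypothetical per-combination scores, not results from the paper.
example_scores = {"T+I->T": 70.0, "T->T+I": 40.0, "T+I->T+I": 55.0}
gap = delta_score(example_scores, dominant="T+I->T")
```

A model that is robust to modality imbalance should keep this gap small even when one combination dominates the training data.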
5. Experimental Evaluation
UniMoCo was evaluated on 36 multi-modal tasks (classification, VQA, retrieval, grounding) drawn from the MMEB suite, split into in-distribution and OOD datasets. Precision@1 results were computed across all possible modality pair scenarios (T+I→T, T→T+I, T+I→T+I). The following summarizes the main results:
| Model | Overall | IND | OOD | (T+I→T) | (T→T+I) | (T+I→T+I) |
|---|---|---|---|---|---|---|
| CLIP (no FT) | 37.8 | 37.1 | 38.7 | 29.8 | 62.1 | 41.6 |
| UniIR (no FT) | 42.8 | 44.7 | 40.4 | 32.5 | 58.2 | 59.7 |
| UniMoCo (Phi-3.5V, FT) | 61.7 | 68.2 | 53.5 | 57.7 | 72.8 | 64.1 |
| UniMoCo (Qwen2-VL-7B, FT) | 63.2 | 67.0 | 58.4 | 59.6 | 73.6 | 65.1 |
Ablations showed that every component of the completion module is necessary and identified the best-performing settings for the T2I model scale and the auxiliary loss weight (Qin et al., 17 May 2025).
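The Precision@1 metric reported in the table above counts a query as correct when its top-ranked candidate is the ground-truth target. A minimal sketch follows; the function name and the similarity values are illustrative assumptions, not part of the benchmark code.

```python
from typing import Sequence


def precision_at_1(similarities: Sequence[Sequence[float]], positives: Sequence[int]) -> float:
    """Percentage of queries whose highest-scoring candidate is the labeled positive."""
    hits = 0
    for row, positive_index in zip(similarities, positives):
        top_index = max(range(len(row)), key=row.__getitem__)
        hits += int(top_index == positive_index)
    return 100.0 * hits / len(positives)


# Hypothetical query-to-candidate similarity scores (one row per query).
sims = [
    [0.9, 0.1, 0.3],  # top candidate 0, positive 0: hit
    [0.2, 0.4, 0.8],  # top candidate 2, positive 2: hit
    [0.5, 0.6, 0.1],  # top candidate 1, positive 0: miss
    [0.1, 0.7, 0.2],  # top candidate 1, positive 1: hit
]
p_at_1 = precision_at_1(sims, positives=[0, 2, 0, 1])
```

Computing this separately for each modality pairing gives per-combination columns like those in the table.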
6. Supervised, Semi-Supervised, and Self-Supervised UniMoCo for Visual Representation
In a separate line, UniMoCo (2021) extends Momentum Contrast (MoCo) to handle arbitrary proportions of labeled and unlabeled visual data within a unified pipeline (Dai et al., 2021):
- Foundation: Two encoders, query ($f_q$) and key ($f_k$), with the parameters of $f_k$ maintained as an exponential moving average of those of $f_q$.
- Positive Label Queue: In addition to a feature queue, introduces a label queue to match queries with all queue entries of the same class, enabling “multi-positive” pair formation even in semi-supervised settings.
- Unified Contrastive Loss (UniCon): Generalizes InfoNCE from a single positive per query to arbitrary sets of positives and negatives. Positive and negative pairs are handled in a pairwise fashion, so the same loss applies at any ratio of positives to negatives.
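The three ingredients above can be sketched together as follows. This is a hedged illustration: the momentum update is the standard MoCo rule, but the loss shown is one commonly used multi-positive generalization of InfoNCE and is not claimed to be the exact UniCon expression from the paper. All function names and values are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def momentum_update(query_params: Sequence[float], key_params: Sequence[float], momentum: float) -> List[float]:
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [momentum * k + (1.0 - momentum) * q for q, k in zip(query_params, key_params)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def multi_positive_loss(
    query: Sequence[float],
    own_key: Sequence[float],
    queue_feats: Sequence[Sequence[float]],
    queue_labels: Sequence[Optional[int]],
    label: Optional[int],
    temperature: float,
) -> float:
    """-log( sum over positives of exp(s/t) / sum over all keys of exp(s/t) ).

    The key from the query's own augmented view is always a positive. Queue
    entries count as positives only when the query is labeled and the label
    queue holds the same class; everything else is a negative.
    """
    pos = [dot(query, own_key) / temperature]
    neg = []
    for feat, queued_label in zip(queue_feats, queue_labels):
        logit = dot(query, feat) / temperature
        if label is not None and queued_label == label:
            pos.append(logit)
        else:
            neg.append(logit)
    m = max(pos + neg)  # subtract the max for numerical stability
    pos_sum = sum(math.exp(x - m) for x in pos)
    all_sum = pos_sum + sum(math.exp(x - m) for x in neg)
    return -math.log(pos_sum / all_sum)


# Toy feature queue with a parallel label queue (None marks an unlabeled entry).
queue_feats = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
queue_labels = [3, 5, None]
q, k = [1.0, 0.0], [1.0, 0.0]
labeled_loss = multi_positive_loss(q, k, queue_feats, queue_labels, label=3, temperature=1.0)
unlabeled_loss = multi_positive_loss(q, k, queue_feats, queue_labels, label=None, temperature=1.0)
new_key_params = momentum_update([1.0, 2.0], [0.0, 0.0], momentum=0.9)
```

With `label=None` the function reduces to ordinary single-positive InfoNCE, which is how one code path can cover self-supervised, semi-supervised, and supervised batches.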
UniMoCo is effective in fully self-supervised, any-ratio semi-supervised, and fully supervised regimes. For linear classifiers trained on frozen features, top-1 ImageNet accuracy rises from 71.1% (0% labels) to 76.4% (100% labels), matching supervised cross-entropy baselines. Mask R-CNN COCO AP and PASCAL VOC mAP rise or plateau with increasing label ratios (Dai et al., 2021).
7. Implementation and Availability
- Vision–Language UniMoCo: Public codebase at https://github.com/HobbitQia/UniMoCo. All completion, vision encoder, and LoRA modules are implemented using HuggingFace Transformers and Diffusers; the primary routine is in train_unimoco.py. Training uses 8 × NVIDIA A100 GPUs (BF16), and typical runs take 135–185 hours depending on model size.
- Contrastive UniMoCo: Follows the MoCo v2 recipe with dual ResNet-50 encoders, large feature and label queues, SGD training, and strong data augmentations (Dai et al., 2021).
UniMoCo unifies multi-modal embedding and labeled–unlabeled contrastive learning, providing robust solutions to modality and supervision imbalances in modern deep learning workflows (Qin et al., 17 May 2025, Dai et al., 2021).