Gemini Embedding 2: Unified Multimodal Model

Updated 28 May 2026
  • Gemini Embedding 2 is a unified multimodal embedding model that integrates video, audio, image, and text using a transformer backbone.
  • It employs large-scale contrastive learning with a multi-task, multi-stage training paradigm to excel in unimodal, cross-modal, and multimodal retrieval tasks.
  • The model achieves state-of-the-art performance on benchmarks and supports advanced applications like retrieval-augmented generation and cross-modal recommendation.

Gemini Embedding 2 is a native multimodal embedding model constructed on a transformer backbone, integrating video, audio, image, and text inputs into a unified representation space. Designed to support arbitrary interleavings of modalities, it employs large-scale contrastive learning in a multi-task, multi-stage framework and achieves state-of-the-art results on unimodal, cross-modal, and multimodal retrieval tasks. The model leverages the Gemini transformer’s cross-modal attention capabilities to generalize across a diverse array of benchmarks, supporting both standard and fine-grained information retrieval, recommendation, and Retrieval-Augmented Generation (RAG) use cases. Its robust zero-shot transfer performance in specialized domains highlights the effectiveness of its unified embedding architecture (Shanbhogue et al., 26 May 2026).

1. Model Architecture

Gemini Embedding 2 uses a multimodal transformer (“Gemini”) with bidirectional attention as its backbone. The architecture natively ingests text, images, audio, and video, as well as arbitrary interleavings of them, and produces embeddings in $\mathbb{R}^d$ with $d = 3{,}072$.

  • Tokenization and Modality Front-Ends:
    • Text is processed using Gemini’s tokenizer with standard subword tokenization.
    • Images are converted by a vision front-end into sequences of visual tokens.
    • Videos are sampled uniformly (up to 32 frames at 1 fps), with frames processed by the vision front-end and token sequences concatenated.
    • Audio is input as raw waveform and encoded via an audio front-end using convolutional subsampling and token embedding.
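The video sampling rule above ("up to 32 frames at 1 fps") can be sketched as follows. This is only one plausible reading, not the model's actual front-end: it assumes candidate frames are taken at 1 fps and, when a clip yields more than 32 candidates, 32 of them are kept at evenly spaced positions. The function name `sample_frame_times` and its parameters are hypothetical.

```python
import math


def sample_frame_times(duration_s: float, fps: float = 1.0, max_frames: int = 32) -> list[float]:
    """Return timestamps (in seconds) of the frames to keep.

    Assumption: frames are first taken at `fps`, then thinned uniformly
    if there are more than `max_frames` of them.
    """
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    if duration_s <= 0:
        return []
    # Candidate frames at the base rate: t = 0, 1/fps, 2/fps, ...
    n_candidates = max(1, math.ceil(duration_s * fps))
    if n_candidates <= max_frames:
        return [i / fps for i in range(n_candidates)]
    if max_frames == 1:
        return [0.0]
    # Too many candidates: keep max_frames of them, evenly spaced by index,
    # always including the first and the last candidate.
    step = (n_candidates - 1) / (max_frames - 1)
    return [round(i * step) / fps for i in range(max_frames)]


short_clip = sample_frame_times(10.0)   # every 1 fps frame is kept
long_clip = sample_frame_times(120.0)   # thinned down to 32 frames
```

Under this reading a 10-second clip keeps all ten 1 fps frames, while a two-minute clip is thinned to 32 frames spread across the whole clip.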

The token sequences for all modalities are concatenated into a single token sequence $\mathbf{T} \in \mathbb{Z}^L$. This sequence passes through the Gemini transformer $\mathcal{M}$, yielding token embeddings $\mathbf{T}_{\mathrm{embed}} = \mathcal{M}(\mathbf{T}) \in \mathbb{R}^{L \times d_{\mathcal{M}}}$. A mean pooler $\mathcal{P}$ computes

$$\mathbf{P}_{\mathrm{embed}} = \mathcal{P}(\mathbf{T}_{\mathrm{embed}}) = \frac{1}{L} \sum_{\ell=1}^{L} \mathbf{T}_{\mathrm{embed},\ell} \in \mathbb{R}^{d_{\mathcal{M}}}$$

and a randomly initialized linear projection $f : \mathbb{R}^{d_{\mathcal{M}}} \to \mathbb{R}^d$ maps the pooled vector to the final embedding $\mathbf{E} = f(\mathbf{P}_{\mathrm{embed}})$ with $d = 3{,}072$.
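The pooling and projection head described above is small enough to sketch directly in NumPy. The transformer itself is replaced by random stand-in token embeddings. The function name `embed_from_tokens`, the absence of a bias term, and the initialization scale are assumptions made for illustration only.

```python
import numpy as np


def embed_from_tokens(token_embeds: np.ndarray, proj_weight: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings and apply a linear projection.

    token_embeds: (L, d_model) output of the transformer for one input.
    proj_weight:  (d, d_model) projection matrix (no bias; an assumption).
    Returns the final embedding of shape (d,).
    """
    pooled = token_embeds.mean(axis=0)   # average over the L token positions
    return proj_weight @ pooled          # map d_model -> d


rng = np.random.default_rng(0)
L, d_model, d = 16, 2560, 3072           # sizes quoted in this article
tokens = rng.normal(size=(L, d_model))   # stand-in for the transformer output
W = rng.normal(scale=d_model ** -0.5, size=(d, d_model))  # random init (scale is an assumption)
E = embed_from_tokens(tokens, W)         # shape (3072,)
```

Because pooling is a plain mean over token positions, inputs of any length and any mix of modalities end up in the same fixed-size space.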

2. Multi-Task, Multi-Stage Training Paradigm

Gemini Embedding 2 is trained using large-scale, multi-task contrastive learning with several distinctive features:

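  • Noise Contrastive Objective: Each batch provides a query $q_i$, a positive $p_i^+$, and optionally a hard negative $p_i^-$. The per-example loss takes the in-batch softmax form

$$\mathcal{L}_i = -\log \frac{\exp\!\big(\mathrm{sim}(q_i, p_i^+)/\tau\big)}{\exp\!\big(\mathrm{sim}(q_i, p_i^-)/\tau\big) + \sum_{j=1}^{B} \mathrm{mask}(i,j)\, \exp\!\big(\mathrm{sim}(q_i, p_j^+)/\tau\big)}$$

where $\mathrm{sim}(\cdot,\cdot)$ is the similarity between two embeddings, $B$ is the batch size, and $\tau$ is the temperature. The hard-negative term is dropped when no hard negative is provided. The mask $\mathrm{mask}(i,j) \in \{0,1\}$ is 0 when $j \neq i$ and example $j$ shares its query or positive with example $i$, and 1 otherwise, which prevents identical queries or positives from being counted as negatives.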

  • Matryoshka Representation Learning (MRL): The contrastive loss is simultaneously applied to overlapping sub-vectors of the embedding (first 768, 1,536, and all 3,072 dimensions), supporting multiple output dimensionalities.
  • Training Stages:
    • Pre-Fine-Tuning (PFT): Converts autoregressive Gemini into an encoder using large, noisy, multi-task data (text, code, images), with batch sizes […].
    • Fine-Tuning (FT): Adds tasks including text, document, code, image, audio, and video, tuning batch sizes (256–2,048) and sampling rates for modality balance.
    • Model Soup: Conducts a (weighted or unweighted) average of multiple FT checkpoints to improve generalization.
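The contrastive objective and the Matryoshka variant listed above can be illustrated with a small NumPy sketch. It is not the training code. Cosine similarity, integer IDs for detecting duplicate queries or positives, and an unweighted sum over the Matryoshka prefixes are all assumptions, and the names `contrastive_loss` and `mrl_loss` are hypothetical.

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrastive_loss(q, p, q_ids, p_ids, tau=0.07, hard_neg=None) -> float:
    """Mean in-batch softmax loss with duplicate masking.

    q, p: (B, d) query / positive embeddings; hard_neg: optional (B, d).
    q_ids, p_ids: (B,) integer IDs used to spot identical queries/positives.
    """
    q, p = _normalize(q), _normalize(p)
    B = q.shape[0]
    logits = (q @ p.T) / tau                       # cosine similarity / temperature
    # Drop in-batch "negatives" that share a query or a positive with row i,
    # but always keep the diagonal, which holds the true positive.
    same = (q_ids[:, None] == q_ids[None, :]) | (p_ids[:, None] == p_ids[None, :])
    keep = ~same | np.eye(B, dtype=bool)
    logits = np.where(keep, logits, -np.inf)
    if hard_neg is not None:                       # one extra column per row
        hn = np.sum(q * _normalize(hard_neg), axis=1, keepdims=True) / tau
        logits = np.concatenate([logits, hn], axis=1)
    # Numerically stable log-sum-exp over each row (the denominator).
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    pos = logits[np.arange(B), np.arange(B)]       # numerator terms
    return float(np.mean(log_denom - pos))


def mrl_loss(q, p, q_ids, p_ids, dims=(768, 1536, 3072), tau=0.07, hard_neg=None) -> float:
    """Sum the loss over nested prefixes of the embedding (an unweighted sum is assumed)."""
    total = 0.0
    for k in dims:
        hn = None if hard_neg is None else hard_neg[:, :k]
        total += contrastive_loss(q[:, :k], p[:, :k], q_ids, p_ids, tau, hn)
    return total


rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 3072))
positives = queries + 0.1 * rng.normal(size=(8, 3072))   # noisy copies as positives
ids = np.arange(8)
loss = mrl_loss(queries, positives, ids, ids)
```

Applying the same loss to the first 768 and 1,536 coordinates is what lets a truncated embedding remain usable on its own at retrieval time.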

3. Hyperparameter Choices and Data Composition

Key parameters adopted in Gemini Embedding 2 include:

| Parameter | Value/Description |
|---|---|
| Embedding dimension ($d$) | 3,072 (MRL at 768 and 1,536 supported) |
| Transformer hidden size ($d_{\mathcal{M}}$) | 2,560 (from Gemini) |
| Temperature ($\tau$) | 0.07 |
| Optimizer | AdamW, […], […], weight decay = 0.01 |
| Learning rates | PFT: […] (warm-up) to […]; FT: […] to […] |
| Data scale | Trillions of examples, >50 tasks |

Data sources span COCO, Flickr30k, DOCCI, TextCaps (image–text), VATEX, MSR-VTT, YouCook2 (video–text), multiple speech corpora (audio–text), ViDoRe V2 (document retrieval), synthetic code pairs, and text-only tasks from MMTEB and MTEB Code.

4. Benchmark Performance

Gemini Embedding 2 leads on most unimodal, cross-modal, and multimodal retrieval tasks, as summarized in the following table (selected tasks; the metric is listed per row):

| Task/Modality | Metric | Gemini Embedding 2 | Best Comparator |
|---|---|---|---|
| Image→Image (ImageNet) | R@1 | 83.6 | 71.8 (Legacy) |
| Text→Image (MSCOCO) | R@1 | 62.9 | 58.1 (Voyage) |
| Text→Video (VATEX) | NDCG@10 | 68.8 | 60.3 (Nova) |
| Image+Text→Text (EncycloVQA) | R@20 | 71.5 | 58.6 (Voyage) |
| Document Retrieval (ViDoRe V2) | NDCG@10 | 64.9 | 65.5 (Voyage) |
| MMTEB (Multilingual avg.) | mean | 69.9 | 63.8 (Nova) |
| MTEB (Code) | mean | 84.0 | |

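The model outperforms previous approaches, including Amazon Nova MME and Voyage-3.5-multimodal, on most of these tasks, with an overall mean (over the intersection of tasks) of 77.2, compared to 68.2 (Nova), 70.0 (Voyage), and 64.1 (Legacy). The exception among the listed tasks is ViDoRe V2 document retrieval, where Voyage scores slightly higher (65.5 vs. 64.9).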

Key architectural elements, such as deep, shared transformer integration of all modalities, yield improved cross-modal alignment, notably for complex multimodal queries (e.g., long captions, VATEX, YouCook2). Native audio modeling without cascaded ASR preserves prosodic and acoustic features, resulting in a +3.6 MRR@10 increase on MSEB retrieval.
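For readers less familiar with the metrics quoted in this section, the sketch below gives common binary-relevance definitions of recall@k, NDCG@k, and MRR@k for a single query. Benchmark suites differ in details such as graded relevance or hit-rate style recall, so these are illustrative definitions rather than the exact evaluation code behind the reported numbers. The function names are hypothetical.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal


ranking = ["d3", "d1", "d2"]      # system output, best first
gold = {"d1"}                     # the single relevant document
scores = (recall_at_k(ranking, gold, 1), mrr_at_k(ranking, gold, 10), ndcg_at_k(ranking, gold, 10))
```

Benchmark scores are these per-query values averaged over all queries and usually reported as percentages.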

5. Generalization and Zero-Shot Domain Evaluation

Gemini Embedding 2 exhibits reliable zero-shot transfer in specialized domains, as measured on distinct datasets:

| Model | MicroVQA | ArtCap | AstroLLaVA | Recipe1M Ingredients | Recipe1M Instructions |
|---|---|---|---|---|---|
| CLIP Base | 34.1 | 34.1 | 21.2 | 64.6 | 61.1 |
| CLIP Large | 46.7 | 52.2 | 31.6 | 76.0 | 75.6 |
| ALIGN Base | 48.1 | 49.2 | 18.4 | 70.3 | 70.8 |
| TIPS Giant | 20.0 | 65.2 | 10.1 | 66.0 | 65.6 |
| Voyage-3.5 | 53.3 | 48.7 | 30.3 | | |
| Gemini Embedding 2 | 79.3 | 67.7 | 64.4 | 90.2 | 92.1 |

On AstroLLaVA (astronomy), Gemini Embedding 2 roughly doubles the best prior recall@5 (64.4 vs. 31.6 for CLIP Large), and on MicroVQA (microscopy) it improves on the best prior result by 26 points (79.3 vs. 53.3 for Voyage-3.5). These gains are attributed to unified multimodal reasoning (e.g., layout and text in scientific images). High recall (90%+) on Recipe1M for both ingredients and instructions highlights the model's fine-grained visual–textual alignment, and the model is the top scorer on every dataset in the table.
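The size of these gains can be checked with a few lines of arithmetic on the table above. The snippet copies the reported scores, takes the best prior model per dataset, and computes the ratio of the Gemini Embedding 2 score to it. Entries missing from the table are simply left out; no values are filled in.

```python
# Scores copied from the zero-shot table; Voyage-3.5 has no Recipe1M entries.
prior = {
    "MicroVQA": {"CLIP Base": 34.1, "CLIP Large": 46.7, "ALIGN Base": 48.1, "TIPS Giant": 20.0, "Voyage-3.5": 53.3},
    "ArtCap": {"CLIP Base": 34.1, "CLIP Large": 52.2, "ALIGN Base": 49.2, "TIPS Giant": 65.2, "Voyage-3.5": 48.7},
    "AstroLLaVA": {"CLIP Base": 21.2, "CLIP Large": 31.6, "ALIGN Base": 18.4, "TIPS Giant": 10.1, "Voyage-3.5": 30.3},
    "Recipe1M Ingredients": {"CLIP Base": 64.6, "CLIP Large": 76.0, "ALIGN Base": 70.3, "TIPS Giant": 66.0},
    "Recipe1M Instructions": {"CLIP Base": 61.1, "CLIP Large": 75.6, "ALIGN Base": 70.8, "TIPS Giant": 65.6},
}
gemini = {
    "MicroVQA": 79.3,
    "ArtCap": 67.7,
    "AstroLLaVA": 64.4,
    "Recipe1M Ingredients": 90.2,
    "Recipe1M Instructions": 92.1,
}

ratios = {}
for dataset, scores in prior.items():
    best_model = max(scores, key=scores.get)             # strongest prior model
    ratios[dataset] = gemini[dataset] / scores[best_model]
    print(f"{dataset}: best prior {best_model} = {scores[best_model]}, ratio = {ratios[dataset]:.2f}")
```

The ratio is about 2.04 on AstroLLaVA and about 1.49 on MicroVQA, while ArtCap is the closest contest at roughly 1.04.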

6. Principal Applications

Gemini Embedding 2 is intended as a foundation for varied downstream applications:

  • Retrieval-Augmented Generation (RAG): Enables indexing and retrieval over text, figures, PDFs, and slides. Multimodal embedding supports direct search with text/image queries and eliminates the need for separate OCR or captioning.
  • Recommendation: Unified embeddings for items such as videos, podcasts, and multimodal recipes allow cross-modal similarity-based recommendation; signals from all modalities are fused in scoring.
  • Semantic Search: Provides a single-model solution for vector search across text, documents, slides, recorded calls, user-generated images, and more. Queries in any modality retrieve relevant multimodal results, and the embeddings can be served from vector search libraries and databases, for example FAISS.
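All three applications above reduce to nearest-neighbour search over stored embeddings. The sketch below shows the idea with brute-force cosine similarity in NumPy, including optional truncation to a shorter Matryoshka prefix. The embeddings are random stand-ins rather than model outputs, no embedding API is called, and `build_index` and `search` are hypothetical helper names. A production system would typically swap the brute-force scan for an approximate index.

```python
import numpy as np


def build_index(embeddings, dim=None) -> np.ndarray:
    """L2-normalize item embeddings, optionally keeping only the first `dim` coordinates."""
    x = np.asarray(embeddings, dtype=np.float32)
    if dim is not None:
        x = x[:, :dim]                      # Matryoshka-style truncation
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def search(index: np.ndarray, query, k: int = 5) -> list[tuple[int, float]]:
    """Return the top-k (item position, cosine similarity) pairs for one query embedding."""
    q = np.asarray(query, dtype=np.float32)[: index.shape[1]]   # match the index width
    q = q / np.linalg.norm(q)
    scores = index @ q                      # cosine similarity, since both sides are unit-norm
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 3072))       # stand-ins for embeddings of documents, images, clips
query = items[42] + 0.05 * rng.normal(size=3072)   # a query close to item 42
full_hits = search(build_index(items), query, k=3)
short_hits = search(build_index(items, dim=768), query, k=3)   # same search on a 768-dim prefix
```

Because items of every modality live in one space, the same index can hold text chunks, page images, and video clips side by side, and a query of any modality is scored against all of them.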

7. Limitations and Prospects for Further Development

Identified limitations include the storage and maintenance burden introduced by the model soup procedure and the remaining potential for incremental improvement via light in-domain fine-tuning. Compute demands for real-time, high-FPS video/audio indexing persist.
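The storage burden mentioned here follows from how a model soup is built: several fine-tuned checkpoints have to be kept so that their parameters can be averaged. A minimal sketch of that averaging, assuming each checkpoint is a plain dictionary of NumPy arrays with identical keys and shapes (the function name `model_soup` and this checkpoint format are hypothetical):

```python
import numpy as np


def model_soup(checkpoints: list[dict], weights=None) -> dict:
    """Average parameters across checkpoints (uniformly if `weights` is None)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)          # unweighted soup
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = float(sum(weights))
    soup = {}
    for name in checkpoints[0]:
        # Weighted sum of the same tensor from every checkpoint, then normalize.
        acc = sum(w * np.asarray(ckpt[name], dtype=np.float64)
                  for w, ckpt in zip(weights, checkpoints))
        soup[name] = acc / total
    return soup


ckpt_a = {"proj": np.array([0.0, 2.0])}
ckpt_b = {"proj": np.array([2.0, 4.0])}
uniform = model_soup([ckpt_a, ckpt_b])               # plain average
weighted = model_soup([ckpt_a, ckpt_b], weights=[3, 1])
```

Only the averaged parameters are needed at inference time, but revisiting the mixture later requires all of the source checkpoints to remain available.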

Future avenues for research include:

  • Integration of ranking/data signals (e.g., click-through, dwell time) to optimize retrieval quality end-to-end.
  • Joint fine-tuning of the embedding model and LLM-based generators for agentic, closed-loop RAG scenarios.
  • Expanded evaluations with complex interleaved and conversational multimodal queries.
  • Further optimization of inference latency, particularly for on-device and multi-modal settings.

Gemini Embedding 2 establishes a high-capacity, unified embedding framework with native multimodal support, state-of-the-art performance, robust domain generalization, and adaptability for major retrieval and recommendation tasks. Its training algorithm, architectural integration across modalities, and performance on comprehensive benchmarks mark it as a versatile resource for next-generation multimodal systems (Shanbhogue et al., 26 May 2026).
