Gemini Embedding 2: Unified Multimodal Model
- Gemini Embedding 2 is a unified multimodal embedding model that integrates video, audio, image, and text using a transformer backbone.
- It employs large-scale contrastive learning with a multi-task, multi-stage training paradigm to excel in unimodal, cross-modal, and multimodal retrieval tasks.
- The model achieves state-of-the-art performance on benchmarks and supports advanced applications like retrieval-augmented generation and cross-modal recommendation.
Gemini Embedding 2 is a native multimodal embedding model constructed on a transformer backbone, integrating video, audio, image, and text inputs into a unified representation space. Designed to support arbitrary interleavings of modalities, it employs large-scale contrastive learning in a multi-task, multi-stage framework and achieves state-of-the-art results on unimodal, cross-modal, and multimodal retrieval tasks. The model leverages the Gemini transformer’s cross-modal attention capabilities to generalize across a diverse array of benchmarks, supporting both standard and fine-grained information retrieval, recommendation, and Retrieval-Augmented Generation (RAG) use cases. Its robust zero-shot transfer performance in specialized domains highlights the effectiveness of its unified embedding architecture (Shanbhogue et al., 26 May 2026).
1. Model Architecture
Gemini Embedding 2 incorporates a multimodal transformer (“Gemini”) with bidirectional attention as its backbone. The architecture natively ingests text, images, audio, and video, as well as their arbitrary interleavings, producing embeddings in $\mathbb{R}^{d}$ with $d = 3{,}072$.
- Tokenization and Modality Front-Ends:
  - Text is processed using Gemini’s tokenizer with standard subword tokenization.
  - Images are converted by a vision front-end into sequences of visual tokens.
  - Videos are sampled uniformly (up to 32 frames at 1 fps), with frames processed by the vision front-end and token sequences concatenated.
  - Audio is input as raw waveform and encoded via an audio front-end using convolutional subsampling and token embedding.
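The video sampling rule above ("up to 32 frames at 1 fps") can be read in more than one way. The sketch below assumes one interpretation: short clips contribute one frame per second, and longer clips are thinned to 32 evenly spaced frames. The function name `sample_frame_times` and its parameters are hypothetical and are not taken from the source.

```python
import math


def sample_frame_times(duration_s: float, max_frames: int = 32, fps: float = 1.0) -> list[float]:
    """Return timestamps (in seconds) of the frames to hand to the vision front-end.

    Assumed policy: take frames at `fps`; if that yields more than `max_frames`,
    keep `max_frames` of them spread evenly over the clip.
    """
    if duration_s <= 0 or max_frames <= 0:
        return []
    # Candidate frames at the base rate: t = 0, 1/fps, 2/fps, ... (all < duration).
    n_candidates = max(1, math.ceil(duration_s * fps))
    candidates = [i / fps for i in range(n_candidates)]
    if n_candidates <= max_frames:
        return candidates
    if max_frames == 1:
        return [candidates[0]]
    # Long clip: pick evenly spaced indices, always keeping the first and last candidate.
    step = (n_candidates - 1) / (max_frames - 1)
    return [candidates[round(i * step)] for i in range(max_frames)]


short_clip = sample_frame_times(10)    # one frame per second
long_clip = sample_frame_times(600)    # thinned to 32 frames
print(len(short_clip), len(long_clip))
```

Under this reading, a 10-second clip yields 10 frames and a 10-minute clip yields 32 frames roughly 19 seconds apart.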
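The resulting token sequences for all modalities are concatenated to yield a unified token sequence $T$ of length $L$. This sequence passes through the Gemini transformer $\mathcal{M}$, yielding token embeddings $H = \mathcal{M}(T) \in \mathbb{R}^{L \times d_M}$. A mean pooler computes
$$P = \frac{1}{L} \sum_{i=1}^{L} H_i \in \mathbb{R}^{d_M},$$
and a randomly initialized linear projection $f: \mathbb{R}^{d_M} \to \mathbb{R}^{d}$ maps the pooled vector to the final embedding $E = f(P) \in \mathbb{R}^{d}$ with $d = 3{,}072$.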
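The pooling and projection head described above can be sketched in a few lines. This is only an illustration under stated assumptions: the transformer is replaced by hand-written token vectors, the dimensions are toy values (the model itself is reported to use a hidden size of 2,560 and an output size of 3,072), and the Gaussian initialization and bias-free projection are choices made here, not details from the source.

```python
import math
import random


def mean_pool(token_embeddings):
    """Average L token vectors (each of size d_M) into a single d_M vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[k] for tok in token_embeddings) / n_tokens for k in range(dim)]


def init_projection(d_in, d_out, seed=0):
    """Randomly initialized d_out x d_in weight matrix (assumed Gaussian, scaled by 1/sqrt(d_in))."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(weights, vector):
    """Apply the linear map: one dot product per output coordinate."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def embed(token_embeddings, weights):
    """Mean-pool the token embeddings, then project to the output space."""
    return project(weights, mean_pool(token_embeddings))


# Toy stand-in for the transformer output: L = 3 tokens, d_M = 4.
tokens = [[1.0, 0.0, 2.0, -1.0], [3.0, 2.0, 0.0, 1.0], [2.0, 1.0, 1.0, 0.0]]
weights = init_projection(d_in=4, d_out=6)
embedding = embed(tokens, weights)
print(len(embedding))
```

Because pooling happens over the whole concatenated sequence, tokens from different modalities in an interleaved input contribute to the same pooled vector.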
2. Multi-Task, Multi-Stage Training Paradigm
Gemini Embedding 2 is trained using large-scale, multi-task contrastive learning with several distinctive features:
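- Noise Contrastive Objective: Each batch provides a query $q_i$, a positive $p_i^{+}$, and optionally a hard negative $p_i^{-}$. The per-example loss is
$$\mathcal{L}_i = -\log \frac{e^{\mathrm{sim}(q_i, p_i^{+})/\tau}}{e^{\mathrm{sim}(q_i, p_i^{-})/\tau} + \sum_{j=1}^{B} \mathrm{mask}(i,j)\, e^{\mathrm{sim}(q_i, p_j^{+})/\tau}},$$
where $\mathrm{sim}(x, y) = x^{\top} y / (\lVert x \rVert \lVert y \rVert)$ is cosine similarity, $B$ is the batch size, and $\tau$ is the temperature. The mask $\mathrm{mask}(i,j)$ is 0 when $j \neq i$ and either $q_j = q_i$ or $p_j^{+} = p_i^{+}$, and 1 otherwise, which prevents identical queries or positives from being counted as negatives. The hard-negative term is omitted when no hard negative is provided.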
- Matryoshka Representation Learning (MRL): The contrastive loss is simultaneously applied to overlapping sub-vectors of the embedding (first 768, 1,536, and all 3,072 dimensions), supporting multiple output dimensionalities.
- Training Stages:
  - Pre-Fine-Tuning (PFT): Converts autoregressive Gemini into an encoder using large, noisy, multi-task data (text, code, images), with batch sizes […].
  - Fine-Tuning (FT): Adds tasks including text, document, code, image, audio, and video, tuning batch sizes (256–2,048) and sampling rates for modality balance.
  - Model Soup: Conducts a (weighted or unweighted) average of multiple FT checkpoints to improve generalization.
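The three ingredients above (the masked contrastive loss, its application to nested embedding prefixes, and checkpoint averaging) can be sketched together. This is a toy illustration, not the training code: vectors are plain Python lists, duplicates are detected by comparing vectors (real pipelines would more plausibly compare example identities), the MRL terms are summed without weights, and checkpoints are flat parameter lists. All function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nce_loss(queries, positives, hard_negatives=None, tau=0.07):
    """Mean in-batch contrastive loss with duplicate masking and optional hard negatives."""
    batch = len(queries)
    total = 0.0
    for i in range(batch):
        pos_logit = cosine(queries[i], positives[i]) / tau
        logits = [pos_logit]
        for j in range(batch):
            if j == i:
                continue
            # Mask items that duplicate this query or its positive so they
            # are not counted as negatives (assumption: compare by value).
            if queries[j] == queries[i] or positives[j] == positives[i]:
                continue
            logits.append(cosine(queries[i], positives[j]) / tau)
        if hard_negatives is not None and hard_negatives[i] is not None:
            logits.append(cosine(queries[i], hard_negatives[i]) / tau)
        # Numerically stable log-sum-exp for the denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - pos_logit
    return total / batch


def mrl_loss(queries, positives, hard_negatives=None, dims=(768, 1536, 3072), tau=0.07):
    """Sum the contrastive loss over nested prefixes of the embedding (unweighted, by assumption)."""
    total = 0.0
    for d in dims:
        q = [v[:d] for v in queries]
        p = [v[:d] for v in positives]
        n = None
        if hard_negatives is not None:
            n = [None if v is None else v[:d] for v in hard_negatives]
        total += nce_loss(q, p, n, tau)
    return total


def model_soup(checkpoints, weights=None):
    """Weighted average of flat parameter lists; uniform when no weights are given."""
    if weights is None:
        weights = [1.0] * len(checkpoints)
    norm = sum(weights)
    n_params = len(checkpoints[0])
    return [sum(w * ckpt[k] for w, ckpt in zip(weights, checkpoints)) / norm
            for k in range(n_params)]


# Toy 2-D batch; dims=(1, 2) plays the role of the 768/1,536/3,072 prefixes.
q = [[1.0, 0.2], [-1.0, 0.3]]
p = [[0.9, 0.1], [-0.8, 0.4]]
print(mrl_loss(q, p, dims=(1, 2)))
print(model_soup([[1.0, 2.0], [3.0, 4.0]]))
```

Training every prefix against the same objective is what lets a consumer keep only the first 768 or 1,536 coordinates at serving time without retraining.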
3. Hyperparameter Choices and Data Composition
Key parameters adopted in Gemini Embedding 2 include:
| Parameter | Value/Description |
|---|---|
| Embedding dimension ($d$) | 3,072 (MRL at 768, 1,536 supported) |
| Transformer hidden size ($d_M$) | 2,560 (from Gemini) |
| Temperature ($\tau$) | 0.07 |
| Optimizer | AdamW, $\beta_1 =$ […], $\beta_2 =$ […], weight decay = 0.01 |
| Learning Rates | PFT: […] (warm-up) to […], FT: […] to […] |
| Data Scale | Trillions of examples, >50 tasks |
Data sources span COCO, Flickr30k, DOCCI, TextCaps (image–text), VATEX, MSR-VTT, YouCook2 (video–text), multiple speech corpora (audio–text), ViDoRe V2 (document retrieval), synthetic code pairs, and text-only tasks from MMTEB and MTEB Code.
4. Benchmark Performance
Gemini Embedding 2 leads on most of the selected unimodal, cross-modal, and multimodal tasks summarized in the following table (the metric is given per row). The exception is ViDoRe V2 document retrieval, where it trails Voyage by 0.6 NDCG@10:
| Task/Modality | Metric | Gemini Embedding 2 | Best Comparator |
|---|---|---|---|
| Image→Image (ImageNet) | R@1 | 83.6 | 71.8 (Legacy) |
| Text→Image (MSCOCO) | R@1 | 62.9 | 58.1 (Voyage) |
| Text→Video (VATEX) | NDCG@10 | 68.8 | 60.3 (Nova) |
| Image+Text→Text (EncycloVQA) | R@20 | 71.5 | 58.6 (Voyage) |
| Document Retrieval (ViDoRe V2) | NDCG@10 | 64.9 | 65.5 (Voyage) |
| MMTEB (Multilingual avg.) | mean | 69.9 | 63.8 (Nova) |
| MTEB (Code) | mean | 84.0 | – |
The model outperforms previous approaches including Amazon Nova MME and Voyage-3.5-multimodal, with an overall mean (intersection of tasks) of 77.2, compared to 68.2 (Nova), 70.0 (Voyage), and 64.1 (Legacy).
Key architectural elements, such as deep, shared transformer integration of all modalities, yield improved cross-modal alignment, notably for complex multimodal queries (e.g., long captions, VATEX, YouCook2). Native audio modeling without cascaded ASR preserves prosodic and acoustic features, resulting in a +3.6 MRR@10 increase on MSEB retrieval.
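For readers less familiar with the metrics quoted in this section (R@k, NDCG@10, MRR@10), the sketch below gives minimal per-query definitions. The exact conventions used by each benchmark (for example, linear versus exponential gain in NDCG, or how multiple relevant items are counted in recall) are not stated in the source, so the choices here are assumptions; reported scores are averages of such per-query values, typically scaled to 0–100.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG with linear gain; `relevance` maps doc id -> gain (missing ids count as 0)."""
    dcg = sum(relevance.get(doc, 0.0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_ids[:k], start=1))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# One toy query whose only relevant document is ranked second.
ranking = ["doc_b", "doc_a", "doc_c"]
print(recall_at_k(ranking, {"doc_a"}, k=1))
print(mrr_at_k(ranking, {"doc_a"}, k=10))
print(ndcg_at_k(ranking, {"doc_a": 1.0}, k=10))
```

R@1 only rewards a hit in the first position, whereas MRR@10 and NDCG@10 give partial credit for relevant items ranked lower, which is why the three metrics are not directly comparable across rows.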
5. Generalization and Zero-Shot Domain Evaluation
Gemini Embedding 2 exhibits reliable zero-shot transfer in specialized domains, as measured on distinct datasets:
| Model | MicroVQA | ArtCap | AstroLLaVA | Recipe1M Ingredients | Recipe1M Instructions |
|---|---|---|---|---|---|
| CLIP Base | 34.1 | 34.1 | 21.2 | 64.6 | 61.1 |
| CLIP Large | 46.7 | 52.2 | 31.6 | 76.0 | 75.6 |
| ALIGN Base | 48.1 | 49.2 | 18.4 | 70.3 | 70.8 |
| TIPS Giant | 20.0 | 65.2 | 10.1 | 66.0 | 65.6 |
| Voyage-3.5 | 53.3 | 48.7 | 30.3 | – | – |
| Gemini Embedding 2 | 79.3 | 67.7 | 64.4 | 90.2 | 92.1 |
On AstroLLaVA (astronomy), Gemini Embedding 2 more than doubles the best prior recall@5 (64.4 vs. 31.6 for CLIP Large), and on MicroVQA (microscopy) it leads the strongest comparator by 26 points (79.3 vs. 53.3 for Voyage-3.5). These gains are attributed to unified multimodal reasoning (e.g., layout and text in scientific images). Scores above 90 on Recipe1M for both ingredients and instructions point to fine-grained visual–textual alignment. The model leads on every listed dataset, although the margin on ArtCap is narrow (67.7 vs. 65.2 for TIPS Giant).
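The size of the lead over the strongest baseline can be checked directly from the table. The short script below copies the table values and reports, per dataset, the best comparator, the ratio, and the absolute gain; the variable and function names are illustrative only.

```python
datasets = ["MicroVQA", "ArtCap", "AstroLLaVA",
            "Recipe1M Ingredients", "Recipe1M Instructions"]

# Baseline scores copied from the table above; None marks a missing entry.
baselines = {
    "CLIP Base": [34.1, 34.1, 21.2, 64.6, 61.1],
    "CLIP Large": [46.7, 52.2, 31.6, 76.0, 75.6],
    "ALIGN Base": [48.1, 49.2, 18.4, 70.3, 70.8],
    "TIPS Giant": [20.0, 65.2, 10.1, 66.0, 65.6],
    "Voyage-3.5": [53.3, 48.7, 30.3, None, None],
}
gemini = [79.3, 67.7, 64.4, 90.2, 92.1]


def best_baseline(column):
    """Return (score, model name) of the strongest baseline in one column."""
    candidates = [(row[column], name) for name, row in baselines.items()
                  if row[column] is not None]
    return max(candidates)


for col, dataset in enumerate(datasets):
    score, name = best_baseline(col)
    ratio = gemini[col] / score
    gain = gemini[col] - score
    print(f"{dataset}: {gemini[col]} vs {score} ({name}), "
          f"ratio {ratio:.2f}, gain {gain:+.1f}")
```

With these numbers the ratio is about 2.04 on AstroLLaVA and about 1.49 on MicroVQA, and the smallest margin is 2.5 points on ArtCap.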
6. Principal Applications
Gemini Embedding 2 is intended as a foundation for varied downstream applications:
- Retrieval-Augmented Generation (RAG): Enables indexing and retrieval over text, figures, PDFs, and slides. Multimodal embedding supports direct search with text/image queries and eliminates the need for separate OCR or captioning.
- Recommendation: Unified embeddings for items such as videos, podcasts, and multimodal recipes allow cross-modal similarity-based recommendation; signals from all modalities are fused in scoring.
- Semantic Search: Provides a single-model solution for vector search across text, documents, slides, recorded calls, user-generated images, and more. Queries in any modality retrieve relevant multimodal results, and the embeddings can be served from a vector index such as FAISS or from a vector database.
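All three applications above reduce to nearest-neighbor search in the shared embedding space. The sketch below shows a brute-force cosine search over a small in-memory index, with optional truncation to a shorter prefix as permitted by MRL. The item ids and 4-dimensional vectors are made up for illustration; in practice the vectors would come from the embedding model and the search would be delegated to a dedicated vector index.

```python
import math


def normalize(vector):
    """Scale to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def truncate(vector, dim):
    """Keep the first `dim` coordinates (an MRL prefix) and re-normalize."""
    return normalize(vector[:dim])


def top_k(query, index, k=3, dim=None):
    """Brute-force cosine search over a dict of item id -> embedding."""
    dim = dim or len(query)
    q = truncate(query, dim)
    scored = []
    for item_id, embedding in index.items():
        e = truncate(embedding, dim)
        scored.append((sum(a * b for a, b in zip(q, e)), item_id))
    scored.sort(reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]


# Hypothetical mixed-modality index: a slide image, a call recording, a text recipe.
index = {
    "slide_12.png": [0.9, 0.1, 0.3, 0.0],
    "call_recording.wav": [0.1, 0.8, 0.0, 0.5],
    "recipe.txt": [0.2, 0.1, 0.9, 0.1],
}
query_embedding = [1.0, 0.0, 0.2, 0.0]  # stand-in for an embedded text query

for item_id, score in top_k(query_embedding, index, k=2):
    print(item_id, round(score, 3))
```

Since every modality lands in the same space, the index does not need to know whether an item was originally an image, an audio clip, or text; passing `dim=2` here mimics serving from a shorter MRL prefix at some cost in ranking fidelity.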
7. Limitations and Prospects for Further Development
Identified limitations include the storage and maintenance burden introduced by the model soup procedure and the remaining potential for incremental improvement via light in-domain fine-tuning. Compute demands for real-time, high-FPS video/audio indexing persist.
Future avenues for research include:
- Integration of ranking/data signals (e.g., click-through, dwell time) to optimize retrieval quality end-to-end.
- Joint fine-tuning of the embedding model and LLM-based generators for agentic, closed-loop RAG scenarios.
- Expanded evaluations with complex interleaved and conversational multimodal queries.
- Further optimization of inference latency, particularly for on-device and multimodal settings.
Gemini Embedding 2 establishes a high-capacity, unified embedding framework with native multimodal support, state-of-the-art performance, robust domain generalization, and adaptability for major retrieval and recommendation tasks. Its training algorithm, architectural integration across modalities, and performance on comprehensive benchmarks mark it as a versatile resource for next-generation multimodal systems (Shanbhogue et al., 26 May 2026).