VLM2Vec-V2: Unified Multimodal Embedding
- The paper introduces VLM2Vec-V2, a unified multimodal embedding framework that extends previous paradigms by integrating text, images, videos, and visual documents.
- It employs dynamic resolution, M-RoPE, and a shared convolutional frontend, leveraging instruction-guided contrastive loss for optimal cross-modal alignment.
- Empirical evaluations on the MMEB-V2 benchmark demonstrate superior performance and robust generalization across diverse modalities compared to prior methods.
VLM2Vec-V2 is a unified multimodal embedding framework designed to produce semantically aligned representations for text, images, videos, and visual documents within a shared embedding space. The approach advances previous multimodal embedding paradigms, which primarily focused on natural images, by explicitly extending support to temporally and structurally complex visual forms such as videos and multi-page documents. VLM2Vec-V2 leverages a combination of foundation model adaptation, modality-specific input handling, and instruction-guided contrastive learning to enable generalization across a comprehensive range of downstream tasks, including retrieval, classification, and question answering for diverse media types (Meng et al., 7 Jul 2025).
1. Unified Model Architecture
VLM2Vec-V2 is instantiated by fine-tuning a Qwen2-VL 2B vision–language backbone to encode all four modalities. Three architectural elements are central: naive dynamic resolution for variable-size input handling, Multimodal Rotary Position Embedding (M-RoPE) for spatial and temporal position encoding, and a unified convolutional frontend (2D + 3D) parameter-shared across images and videos.
Modality tokens declare the input type: <|image_pad|> for images and document pages, <|video_pad|> for sequences of video frames. Instruction fusion is implemented by prepending a short, task-specific text prompt as a “prefix” to each query, which the transformer then ingests alongside the other tokens.
Modality-specific flows are as follows:
- Text: Standard Qwen2 tokenizer followed by transformer layers.
- Image: Patch embedding → 2D convolutional encoder → M-RoPE → transformer.
- Video: Uniformly sampled 8-frame clips → shared 2D+3D conv backbone → M-RoPE (temporal) → transformer.
- Visual Document: Each page rendered as an image → image encoder → page-wise [VISUAL_TOKEN] prefix → transformer.
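At the model output, the final hidden state of the last token constitutes the embedding vector. No cross-modal fusion layers are used beyond a temperature-scaled cosine similarity head: $s(q, t) = \cos(\mathbf{h}_q, \mathbf{h}_t)/\tau$, where $\mathbf{h}_q$ and $\mathbf{h}_t$ are the query and target embeddings and the temperature $\tau$ is either trained or fixed.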
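The pooling and scoring step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, right-padded sequences are assumed, and the default `tau=0.05` is an arbitrary placeholder rather than the value used in the paper.

```python
import math

def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden state of the last non-padded token.

    hidden_states: list of per-token vectors (seq_len x dim)
    attention_mask: list of 0/1 flags, where 1 marks a real token
    Assumes right padding, so the last 1 in the mask is the final token.
    """
    last_index = max(i for i, flag in enumerate(attention_mask) if flag == 1)
    return hidden_states[last_index]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity(query_emb, target_emb, tau=0.05):
    """Temperature-scaled cosine similarity (tau is an illustrative value)."""
    return cosine(query_emb, target_emb) / tau
```

With left-padded batches the last position is always the final token, so the mask lookup would be unnecessary; which padding side the released code uses is not stated here.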
2. Training Objectives and Optimization
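The training objective is uniformly formulated as a retrieval task over all modalities using a contrastive (InfoNCE) loss. For a batch of $N$ query–positive pairs $\{(q_i, t_i^+)\}_{i=1}^{N}$, with the other targets in the batch serving as negatives, the objective is:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\cos(\mathbf{h}_{q_i}, \mathbf{h}_{t_i^+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\cos(\mathbf{h}_{q_i}, \mathbf{h}_{t_j^+})/\tau\right)}$$

where $\mathbf{h}$ denotes an embedding and $\tau$ is the temperature. Instruction templates normalize all inputs, so queries and targets are uniformly formatted.
All tasks are cast as retrieval, so optimization minimizes only this loss across batches. Trainable parameters are restricted to LoRA adapters (rank=16, $\alpha$=32) atop Qwen2-VL. Optimization employs the AdamW optimizer over 2,000–5,000 steps with 8×H100 GPUs and a batch size of 1,024 (16 sub-batches × 64).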
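As a rough illustration of the contrastive objective, the sketch below computes an InfoNCE loss with in-batch negatives from a precomputed cosine-similarity matrix. It assumes that entry `[i][j]` holds the similarity between query `i` and target `j`, so the diagonal holds the positives; the function name and the default `tau` are hypothetical and not taken from the paper.

```python
import math

def info_nce(sim_matrix, tau=0.05):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j]: cosine similarity of query i and target j.
    The positive for query i is target i; all other targets are negatives.
    tau: temperature (illustrative value, not the paper's setting).
    """
    n = len(sim_matrix)
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / tau for s in row]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n
```

When every similarity is equal the loss reduces to $\log N$, and it approaches zero as the positives separate from the negatives, which is a quick sanity check for any reimplementation.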
3. MMEB-V2 Benchmark Suite
VLM2Vec-V2 is evaluated on MMEB-V2, an extension of the original MMEB image↔text benchmark, introducing five new meta-task categories for four input modalities: video retrieval, moment (temporal) retrieval, video classification, video QA, and visual document retrieval. The benchmark covers nine meta-tasks spanning 78 datasets.
Table: Sample MMEB-V2 Task Statistics
| Meta-Task | #Query | #Candidates | Modalities |
|---|---|---|---|
| Video Retrieval (DiDeMo) | 1,004 | 1,004 | T+V |
| Moment Retrieval (QVHighlights) | 1,083 | 10 | T+V+V |
| Video Classification (Kinetics-700) | 1,000 | 700 | V+T |
| Video QA (MVBench) | 4,000 | 3–5 | V+T |
| VisDoc (ViDoRe-V1) | 280–1,646 | 70–999 | T+D |
Video preprocessing includes uniform 8-frame sampling per clip. Document pages are rendered at 224×224 (max 20 pages per document). Text is lower-cased and tokenizer-standardized.
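Uniform 8-frame sampling can be realized in several ways; the sketch below picks the midpoint of each of eight equal-length segments. The exact rule used by the authors is not given here, so the midpoint choice and the function name are assumptions.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick num_samples frame indices spread evenly over a clip.

    The clip is split into num_samples equal segments and the midpoint
    of each segment is chosen (an assumed convention). Clips shorter
    than num_samples frames repeat indices.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]
```

For an 80-frame clip this selects frames 5, 15, ..., 75; a 4-frame clip yields each frame twice, which keeps the input length fixed at eight.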
Evaluation metrics:
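- Hit@K (usually K=1): $\text{Hit@}K = \frac{1}{|Q|}\sum_{q \in Q} \mathbb{1}[\text{rank}_q \le K]$, where $\text{rank}_q$ is the rank of the correct candidate for query $q$.
- NDCG@5 (VisDoc): $\text{NDCG@}5 = \text{DCG@}5 / \text{IDCG@}5$, with $\text{DCG@}5 = \sum_{i=1}^{5} \frac{rel_i}{\log_2(i+1)}$ and $\text{IDCG@}5$ the DCG@5 of the ideal ranking.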
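The two metrics above can be computed per query as in the following sketch and then averaged over the query set. It uses the standard definitions with linear gain and a $\log_2$ rank discount; whether the benchmark's evaluation code uses exactly this gain convention is an assumption, and the function names are hypothetical.

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k=1):
    """1.0 if any of the top-k ranked candidates is relevant, else 0.0."""
    return 1.0 if any(c in relevant_ids for c in ranked_ids[:k]) else 0.0

def ndcg_at_k(ranked_ids, relevance, k=5):
    """NDCG@k for a single query.

    ranked_ids: candidate ids in predicted order (best first)
    relevance: dict mapping candidate id -> graded relevance (missing = 0)
    """
    def dcg(gains):
        # rank is 0-based here, so the discount is log2(rank + 2)
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [relevance.get(c, 0) for c in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With binary relevance, linear and exponential gain give identical scores, so the choice only matters for datasets with graded labels.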
4. Empirical Results and Ablation Studies
Performance is benchmarked against GME (2B), VLM2Vec (2B), and ColPali (3B) across tasks in images, videos, and visual documents. VLM2Vec-V2 achieves the highest overall average (58.0), outperforming prior comparably sized methods by 3–4 points.
Table: Main Experimental Results
| Model | Image CLS | Image QA | Image RET | Image GD | Video CLS | Video QA | Video RET | Video MRET | VisDoc NDCG@5 |
|---|---|---|---|---|---|---|---|---|---|
| GME (2B) | 54.4 | 29.9 | 66.9 | 55.5 | 34.9 | 42.0 | 25.6 | 32.4 | 72.7 |
| VLM2Vec (2B) | 58.7 | 49.3 | 65.0 | 72.9 | 33.4 | 30.5 | 20.6 | 33.0 | 41.6 |
| ColPali (3B) | 40.3 | 11.5 | 48.1 | 40.3 | 26.7 | 37.8 | 21.6 | 25.5 | 71.0 |
| VLM2Vec-V2 (2B) | 62.9 | 56.3 | 69.5 | 77.3 | 39.3 | 34.3 | 28.8 | 38.5 | 65.4 |
Ablation analyses examine the impact of modality combinations and sub-batch size:
- Joint Training on all three modalities delivers optimal overall results, primarily by improving generalization on non-image domains.
- Sub-batching: Image tasks benefit from moderate sub-batch sizes (e.g., 64), while harder modalities benefit from larger sub-batches.
Table: Effect of Modality Combinations
| Train Data | Image Avg | Video Avg | VisDoc Avg | Overall |
|---|---|---|---|---|
| Image only | 62.5 | 33.9 | 27.9 | 45.2 |
| Image + Video | 63.3 | 34.9 | 51.9 | 48.3 |
| Image + VisDoc | 62.4 | 33.3 | 47.4 | 47.7 |
| All three | 62.7 | 34.6 | 52.2 | 49.1 |
A plausible implication is that cross-modal exposure enables induction of abstract semantic structure not present in unimodal training.
5. Design Principles and Insights
VLM2Vec-V2 demonstrates that unified, task-instruction-driven embedding enables seamless integration of diverse modalities (static images, temporally aligned videos, and multi-page documents) without the need for architecture switching. The use of instruction signals as "soft adapters" both guides cross-modal alignment and permits task conditioning, improving robustness.
Best practices highlighted by empirical analysis include:
- Adopting a large vision–language foundation model (Qwen2-VL) equipped with M-RoPE and dynamic resolution.
- Consistent use of instruction-augmented contrastive loss across all retrieval tasks.
- Controlled, interleaved sub-batching to balance sample hardness and modality diversity.
- Parameter-efficient tuning via LoRA, enabling large-scale adaptation (300k+ video, 480k+ document examples) with a 2B parameter model.
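One way to read the interleaved sub-batching practice is sketched below: a global batch is assembled from several sub-batches, each drawn from a single data source. The single-source-per-sub-batch rule, the uniform choice of source, and all names are assumptions made for illustration; the defaults mirror the 16 × 64 configuration mentioned earlier.

```python
import random

def interleaved_batch(datasets, num_sub_batches=16, sub_batch_size=64, seed=0):
    """Assemble one global batch from single-source sub-batches.

    datasets: dict mapping source name -> list of examples
    Returns a list of (source_name, example) pairs of length
    num_sub_batches * sub_batch_size.
    """
    rng = random.Random(seed)
    names = sorted(datasets)  # sorted for reproducibility
    batch = []
    for _ in range(num_sub_batches):
        name = rng.choice(names)  # assumed: sources picked uniformly
        # sample with replacement so small sources still fill a sub-batch
        examples = rng.choices(datasets[name], k=sub_batch_size)
        batch.extend((name, example) for example in examples)
    return batch
```

Keeping each sub-batch within one source makes in-batch negatives harder, since they come from the same task, while mixing sources across sub-batches preserves modality diversity in every optimization step.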
Joint training confers systematic improvements: video representation is grounded by image-text alignment; visual document understanding benefits from image-based VQA and retrieval. This suggests that shared supervision across modalities is key for scalable, robust multimodal representation.
6. Benchmark Impact and Prospective Directions
The introduction of MMEB-V2 accompanies VLM2Vec-V2 and establishes a broad and rigorous testbed for unified retrieval, classification, and QA evaluation over text, image, video, and document modalities. Empirical results indicate that advances in shared embedding via architectural and training innovations can materially extend the range of real-world applications, including AI agents, multimodal search and recommendation, and retrieval-augmented generation.
A plausible implication is that the unified approach outlined by VLM2Vec-V2 and MMEB-V2 will inform future research on scalable multimodal embedding, emphasizing flexible task formulation, instruction-driven alignment, and parameter-efficient adaptation across the evolving landscape of visual and textual data (Meng et al., 7 Jul 2025).