Universal Multimodal Multilingual Embeddings: An Analysis of jina-embeddings-v4
The paper "jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval" (Günther et al., 23 Jun 2025 ) presents a 3.8B parameter embedding model that unifies text and image representations, supporting both single-vector and multi-vector (late interaction) embeddings. The model is designed for a broad spectrum of retrieval tasks, including cross-modal semantic similarity, information retrieval, and code search, with a particular emphasis on visually rich content such as tables, charts, and mixed-media documents. The authors also introduce Jina-VDR, a new benchmark for evaluating retrieval on visually complex documents.
Model Architecture and Training Paradigm
The model leverages Qwen2.5-VL as its backbone, extending it with a dual-mode output: users can select between traditional single-vector embeddings (2048 dimensions, truncatable via Matryoshka Representation Learning) and ColBERT-style multi-vector embeddings (128 dimensions per token). This flexibility enables both efficient dense retrieval and high-precision late interaction strategies.
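To make the two output modes concrete, the sketch below contrasts single-vector cosine scoring (with optional Matryoshka truncation) against ColBERT-style MaxSim late interaction. The dimensionalities (2048 and 128) follow the paper, but the function names and random tensors are purely illustrative and do not reflect the model's actual API.

```python
import torch
import torch.nn.functional as F

def dense_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    """Single-vector mode: one cosine similarity per (query, document) pair."""
    return F.cosine_similarity(query_vec, doc_vec, dim=-1)

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """Multi-vector mode: ColBERT-style late interaction.

    Each query token is matched to its most similar document token, and the
    per-token maxima are summed into the relevance score.
    """
    q = F.normalize(query_tokens, dim=-1)   # (num_q_tokens, 128)
    d = F.normalize(doc_tokens, dim=-1)     # (num_d_tokens, 128)
    sim = q @ d.T                           # token-to-token similarity matrix
    return sim.max(dim=-1).values.sum()

# Illustrative shapes only: random tensors stand in for real model outputs.
dense = dense_score(torch.randn(2048), torch.randn(2048))
late = maxsim_score(torch.randn(12, 128), torch.randn(300, 128))

# Matryoshka truncation: the leading k dimensions of the 2048-dim vector
# remain usable on their own after re-normalization (here k = 512).
truncated_query = F.normalize(torch.randn(2048)[:512], dim=0)
```

MaxSim scoring matches fine-grained content (e.g., a query term against a table cell in a page image) but stores one vector per token, so the single-vector mode remains the cheaper default for large-scale dense retrieval.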
A key architectural innovation is the use of a single, shared encoder for both text and image modalities. Images are preprocessed into "image tokens" and passed through the same transformer stack as text, enabling true multimodal fusion and sharply reducing the modality gap observed in dual-encoder architectures such as CLIP. This design is empirically shown to yield superior cross-modal alignment and more effective use of the embedding space.
Task specialization is achieved via LoRA adapters, with three variants trained for asymmetric query-document retrieval, symmetric semantic similarity, and code retrieval. Each adapter adds only 60M parameters, allowing efficient task switching at inference with minimal memory overhead.
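The paper specifies the adapter budget (roughly 60M parameters per task) rather than an implementation recipe, but a minimal LoRA layer illustrates why task switching at inference is cheap: only the small low-rank matrices differ per adapter, while the frozen backbone weights are shared. The rank and dimensions below are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a low-rank, task-specific update."""

    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 64.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # backbone stays frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Swapping the retrieval, similarity, or code adapter only replaces the small
# A/B matrices; the 3.8B-parameter backbone is loaded once and reused.
layer = LoRALinear(nn.Linear(2048, 2048))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter parameters in this layer: {trainable}")
```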
Training proceeds in two phases: (1) contrastive pretraining on text-text and text-image pairs using a joint InfoNCE and Matryoshka loss, and (2) task-specific fine-tuning of the LoRA adapters using hard negatives and, where available, ground-truth similarity scores (e.g., CoSENT loss for STS tasks).
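The pretraining objective can be sketched as an in-batch InfoNCE loss applied at several nested truncation lengths and averaged, which is the essence of combining a contrastive loss with Matryoshka Representation Learning. The temperature, truncation dimensions, and uniform weighting below are illustrative assumptions rather than the paper's exact hyperparameters, and the fine-tuning stage (hard negatives, CoSENT for STS) is omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric in-batch InfoNCE: the i-th document is the positive for the i-th query."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) similarity logits
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def matryoshka_info_nce(q: torch.Tensor, d: torch.Tensor,
                        dims=(128, 256, 512, 1024, 2048)) -> torch.Tensor:
    """Apply the contrastive loss to nested prefixes of the embeddings so that
    truncated vectors stay useful for retrieval."""
    return torch.stack([info_nce(q[:, :k], d[:, :k]) for k in dims]).mean()

# Illustrative batch of 32 paired embeddings (e.g., text queries and page images).
loss = matryoshka_info_nce(torch.randn(32, 2048), torch.randn(32, 2048))
```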
Jina-VDR: Benchmarking Visually Rich Document Retrieval
The Jina-VDR benchmark is a significant contribution, extending prior work (e.g., ViDoRe) by incorporating 30 new tasks across a wide range of domains, document types, and languages. The benchmark includes both real-world and synthetic datasets, with careful curation to ensure diversity in content and query types. Notably, it covers multilingual retrieval, non-question queries, and complex visual layouts, providing a comprehensive testbed for evaluating multimodal retrieval models.
Empirical Results
The model demonstrates strong performance across a variety of benchmarks:
- Visually Rich Document Retrieval: On Jina-VDR and ViDoRe, jina-embeddings-v4 achieves nDCG@5 scores of 72.19 (dense) and 79.29 (late interaction), outperforming prior models by a substantial margin, especially on non-question and multilingual tasks (the nDCG@k metric is recalled after this list).
- Text and Multilingual Retrieval: On MTEB and MMTEB, the model is competitive with or superior to state-of-the-art text embedding models, with notable improvements in long-document retrieval (LongEmbed benchmark).
- Cross-Modal Retrieval: On the CLIP benchmark, it surpasses both OpenAI CLIP and prior Jina models in average Recall@5, with particularly strong cross-modal alignment (cosine similarity of 0.71–0.72 on Flickr30K/MSCOCO).
- Code Retrieval: The code adapter achieves nDCG@10 scores competitive with generalist models, though specialized models (e.g., voyage-code) retain an edge on code-specific tasks.
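For reference, nDCG@k (reported above with k = 5, and k = 10 for code) rewards relevant results placed near the top of the ranking and normalizes by the ideal ordering; one common formulation is

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\,rel_i} - 1}{\log_2(i+1)},
\qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},
$$

where $rel_i$ is the graded relevance of the result at rank $i$ and IDCG@k is the DCG of the best possible ranking, so scores lie in $[0, 1]$.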
Embedding-space analysis shows that the model sharply reduces the modality gap and mitigates the cone effect, confirming that the shared-encoder architecture yields more uniform and semantically meaningful cross-modal representations.
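The modality-gap observation can be probed with a simple diagnostic: compare the centroids of L2-normalized text and image embeddings for paired data, alongside the average cosine similarity of aligned pairs. The centroid-distance proxy below is one common way to quantify the gap, not necessarily the exact analysis protocol of the paper; the random tensors stand in for real caption/image embeddings.

```python
import torch
import torch.nn.functional as F

def modality_gap(text_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
    """Distance between the centroids of normalized text and image embeddings.

    A large gap means the two modalities occupy separate narrow cones of the
    embedding space; a shared encoder is expected to shrink it.
    """
    t = F.normalize(text_emb, dim=-1).mean(dim=0)
    v = F.normalize(image_emb, dim=-1).mean(dim=0)
    return (t - v).norm().item()

def mean_pair_similarity(text_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
    """Average cosine similarity of aligned (caption, image) pairs."""
    return F.cosine_similarity(text_emb, image_emb, dim=-1).mean().item()

# Illustrative stand-ins for embeddings of 1,000 caption/image pairs.
captions, images = torch.randn(1000, 2048), torch.randn(1000, 2048)
print(modality_gap(captions, images), mean_pair_similarity(captions, images))
```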
Implications and Future Directions
Practical Implications:
- Unified Deployment: The single-model, multi-adapter approach enables organizations to deploy a single embedding service for text, image, and code retrieval, reducing operational complexity and resource requirements.
- Flexible Retrieval Strategies: Support for both single-vector and multi-vector outputs allows practitioners to trade off between retrieval speed and precision according to application needs.
- Enhanced Multilingual and Multimodal Coverage: The model's strong performance on visually rich, multilingual, and non-standard document types broadens the applicability of embedding-based retrieval in real-world, heterogeneous data environments.
Theoretical Implications:
- The results reinforce the value of shared-encoder architectures for cross-modal alignment, challenging the dominance of dual-encoder paradigms in multimodal retrieval.
- The effective use of LoRA adapters for task specialization suggests a scalable path for extending embedding models to new domains without retraining the full model.
Limitations and Future Work:
- While the model achieves state-of-the-art results on a wide range of tasks, specialized models still outperform it on certain domain-specific benchmarks (e.g., code retrieval).
- The 3.8B parameter size, while efficient relative to the breadth of supported tasks, may be prohibitive for some edge or low-resource deployments. The authors indicate plans to develop smaller variants.
- Further improvements in low-resource language support and visually complex document understanding remain open challenges.
Conclusion
jina-embeddings-v4 represents a significant advance in universal embedding models, offering a practical, high-performance solution for multimodal, multilingual retrieval. Its architectural choices—shared encoder, dual-mode output, and LoRA-based task specialization—yield strong empirical results and provide a template for future research in unified representation learning. The introduction of the Jina-VDR benchmark further establishes a rigorous standard for evaluating progress in visually rich document retrieval. Future work will likely focus on model efficiency, broader language coverage, and deeper integration of domain-specific knowledge.