jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval (2506.18902v1)

Published 23 Jun 2025 in cs.AI, cs.CL, and cs.IR

Abstract: We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.

Universal Multimodal Multilingual Embeddings: An Analysis of jina-embeddings-v4

The paper "jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval" (Günther et al., 23 Jun 2025 ) presents a 3.8B parameter model that unifies text and image representations for a broad spectrum of retrieval tasks, including visually rich document retrieval and code search. The model is distinguished by its support for both single-vector and multi-vector (late interaction) embeddings, task-specific LoRA adapters, and a new benchmark (Jina-VDR) for evaluating retrieval on visually complex documents.

Model Architecture and Training Paradigm

jina-embeddings-v4 is built on the Qwen2.5-VL backbone, leveraging a single transformer-based architecture for both text and image modalities. Unlike CLIP-style dual-encoder models, this approach processes both modalities through a unified path, reducing the modality gap and enabling true multimodal input handling. Images are converted into sequences of "image tokens" and passed to the LLM alongside text tokens, facilitating joint semantic representation.

The model supports two output modes:

  • Single-vector embeddings: 2048-dimensional, truncatable to 128 dimensions via Matryoshka Representation Learning, enabling efficient storage and computation.
  • Multi-vector embeddings: 128-dimensional vectors per token, suitable for late-interaction retrieval strategies (e.g., ColBERT-style), providing higher retrieval precision at increased computational cost; a minimal scoring sketch follows this list.
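The sketch below illustrates, under stated assumptions, how the two output modes are typically scored at query time: plain cosine similarity for single-vector embeddings and ColBERT-style MaxSim for multi-vector embeddings. The arrays stand in for (assumed L2-normalized) model outputs; none of the function names belong to the model's actual API.

```python
import numpy as np

def single_vector_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Cosine similarity between one query and one document embedding
    (both assumed L2-normalized; 2048-d or a Matryoshka-truncated prefix)."""
    return float(query_vec @ doc_vec)

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each 128-d query token vector, take the best
    cosine similarity over all document token vectors, then sum over query tokens."""
    sims = query_tokens @ doc_tokens.T      # shape: (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())    # best-matching doc token per query token

# Toy usage with random unit vectors standing in for real model output.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(40, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```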

Task specialization is achieved through LoRA adapters, each with 60M parameters, trained for:

  • Asymmetric query-document retrieval
  • Symmetric semantic similarity
  • Code retrieval

Adapters are selected at inference time, allowing the model to adapt to diverse retrieval scenarios without significant memory overhead.
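Since adapter switching is central to the design, the following hypothetical sketch shows how per-task LoRA adapters can be swapped at inference time with Hugging Face PEFT. The checkpoint name, adapter paths, and adapter names are illustrative assumptions; the released model may package its adapters differently, but the mechanism is the same idea.

```python
# Hypothetical sketch of inference-time LoRA adapter switching with Hugging Face PEFT.
# Checkpoint and adapter paths/names below are illustrative, not the released artifacts.
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")  # shared backbone (assumed checkpoint)
model = PeftModel.from_pretrained(base, "adapters/retrieval", adapter_name="retrieval")
model.load_adapter("adapters/text-matching", adapter_name="text-matching")
model.load_adapter("adapters/code", adapter_name="code")

model.set_adapter("code")        # route code-search requests through the code adapter
# ... encode code queries and documents ...
model.set_adapter("retrieval")   # switch back for asymmetric query-document retrieval
```

Only the ~60M adapter parameters differ between tasks, so all three adapters can stay resident in memory alongside the single 3.8B backbone.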

Training proceeds in two phases: initial contrastive pair training (text-text and text-image) with InfoNCE and Matryoshka loss, followed by task-specific fine-tuning of the LoRA adapters using appropriate triplet or similarity-based objectives.
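A minimal sketch of the phase-one objective follows, assuming in-batch negatives, a temperature of 0.05, and an equal-weighted Matryoshka dimension schedule (all assumptions; the paper's exact hyperparameters are not reproduced here).

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: the i-th query's positive is the i-th document;
    every other document in the batch serves as a negative."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = (q @ d.T) / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

def matryoshka_info_nce(q, d, dims=(128, 256, 512, 1024, 2048)):
    """Matryoshka-style loss: apply the contrastive objective to nested prefixes of the
    embedding so that truncated vectors remain useful. The dimension schedule and equal
    weighting here are illustrative assumptions."""
    return sum(info_nce(q[:, :k], d[:, :k]) for k in dims) / len(dims)

# Toy usage with random tensors standing in for encoder outputs.
q, d = torch.randn(16, 2048), torch.randn(16, 2048)
print(float(matryoshka_info_nce(q, d)))
```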

Jina-VDR Benchmark: Evaluating Visually Rich Retrieval

The introduction of the Jina-VDR benchmark addresses the lack of comprehensive evaluation datasets for visually rich document retrieval. Jina-VDR extends prior work (e.g., ViDoRe) by including:

  • 30 new tasks across domains (legal, historical, marketing, etc.)
  • Multilingual coverage (up to 20 languages)
  • Diverse document types (charts, tables, maps, rendered markdown, etc.)
  • Both real-world and synthetic data, with LLM-generated queries to ensure relevance

This benchmark enables robust assessment of models' ability to integrate textual and visual information for retrieval tasks beyond simple OCR or question answering.

Empirical Results

jina-embeddings-v4 demonstrates strong performance across a wide range of benchmarks:

| Model | J-VDR | ViDoRe | CLIPB | MMTEB | MTEB-en | COIR | LEMB | STS-m | STS-en |
|---|---|---|---|---|---|---|---|---|---|
| jina-embeddings-v4 (dense) | 72.19 | 84.11 | 84.11 | 66.49 | 55.97 | 71.59 | 67.11 | 72.70 | 85.89 |
| jina-embeddings-v4 (late interaction) | 79.29 | 90.17 | – | – | – | – | – | – | – |
| jina-embeddings-v3 | 47.06 | 26.02 | – | 58.58 | 54.33 | 55.07 | 55.66 | 75.77 | 85.82 |
| voyage-3 | – | – | – | 66.13 | 53.46 | 67.23 | 74.06 | 68.33 | 78.59 |
| gemini-embedding-001 | – | – | – | 67.71 | 64.35 | 73.11 | – | 78.35 | 85.29 |

Key empirical findings:

  • State-of-the-art on visually rich retrieval: On Jina-VDR and ViDoRe, jina-embeddings-v4 (especially in late interaction mode) outperforms both CLIP-style and other VLM-based models by a significant margin.
  • Competitive on text and code retrieval: Performance on MMTEB, MTEB, and code retrieval (COIR) is on par with or exceeds specialized models, with only highly specialized code models (e.g., voyage-code) showing higher scores on code tasks.
  • Superior cross-modal alignment: The model achieves a cross-modal alignment score of 0.71 (Flickr30K), compared to 0.15 for OpenAI CLIP, indicating a substantial reduction in the modality gap; a sketch of one way such a score can be measured follows this list.
  • Robust multilingual and long-context support: The model handles up to 32k tokens and supports retrieval in over 20 languages, with strong results on long document and multilingual benchmarks.
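The paper's exact alignment metric is not reproduced here; one simple proxy, sketched below under that assumption, is the mean cosine similarity over matched image-caption pairs, which rises as the modality gap shrinks.

```python
import numpy as np

def mean_pair_cosine(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Mean cosine similarity over matched (image_i, caption_i) pairs.
    A rough proxy for cross-modal alignment; the paper's metric may differ."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(img * txt, axis=1)))
```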

Analysis of Embedding Space

The unified encoder architecture leads to several notable properties:

  • Reduced modality gap: The distributions of cosine similarities for matching image-text pairs and for matching text-text pairs lie much closer together, facilitating more reliable cross-modal retrieval.
  • Mitigated cone effect: The embedding space shows greater spread and clearer separation between positive and negative pairs, indicating more effective use of the space and better discrimination.
  • Flexible truncation: Matryoshka Representation Learning allows dynamic adjustment of embedding dimensionality with minimal quality loss, supporting resource-constrained deployments; see the truncation sketch after this list.
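As a concrete illustration of the truncation mentioned above, the minimal sketch below keeps the first 128 components of a 2048-dimensional embedding and re-normalizes it, which is the usual way Matryoshka-trained vectors are shortened (the dimension choice is illustrative).

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 128) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka-trained embedding and
    re-normalize so that cosine similarity remains meaningful at the reduced size."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=2048)   # stand-in for a 2048-d embedding
small = truncate_embedding(full, 128)               # ~16x smaller index footprint
```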

Practical Implications and Deployment Considerations

Deployment strategies:

  • Unified model for multiple modalities and tasks: Organizations can replace multiple specialized models with a single deployment, reducing operational complexity and resource usage.
  • Adapter-based specialization: LoRA adapters enable efficient task switching without retraining or significant memory overhead.
  • Scalability: The model's support for both single- and multi-vector outputs allows practitioners to trade off between retrieval precision and computational/storage cost depending on application requirements.

Resource requirements:

  • 3.8B parameters for the base model, with each LoRA adapter adding 60M parameters.
  • Inference can be performed on modern GPUs; for large-scale retrieval, late interaction mode will require more storage and compute due to multi-vector representations.
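A back-of-the-envelope comparison of per-document index size, assuming fp16 storage and roughly 512 stored token vectors per document (both assumptions; real figures depend on corpus and precision):

```python
BYTES_FP16 = 2

dense_full  = 2048 * BYTES_FP16          # 4,096 B  (~4 KB)   full single-vector embedding
dense_small =  128 * BYTES_FP16          #   256 B            Matryoshka-truncated embedding
late_inter  = 512 * 128 * BYTES_FP16     # 131,072 B (~128 KB) multi-vector, 512 tokens/doc

print(f"dense 2048-d: {dense_full} B | dense 128-d: {dense_small} B | "
      f"late interaction (512 tokens): {late_inter} B per document")
```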

Limitations:

  • While the model is competitive on code retrieval, highly specialized code models still outperform it on some benchmarks.
  • Performance on low-resource languages is limited by the Qwen2.5-VL backbone's language coverage.
  • Multi-vector (late interaction) retrieval, while more precise, incurs higher computational and storage costs.

Theoretical and Future Directions

The paper demonstrates that a unified, cross-modal encoder architecture can substantially reduce the modality gap and improve retrieval performance across diverse tasks. This supports the hypothesis that shared semantic spaces, when properly trained, are more effective than dual-encoder approaches for universal embedding.

Potential future developments:

  • Smaller, more efficient variants: Techniques such as quantization, pruning, or knowledge distillation could yield lighter models suitable for edge deployment.
  • Broader language and modality coverage: Extending the backbone to cover more languages and additional modalities (e.g., audio, video) would further generalize the approach.
  • Instruction-based or dynamic adapters: Integrating instruction-tuning or dynamic adapter selection could further improve task adaptability.

Conclusion

jina-embeddings-v4 represents a significant advance in universal embedding models, offering a practical, high-performance solution for multimodal, multilingual retrieval tasks. Its architecture and training paradigm provide a blueprint for future models aiming to unify semantic representation across modalities and tasks, with strong empirical evidence supporting its effectiveness in real-world retrieval scenarios.

Authors (10)
  1. Michael Günther
  2. Saba Sturua
  3. Mohammad Kalim Akram
  4. Isabelle Mohr
  5. Andrei Ungureanu
  6. Sedigheh Eslami
  7. Scott Martens
  8. Bo Wang
  9. Nan Wang
  10. Han Xiao