Jina-Embeddings-v4
- Jina-Embeddings-v4 is a universal, multimodal, multilingual embedding model unifying text and image representations in a single semantic space for advanced retrieval.
- Its architecture uses a unified transformer pathway for both modalities, producing dense or multi-vector embeddings with Matryoshka Representation Learning and late interaction for efficiency and precision.
- The model achieves state-of-the-art results on benchmarks including visually rich content and uses task-specific LoRA adapters for flexible application.
Jina-Embeddings-v4 is a universal, multimodal, multilingual embedding model that unifies text and image representations within a single semantic space. Built atop a 3.8 billion parameter Qwen2.5-VL backbone, the architecture is specifically designed for advanced retrieval and semantic matching in scenarios that include both single-modal (text, image) and cross-modal (text–image) tasks, with particular strength in handling visually rich or mixed-media inputs such as tables, charts, diagrams, and rendered documents.
1. Unified Multimodal Embedding Architecture
The core of Jina-Embeddings-v4 is a joint transformer-based architecture that processes both natural language text and image data through a unified pathway. Textual inputs are tokenized as usual to produce input sequences, while images are converted by a vision encoder into a sequence of image "tokens" (patch-level representations) that are processed alongside the text tokens through the same transformer stack. This design contrasts with dual-encoder models such as CLIP by fusing both modalities into a tightly aligned semantic space, minimizing the modality gap and supporting arbitrary compositions of multimodal input (e.g., documents containing both text and figures).
The output embeddings are produced in two modes:
- Single-vector (dense) embeddings: Produced by mean-pooling transformer outputs, yielding a 2048-dimensional vector that can be truncated at inference time due to Matryoshka Representation Learning (MRL) without significant loss in semantic accuracy.
- Multi-vector (late-interaction) embeddings: Per-token outputs from the final transformer layer (one per text token or image patch) are projected into 128-dimensional vectors. This facilitates late-interaction retrieval strategies such as those in ColBERT or ColPali, enhancing fine-grained cross-modal alignment.
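To make the two output modes concrete, the following is a minimal sketch of how dense and multi-vector embeddings can be derived from shared per-token hidden states. The projection head, pooling, and tensor shapes are illustrative (following the 2048-d dense and 128-d per-token figures quoted above) and do not reproduce the model's actual implementation.

```python
import torch

# Conceptual sketch of the two output modes (not the actual model code):
# given per-token hidden states from the shared transformer, produce
# (a) a single mean-pooled dense vector and (b) per-token vectors for
# late interaction via a small projection head.

hidden_size, multi_dim = 2048, 128
proj = torch.nn.Linear(hidden_size, multi_dim, bias=False)  # hypothetical projection head

def dense_embedding(hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token states into one 2048-d vector (mask excludes padding)."""
    mask = mask.unsqueeze(-1).float()
    pooled = (hidden_states * mask).sum(dim=0) / mask.sum(dim=0).clamp(min=1e-6)
    return torch.nn.functional.normalize(pooled, dim=-1)

def multi_vector_embedding(hidden_states: torch.Tensor) -> torch.Tensor:
    """Project each token (text or image patch) to a 128-d vector."""
    return torch.nn.functional.normalize(proj(hidden_states), dim=-1)

# Toy input: 7 tokens (e.g., a mix of text tokens and image-patch tokens).
states = torch.randn(7, hidden_size)
mask = torch.ones(7, dtype=torch.bool)
print(dense_embedding(states, mask).shape)    # torch.Size([2048])
print(multi_vector_embedding(states).shape)   # torch.Size([7, 128])
```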
2. Late Interaction and Matryoshka Representation Learning
For advanced retrieval use cases, Jina-Embeddings-v4 implements a late-interaction mechanism. In this paradigm, retrieval similarity is computed not by a single cosine similarity between global vectors, but by aggregating local (token-level) similarities:

$$
\mathrm{score}(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} q_i \cdot d_j
$$

where $q_i$ and $d_j$ are the embedding vectors of the $i$-th query token and the $j$-th passage/image token, respectively. This MaxSim-style aggregation enables precise local alignment, which is crucial for visually rich documents, tables, and diagrams.
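The sketch below implements this scoring rule over the 128-dimensional multi-vector outputs; it is a minimal illustration (single query, single document, no batching or padding handling), not the production scorer.

```python
import torch

# Minimal late-interaction (MaxSim-style) scorer matching the formula above.
# query_vecs: (|q|, d), doc_vecs: (|d|, d); both are L2-normalized so that
# dot products are cosine similarities.

def late_interaction_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    sims = query_vecs @ doc_vecs.T          # (|q|, |d|) token-to-token similarities
    return sims.max(dim=1).values.sum()     # max over doc tokens, sum over query tokens

query = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
doc = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)
print(late_interaction_score(query, doc))
```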
To optimize for both efficiency and deployability, the model employs Matryoshka Representation Learning (MRL). Here, loss functions are computed on progressively truncated versions of embedding vectors (e.g., 1024, 512, ... 64 dimensions). This ensures that shorter vectors carry nearly the same semantic discriminability as full-length embeddings, facilitating storage and computation trade-offs in large-scale deployments.
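A minimal sketch of MRL-style truncation at inference time follows, assuming embeddings are re-normalized after keeping only the first k dimensions; the dimension ladder is illustrative.

```python
import torch

# Sketch of inference-time Matryoshka truncation: keep the first k dimensions
# of the full 2048-d embedding and re-normalize before cosine similarity.

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    return torch.nn.functional.normalize(emb[..., :dim], dim=-1)

full_a = torch.nn.functional.normalize(torch.randn(2048), dim=-1)
full_b = torch.nn.functional.normalize(torch.randn(2048), dim=-1)

for k in (2048, 1024, 512, 256, 128, 64):
    a, b = truncate_embedding(full_a, k), truncate_embedding(full_b, k)
    print(k, torch.dot(a, b).item())  # similarity at each truncation level
```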
3. Task-Specific LoRA Adapters
Jina-Embeddings-v4 integrates Low-Rank Adaptation (LoRA) adapters, which are lightweight parameter modules attached to the backbone for task specialization:
- Retrieval adapters: Tuned for asymmetric query–document configurations, effective for heterogeneous search (e.g., queries vs. long-form documents).
- Text matching adapters: Targeted at symmetric semantic similarity tasks.
- Code adapters: Optimized for code search and text–code retrieval scenarios.
Each adapter introduces fewer than 2% additional parameters, enabling efficient multi-task optimization and easy parameter swapping at inference. Fine-tuning with LoRA allows rapid adaptation to new domains or specific retrieval tasks without retraining the entire backbone.
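For intuition on why each adapter stays well under 2% of the parameter count, here is a generic LoRA sketch: a frozen linear layer plus a trainable low-rank update. It is not the model's actual adapter code, and the rank and scaling values are illustrative.

```python
import torch

# Generic LoRA sketch (not the model's adapter implementation): a frozen
# linear projection plus a low-rank update B @ A that can be swapped per task
# (retrieval, text matching, code). Rank r and scaling alpha are illustrative.

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # backbone weights stay frozen
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(2048, 2048), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")  # well under 2% for this layer
```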
4. Evaluation and Benchmark Results
Jina-Embeddings-v4 demonstrates state-of-the-art performance across a range of retrieval and semantic similarity benchmarks:
- Single-modal and multilingual retrieval: nDCG@10 and clustering scores comparable to or better than previous open-source and proprietary systems on MTEB (English), MMTEB (multilingual), and semantic similarity tasks.
- Cross-modal retrieval: Superior results on CLIP-Bench for text-to-image and image-to-text matching, with much higher cross-modal alignment scores (e.g., average cosine similarity between matching image-text pairs is 0.71 for Jina-v4 vs. 0.15 for CLIP).
- Visually rich content: Jina-Embeddings-v4 outperforms prior models on the newly introduced Jina-VDR (Visually rich Document Retrieval) benchmark as well as on ViDoRe, both of which test real-world scenarios involving charts, tables, rendered markdown, code snippets, and mixed-language/country formats.
Results on Jina-VDR show that late-interaction mode boosts average nDCG@5 by 5–10 points over dense vectors and sets new marks for infographics, tables, slides, and multilingual scanned documents.
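For reference, nDCG@k, the headline metric in these comparisons, can be computed as below; this is the standard formulation, not the benchmark's own evaluation harness, and the toy relevance labels are illustrative.

```python
import math

# Reference nDCG@k implementation. `relevances` holds the graded relevance of
# each retrieved document, in ranked order.

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy ranking with relevant documents at ranks 1, 3, and 6.
print(round(ndcg_at_k([1, 0, 1, 0, 0, 1], k=5), 3))  # ~0.704
```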
5. Processing and Retrieval of Visually Rich Content
A significant strength of Jina-Embeddings-v4 is its unified processing of complex, visually grounded documents. Through the blend of unified encoding and late-interaction scoring:
- Structured elements—such as tables, rendered PDFs, diagrams—are natively represented in the same semantic space as text, allowing retrieval over mixed media.
- Queries can be in text, image, or mixed form, and retrieval targets can be of arbitrary document format, including advertisements, government forms, historic scans, markdown codebases, and more.
- This approach yields state-of-the-art retrieval accuracy on tasks that defeat earlier pipelines combining OCR with dense retrieval or dual-encoder CLIP methods.
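As a rough illustration of mixed-media retrieval in practice, the sketch below embeds a text query and a rendered document page with the released checkpoint. The `encode_text` / `encode_image` helpers and their `task`, `prompt_name`, and `return_multivector` arguments reflect the Hugging Face model card's custom-code interface as understood here; the exact names, parameters, and return types are assumptions to verify against the current model card.

```python
# Hedged usage sketch: method names and arguments follow the model card's
# custom-code interface as understood at the time of writing and may differ;
# verify against the current model card before relying on this.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4", trust_remote_code=True)

query = "EU CO2 emissions by sector, 2020"      # a textual query
page = "annual_report_page_12.png"              # a rendered document page (chart/table)

# Dense (single-vector) mode: one vector per input, compared with cosine similarity.
q_dense = model.encode_text(texts=[query], task="retrieval", prompt_name="query")
p_dense = model.encode_image(images=[page], task="retrieval")

# Multi-vector mode: per-token vectors, scored with late interaction
# (e.g., the MaxSim-style function sketched in Section 2).
q_multi = model.encode_text(texts=[query], task="retrieval", prompt_name="query",
                            return_multivector=True)
p_multi = model.encode_image(images=[page], task="retrieval",
                             return_multivector=True)
```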
6. The Jina-VDR Benchmark
In parallel with the model, the authors introduce the Jina-VDR benchmark suite, specifically designed to evaluate retrieval on visually rich, multilingual, and complex document types. The benchmark includes over 30 new datasets spanning charts and tables (ChartQA, TableVQA), markdown (GitHub READMEs in multiple languages), descriptions of mixed-media documents, and real-world search queries from legal, marketing, environmental, and government domains. It addresses a gap in previous evaluations by covering realistic, multimodal, and multilingual information-seeking scenarios.
| Model | Jina-VDR (nDCG@5) |
|---|---|
| BM25 + OCR | 45.5 |
| jina-embeddings-v3 + OCR | 48.1 |
| jina-clip-v2 | 40.2 |
| ColPali | 65.5 |
| DSE-Qwen2-2b | 48.6 |
| jina-embeddings-v4 (single-vector) | 73.9 |
| jina-embeddings-v4 (multi-vector) | 80.2 |
7. Summary of Capabilities and Impact
Jina-Embeddings-v4 represents a substantial advance in universal retrieval, characterized by:
- A unified transformer model aligned for both text and image (including mixed-format) content.
- Flexible output options: dense and multi-vector, with support for late interaction.
- Modular, efficient task adaptation via LoRA adapters.
- State-of-the-art results in both standard IR and challenging visually rich, multilingual, and multi-format document retrieval settings, as evidenced by performance on Jina-VDR, ViDoRe, and other established benchmarks.
- Enhanced cross-modal semantic coherence: substantial improvements in modality alignment and complex content processing relative to prior art.
This model establishes a reference point for future research and development in universal retrieval systems, particularly those serving real-world domains where text and graphics intermingle and retrieval performance across modalities and languages is paramount.