
Jina-VDR Benchmark

Updated 1 July 2025
  • The Jina-VDR Benchmark is a comprehensive multilingual and multimodal evaluation suite for assessing visually rich document retrieval across 30+ real-world tasks.
  • It incorporates diverse document modalities like images, tables, charts, and scanned PDFs, alongside multilingual queries and documents, using metrics like nDCG@5.
  • Jina-VDR advances evaluation rigor by reflecting the complexity of contemporary global information retrieval, serving as a critical tool for model assessment and progress.

The Jina-VDR Benchmark is a comprehensive multilingual and multimodal evaluation suite designed to assess visually rich document retrieval across an extensive set of real-world tasks. Introduced alongside the jina-embeddings-v4 universal embedding model, Jina-VDR advances the evaluation of retrieval systems by incorporating complexity in both document modalities (images, charts, maps, diagrams, tables, rendered markdown, scanned PDFs) and supported languages, thereby reflecting the practical requirements of information retrieval in contemporary, globally distributed, and semantically diverse environments (2506.18902).

1. Scope and Construction of the Jina-VDR Benchmark

Jina-VDR is constructed as a novel, large-scale benchmark with the explicit goal of evaluating the capabilities of embedding models on realistic, visually and linguistically diverse retrieval tasks. Unlike legacy standards such as ViDoRe—which evaluated mostly English/French question-answering over tables and charts—Jina-VDR introduces over 30 new retrieval tasks. These encompass multiple domains, including legal, governmental, financial, academic, software/IT, housing, entertainment, and scientific materials.

The benchmark features:

  • Multilingual queries and documents spanning up to 20 languages in certain datasets (e.g., Github Readme, Wikimedia Commons Documents, AirBnB, TweetStock).
  • Document modalities such as fully rendered markdown docs, complex table layouts, scanned pages, infographics, advertisements, technical diagrams, maps, and page screenshots.
  • Query types ranging from descriptive searches and instructions to facts, extending beyond traditional span-based question answering.

Data sources are drawn from a combination of repurposed VQA/OCR datasets, custom manual annotation, and synthetic generation to ensure coverage in domains where high-quality human annotation is lacking. Each retrieval task is formulated as a match between a text query and a document image, aligning evaluation with actual use patterns in enterprise and public sector document search.

| Dataset Name | Domain | Document Format | Query Format | Languages |
|---|---|---|---|---|
| github-readme-retrieval-multilingual | Software/IT | Markdown docs | Description | ar, bn, de, en, ... |
| wikimedia-commons-documents-ml | Mixed | Mixed | Description | ar, bn, de, en, ... |
| airbnb-synthetic-retrieval | Housing | Tables | Instruction | ar, de, en, es, ... |

2. Model Architecture and Support for Jina-VDR

The principal model evaluated on Jina-VDR is jina-embeddings-v4, built on the Qwen2.5-VL multimodal LLM backbone. The architecture diverges from CLIP-style dual-encoder approaches by using a unified transformer accepting both tokenized text and discretized image vectors within the same sequential input space.

Distinctive architectural features include:

  • Processing paths for text (token embedding and transformer encoding) and image (discrete image tokenization and transformer encoding), supporting input sequences up to 32,768 text tokens and 20 megapixel images.
  • Output in two forms: a 2048-dimensional single-vector (truncatable via Matryoshka Representation Learning) and a 128-dimensional vector per input token/patch for use in late interaction retrieval scenarios.
  • Task-specific Low-Rank Adaptation (LoRA) adapters—compact (60M parameter) modules trained for various retrieval settings: asymmetric (query vs. document), symmetric (semantic similarity), and code (natural language-to-code or code-to-code).
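The truncatable single-vector output described above can be shortened without retraining. A minimal numpy sketch of Matryoshka-style truncation (the slice-then-renormalize step is the common convention for MRL embeddings; the dimensions are the ones stated above, but the code is illustrative, not the model's API):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length.

    Matryoshka Representation Learning trains the vector so that its
    leading coordinates already form a usable lower-dimensional embedding.
    """
    truncated = embedding[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norm, 1e-12, None)

# A stand-in for a 2048-dimensional single-vector embedding.
full = np.random.default_rng(0).normal(size=2048)
full /= np.linalg.norm(full)

small = truncate_matryoshka(full, 256)
print(small.shape)  # (256,)
```

Downstream indexes can therefore trade accuracy for storage simply by choosing how many leading dimensions to keep.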

Mathematically, the late interaction similarity score between query embeddings $\bm{q}_i$ and document embeddings $\bm{p}_j$ is defined as

$$s_\mathrm{late}(q, p) = \sum_{i=1}^{n} \max_{j \in \{1, \ldots, m\}} \bm{q}_i \bm{p}_j^{\top}$$

This approach enables precise region-level matching between query semantics and local visual features in complex documents.
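The MaxSim-style sum described above is straightforward to compute. A minimal numpy sketch with toy multi-vector embeddings (shapes and values are illustrative only):

```python
import numpy as np

def late_interaction_score(q: np.ndarray, p: np.ndarray) -> float:
    """s_late(q, p) = sum_i max_j q_i . p_j  (MaxSim over document vectors).

    q: (n, d) multi-vector query embedding, one row per query token
    p: (m, d) multi-vector document embedding, one row per token/patch
    """
    sims = q @ p.T                       # (n, m) pairwise dot products
    return float(sims.max(axis=1).sum())  # best document match per query row

q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
score = late_interaction_score(q, p)  # per-row maxima: 0.9 and 0.8
```

Each query token is scored against its best-matching document patch, which is what allows region-level matching inside a page image.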

3. Evaluation Metrics and Performance

The benchmark employs standard retrieval performance metrics:

  • nDCG@5 and nDCG@10 for ranked retrieval evaluation,
  • Recall@5 for cross-modal (CLIP-style) tasks,
  • Spearman correlation for semantic similarity (STS) scenarios.
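The primary metric, nDCG@k, can be sketched in a few lines. This simplified version assumes `relevances` holds the graded relevance of every judged candidate in retrieved order, so the ideal ranking can be derived by sorting that same list:

```python
import math

def ndcg_at_k(relevances, k=5):
    """Normalized discounted cumulative gain over the top-k results.

    relevances: graded relevance of each judged candidate, in the order
    the system retrieved them (simplified: all judged docs are listed).
    """
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0, 0]))  # already in ideal order -> 1.0
```

A perfect ranking scores 1.0; any relevant document pushed down the list discounts the score logarithmically by rank.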

In Jina-VDR, jina-embeddings-v4 achieves state-of-the-art results:

  • nDCG@5: 72.19 (single-vector); 79.29 (multi-vector/late interaction)
  • ViDoRe (legacy benchmark): 84.11 (single-vector); 90.17 (late interaction)
  • Wikimedia Commons (multilingual sub-benchmark): 65.79 (single); 74.50 (multi-vector)

Significant improvements are observed over OCR+BM25, CLIP, and ColPali-style VDR baselines, particularly in visually rich, structurally complex, or non-English datasets. This suggests that late interaction with multi-vector embeddings in a unified transformer space substantially enhances fine-grained retrieval in multi-modal, cross-domain scenarios.

4. LoRA Adapters and Task Specialization

LoRA adapters are integral to model adaptation without the overhead of retraining the entire backbone. Each adapter module is fine-tuned for:

  • Asymmetric query-document retrieval (improving tasks with short queries vs. long documents, default for Jina-VDR)
  • Symmetric semantic similarity, suitable for STS or paraphrase scenarios
  • Code retrieval, for NL-to-code or code-to-code search

Adapters are selected at runtime for a given retrieval task. This modularity enables strong task-specific performance and reusability across domains. The backbone remains frozen, enabling efficient training and inference cycles while adding well under 2% to the backbone's memory footprint.
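The arithmetic behind such an adapter is compact. A numpy sketch of a LoRA-augmented linear layer with the usual alpha/r scaling (the rank and scaling values here are illustrative, not the model's actual hyperparameters):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = W x + (alpha / r) * B (A x)

    W: (out, in) frozen backbone weight
    A: (r, in), B: (out, r) low-rank adapter with r << min(in, out);
    only A and B are trained, which keeps each task adapter small.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))          # frozen backbone weight
A = rng.normal(size=(2, 16)) * 0.01   # rank r = 2
B = np.zeros((8, 2))                  # B starts at zero: adapter is a no-op
x = rng.normal(size=16)

print(np.allclose(lora_forward(x, W, A, B), W @ x))  # True
```

Because B is initialized to zero, training starts from the backbone's unchanged behavior, and swapping adapters at runtime only swaps the small A/B pair.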

5. Cross-Modal and Multilingual Retrieval Capabilities

Jina-VDR explicitly challenges models to operate across modalities and languages. The unified architecture of jina-embeddings-v4, in which image and text enter the same LLM transformer and are output jointly, reduces the modality gap seen in dual-encoder systems. The model achieves:

  • Strong cross-modal alignment (evidenced by cosine-similarity separation between positive and negative matches)
  • Robust performance on retrieval tasks with queries and documents in mismatched or non-English languages
  • Effective handling of heterogeneous, mixed-media inputs typical of real-world document search

On tasks like rendered charts with textual instructions, the model demonstrates effective region-level alignment due to late-interaction multi-vector outputs. This capability is critical for visually-rich, real-world document retrieval beyond simple image-caption matching.
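The cross-modal alignment discussed above reduces, at scoring time, to comparing unit embeddings. A toy cosine-similarity check, with illustrative vectors standing in for a text query and two candidate document images (the values are made up to show the separation property, not drawn from the benchmark):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query    = np.array([0.8, 0.6, 0.0])  # stand-in text-query embedding
positive = np.array([0.7, 0.7, 0.1])  # matching document image
negative = np.array([0.0, 0.1, 0.9])  # non-matching document image

print(cosine(query, positive) > cosine(query, negative))  # True
```

A well-aligned joint space keeps matching text-image pairs close regardless of which modality each side came from.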

6. Applications and Impact

Jina-VDR enables rigorous evaluation of models in use cases such as:

  • Enterprise and legal retrieval of scanned reports, filings, or government standards
  • Academic cross-lingual retrieval in digital libraries spanning multimodal lecture slides and scientific diagrams
  • Technical and financial intelligence where tables, figures, mixed layouts, and linguistic diversity are the norm

By raising the bar for retrieval tasks—from single-modality, monolingual, or span-answering toward universal, visually-aware, format-agnostic, and multilingual retrieval—Jina-VDR establishes a new reference point for both research and industrial model selection. Open access to both the benchmark and the model encourages further methodological innovation, adaptation, and practical deployment across domains encountering visually complex, language-diverse information.

| Aspect | Detail |
|---|---|
| Design | Multilingual, visually-rich retrieval across 30+ diverse tasks, including images, charts, tables, and scans |
| Metrics | nDCG@5/10, Recall@5, STS Spearman |
| Model | jina-embeddings-v4, unified transformer; LoRA adapters for asymmetric/symmetric/code search |
| Impact | Advances the field by providing realistic, rigorous evaluation for universal, multimodal, multilingual retrieval |

Jina-VDR thus serves as a critical tool for the robust assessment and progress of retrieval architectures capable of handling the diversity and complexity characteristic of contemporary information ecosystems.
