
Jina-VDR Benchmark

Updated 1 July 2025
  • The Jina-VDR Benchmark is a comprehensive multilingual and multimodal evaluation suite for assessing visually rich document retrieval across 30+ real-world tasks.
  • It incorporates diverse document modalities like images, tables, charts, and scanned PDFs, alongside multilingual queries and documents, using metrics like nDCG@5.
  • Jina-VDR advances evaluation rigor by reflecting the complexity of contemporary global information retrieval, serving as a critical tool for model assessment and progress.

The Jina-VDR Benchmark is a comprehensive multilingual and multimodal evaluation suite designed to assess visually rich document retrieval across an extensive set of real-world tasks. Introduced alongside the jina-embeddings-v4 universal embedding model, Jina-VDR advances the evaluation of retrieval systems by incorporating complexity in both document modalities (images, charts, maps, diagrams, tables, rendered markdown, scanned PDFs) and supported languages, thereby reflecting the practical requirements of information retrieval in contemporary, globally distributed, and semantically diverse environments (2506.18902).

1. Scope and Construction of the Jina-VDR Benchmark

Jina-VDR is constructed as a novel, large-scale benchmark with the explicit goal of evaluating the capabilities of embedding models on realistic, visually and linguistically diverse retrieval tasks. Unlike legacy standards such as ViDoRe—which evaluated mostly English/French question-answering over tables and charts—Jina-VDR introduces over 30 new retrieval tasks. These encompass multiple domains, including legal, governmental, financial, academic, software/IT, housing, entertainment, and scientific materials.

The benchmark features:

  • Multilingual queries and documents spanning up to 20 languages in certain datasets (e.g., Github Readme, Wikimedia Commons Documents, AirBnB, TweetStock).
  • Document modalities such as fully rendered markdown docs, complex table layouts, scanned pages, infographics, advertisements, technical diagrams, maps, and page screenshots.
  • Query types ranging from descriptive searches and instructions to facts, extending beyond traditional span-based question answering.

Data sources are drawn from a combination of repurposed VQA/OCR datasets, custom manual annotation, and synthetic generation to ensure coverage in domains where high-quality human annotation is lacking. Each retrieval task is formulated as a match between a text query and a document image, aligning evaluation with actual use patterns in enterprise and public sector document search.

| Dataset Name | Domain | Document Format | Query Format | Languages |
|---|---|---|---|---|
| github-readme-retrieval-multilingual | Software/IT | Markdown docs | Description | ar, bn, de, en, ... |
| wikimedia-commons-documents-ml | Mixed | Mixed | Description | ar, bn, de, en, ... |
| airbnb-synthetic-retrieval | Housing | Tables | Instruction | ar, de, en, es, ... |

2. Model Architecture and Support for Jina-VDR

The principal model evaluated on Jina-VDR is jina-embeddings-v4, built on the Qwen2.5-VL multimodal LLM backbone. The architecture diverges from CLIP-style dual-encoder approaches by using a unified transformer accepting both tokenized text and discretized image vectors within the same sequential input space.

Distinctive architectural features include:

  • Processing paths for text (token embedding and transformer encoding) and image (discrete image tokenization and transformer encoding), supporting input sequences up to 32,768 text tokens and 20 megapixel images.
  • Output in two forms: a 2048-dimensional single-vector (truncatable via Matryoshka Representation Learning) and a 128-dimensional vector per input token/patch for use in late interaction retrieval scenarios.
  • Task-specific Low-Rank Adaptation (LoRA) adapters—compact (60M parameter) modules trained for various retrieval settings: asymmetric (query vs. document), symmetric (semantic similarity), and code (natural language-to-code or code-to-code).
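The truncatable single-vector output described above can be shortened without retraining. A minimal numpy sketch of Matryoshka-style truncation (the slice-then-renormalize step is the common convention for MRL embeddings; the dimensions are the ones stated above, but the code is illustrative, not the model's API):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length.

    Matryoshka Representation Learning trains the vector so that its
    leading coordinates already form a usable lower-dimensional embedding.
    """
    truncated = embedding[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norm, 1e-12, None)

# A stand-in for a 2048-dimensional single-vector embedding.
full = np.random.default_rng(0).normal(size=2048)
full /= np.linalg.norm(full)

small = truncate_matryoshka(full, 256)
print(small.shape)  # (256,)
```

Downstream indexes can therefore trade accuracy for storage simply by choosing how many leading dimensions to keep.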

Mathematically, the late interaction similarity score between query embeddings $\bm{q}_i$ and document embeddings $\bm{p}_j$ is defined as

$$s_\mathrm{late}(q, p) = \sum_{i=1}^{n} \max_{j \in \{1, \ldots, m\}} \bm{q}_i \bm{p}_j^{\top}$$

This approach enables precise region-level matching between query semantics and local visual features in complex documents.
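The MaxSim-style sum described above is straightforward to compute. A minimal numpy sketch with toy multi-vector embeddings (shapes and values are illustrative only):

```python
import numpy as np

def late_interaction_score(q: np.ndarray, p: np.ndarray) -> float:
    """s_late(q, p) = sum_i max_j q_i . p_j  (MaxSim over document vectors).

    q: (n, d) multi-vector query embedding, one row per query token
    p: (m, d) multi-vector document embedding, one row per token/patch
    """
    sims = q @ p.T                       # (n, m) pairwise dot products
    return float(sims.max(axis=1).sum())  # best document match per query row

q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
score = late_interaction_score(q, p)  # per-row maxima: 0.9 and 0.8
```

Each query token is scored against its best-matching document patch, which is what allows region-level matching inside a page image.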

3. Evaluation Metrics and Performance

The benchmark employs standard retrieval performance metrics:

  • nDCG@5 and nDCG@10 for ranked retrieval evaluation,
  • Recall@5 for cross-modal (CLIP-style) tasks,
  • Spearman correlation for semantic similarity (STS) scenarios.
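The primary metric, nDCG@k, can be sketched in a few lines. This simplified version assumes `relevances` holds the graded relevance of every judged candidate in retrieved order, so the ideal ranking can be derived by sorting that same list:

```python
import math

def ndcg_at_k(relevances, k=5):
    """Normalized discounted cumulative gain over the top-k results.

    relevances: graded relevance of each judged candidate, in the order
    the system retrieved them (simplified: all judged docs are listed).
    """
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0, 0]))  # already in ideal order -> 1.0
```

A perfect ranking scores 1.0; any relevant document pushed down the list discounts the score logarithmically by rank.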

In Jina-VDR, jina-embeddings-v4 achieves state-of-the-art results:

  • nDCG@5: 72.19 (single-vector); 79.29 (multi-vector/late interaction)
  • ViDoRe (legacy benchmark): 84.11 (single-vector); 90.17 (late interaction)
  • Wikimedia Commons (multilingual sub-benchmark): 65.79 (single); 74.50 (multi-vector)

Significant improvements are observed over OCR+BM25, CLIP, and ColPali-style VDR baselines, particularly in visually rich, structurally complex, or non-English datasets. This suggests that late interaction with multi-vector embeddings in a unified transformer space substantially enhances fine-grained retrieval in multi-modal, cross-domain scenarios.

4. LoRA Adapters and Task Specialization

LoRA adapters are integral to model adaptation without the overhead of retraining the entire backbone. Each adapter module is fine-tuned for:

  • Asymmetric query-document retrieval (improving tasks with short queries vs. long documents, default for Jina-VDR)
  • Symmetric semantic similarity, suitable for STS or paraphrase scenarios
  • Code retrieval, for NL-to-code or code-to-code search

Adapters are selected at runtime for a given retrieval task. This modularity enables strong task-specific performance and reusability across domains. The backbone remains frozen, enabling efficient training and inference cycles while adding well under 2% to the backbone's memory footprint.
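The arithmetic behind such an adapter is compact. A numpy sketch of a LoRA-augmented linear layer with the usual alpha/r scaling (the rank and scaling values here are illustrative, not the model's actual hyperparameters):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = W x + (alpha / r) * B (A x)

    W: (out, in) frozen backbone weight
    A: (r, in), B: (out, r) low-rank adapter with r << min(in, out);
    only A and B are trained, which keeps each task adapter small.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))          # frozen backbone weight
A = rng.normal(size=(2, 16)) * 0.01   # rank r = 2
B = np.zeros((8, 2))                  # B starts at zero: adapter is a no-op
x = rng.normal(size=16)

print(np.allclose(lora_forward(x, W, A, B), W @ x))  # True
```

Because B is initialized to zero, training starts from the backbone's unchanged behavior, and swapping adapters at runtime only swaps the small A/B pair.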

5. Cross-Modal and Multilingual Retrieval Capabilities

Jina-VDR explicitly challenges models to operate across modalities and languages. The unified architecture of jina-embeddings-v4, in which image and text enter the same LLM transformer and are output jointly, reduces the modality gap seen in dual-encoder systems. The model achieves:

  • Strong cross-modal alignment (evidenced by cosine-similarity separation between positive and negative matches)
  • Robust performance on retrieval tasks with queries and documents in mismatched or non-English languages
  • Effective handling of heterogeneous, mixed-media inputs typical of real-world document search

On tasks like rendered charts with textual instructions, the model demonstrates effective region-level alignment due to late-interaction multi-vector outputs. This capability is critical for visually-rich, real-world document retrieval beyond simple image-caption matching.
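The cross-modal alignment discussed above reduces, at scoring time, to comparing unit embeddings. A toy cosine-similarity check, with illustrative vectors standing in for a text query and two candidate document images (the values are made up to show the separation property, not drawn from the benchmark):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query    = np.array([0.8, 0.6, 0.0])  # stand-in text-query embedding
positive = np.array([0.7, 0.7, 0.1])  # matching document image
negative = np.array([0.0, 0.1, 0.9])  # non-matching document image

print(cosine(query, positive) > cosine(query, negative))  # True
```

A well-aligned joint space keeps matching text-image pairs close regardless of which modality each side came from.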

6. Applications and Impact

Jina-VDR enables rigorous evaluation of models in use cases such as:

  • Enterprise and legal retrieval of scanned reports, filings, or government standards
  • Academic cross-lingual retrieval in digital libraries spanning multimodal lecture slides and scientific diagrams
  • Technical and financial intelligence where tables, figures, mixed layouts, and linguistic diversity are the norm

By raising the bar for retrieval tasks—from single-modality, monolingual, or span-answering toward universal, visually-aware, format-agnostic, and multilingual retrieval—Jina-VDR establishes a new reference point for both research and industrial model selection. Open access to both the benchmark and the model encourages further methodological innovation, adaptation, and practical deployment across domains encountering visually complex, language-diverse information.

| Aspect | Detail |
|---|---|
| Design | Multilingual, visually-rich retrieval across 30+ diverse tasks, including images, charts, tables, and scans |
| Metrics | nDCG@5/10, Recall@5, STS Spearman |
| Model | jina-embeddings-v4, unified transformer; LoRA adapters for asymmetric/symmetric/code search |
| Impact | Advances the field by providing realistic, rigorous evaluation for universal, multimodal, multilingual retrieval |

Jina-VDR thus serves as a critical tool for the robust assessment and progress of retrieval architectures capable of handling the diversity and complexity characteristic of contemporary information ecosystems.
