SDS KoPub VDR: Korean Document Retrieval Benchmark
- SDS KoPub VDR is a comprehensive benchmark providing fine-grained evaluation for visual document retrieval in Korean public documents.
- It features a dataset of 361 PDFs (40,781 pages) with complex visual elements such as tables, charts, and diagrams, enabling modality-aware testing.
- The benchmark supports both text-only and multimodal retrieval tasks; baseline experiments show substantial Recall gains from multimodal retrieval across diverse domains.
SDS KoPub VDR is a large-scale, publicly available benchmark for evaluating Visual Document Retrieval (VDR) over complex, real-world Korean public documents. It is designed to rigorously test retrieval models in scenarios where documents exhibit dense visual information (tables, charts, diagrams, and complex multi-column layouts) and are written in a non-English language. The dataset enables modality-aware, fine-grained evaluation of both text-only and multimodal retrieval systems, targeting a gap left by previous benchmarks, which focus predominantly on English and on plain text or single-page visual reasoning.
1. Motivation and Problem Setting
Retrieval-Augmented Generation (RAG) frameworks rely on effective retrieval components to ground neural outputs in external knowledge collections. However, the reliability of such systems is contingent on the retrieval model’s competency across diverse document genres and modalities. Prior benchmarks, including SQuAD and Dense Passage Retrieval (DPR), are limited to English and primarily plain text scenarios. Existing Document Visual Question Answering (VQA) resources, such as DocVQA and InfographicVQA, assess single-page comprehension without addressing retrieval at scale or in non-English languages. Korean public documents—such as government white papers, legal guidelines, and statistical yearbooks—demand sophisticated approaches due to their mixture of visual and textual elements and specialized domain vocabulary. SDS KoPub VDR directly addresses these challenges by offering a standardized testbed for VDR in a non-English, visually intricate setting.
2. Dataset Construction and Statistics
SDS KoPub VDR comprises 361 PDF files totaling 40,781 pages, systematically curated from multiple official sources:
- 256 documents (KOGL Type 1) from national/local public agencies.
- 105 documents from official legal portals, including the Ministry of Government Legislation.
Each PDF is split into pages, with (file_id, page_id) tuples serving as unique retrieval units. The dataset spans six public domains: society, environment, education, industry, diplomacy, and finance. The visual element breakdown of all pages is as follows:
| Page Category | Number of Pages | Percentage |
|---|---|---|
| Pure text | 15,231 | 37.3% |
| Tables only | 16,340 | 40.1% |
| Tables + figures | 3,380 | 8.3% |
| Charts/graphs | 7,088 | 17.4% |
| Diagrams | 1,201 | 2.9% |
| Photographs | 921 | 2.3% |
This distribution, in which 62.7% of pages contain visual elements beyond plain text, makes the benchmark a demanding testbed for retrieval models.
3. Evaluation Queries and Reasoning Modalities
The benchmark includes a held-out evaluation set of 600 query–page–answer triples, constructed as follows:
- Automated Query Generation: Multimodal LLMs (GPT-4o, Qwen2.5-VL-72B) generated candidate QA triples using a blend of instruction-based, persona-augmented, and dynamic few-shot prompting.
- Automated Filtering: Retriever-based (BM25) filtering and semantic grounding checks via LLMs (GPT-4.5) were used to eliminate ungrounded or redundant queries (see the filtering sketch after this list).
- Manual Verification: Domain experts conducted thorough checks for clarity, factual faithfulness, and page referencing accuracy.
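To make the retriever-based filtering step concrete, the sketch below keeps a generated query only if BM25 ranks the query's own source page among the top results. This is a minimal reconstruction under stated assumptions, not the paper's pipeline: the `QATriple` structure, the whitespace tokenizer, and the `top_k` threshold are illustrative (real Korean text would call for a morphological tokenizer).

```python
# Hedged sketch of BM25-based grounding filter; QATriple, the corpus layout,
# and the top-k threshold are assumptions for illustration only.
from dataclasses import dataclass
from rank_bm25 import BM25Okapi  # pip install rank-bm25


@dataclass
class QATriple:
    query: str
    page_idx: int  # index of the source page within `pages`
    answer: str


def filter_grounded(triples: list[QATriple], pages: list[str], top_k: int = 10) -> list[QATriple]:
    """Keep only triples whose query retrieves its own source page in the top k."""
    bm25 = BM25Okapi([p.split() for p in pages])  # naive whitespace tokenization
    kept = []
    for t in triples:
        scores = bm25.get_scores(t.query.split())
        top = sorted(range(len(pages)), key=scores.__getitem__, reverse=True)[:top_k]
        if t.page_idx in top:
            kept.append(t)
    return kept
```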
Queries are uniformly divided across the six domains (100 per domain) and classified according to required reasoning modality:
- Text-only (103 queries): Answered exclusively via paragraph-level textual content.
- Visual (161 queries): Resolved through tables, graphs, or diagrams alone.
- Cross-modal (336 queries): Answered only by integrating reasoning over both text and visual content.
This structure allows for precise attribution of model weaknesses to specific reasoning modalities.
4. Retrieval Tasks, Methodologies, and Metrics
SDS KoPub VDR supports two principal retrieval tasks:
Task 1: Text-only Retrieval
Pages and queries are embedded using a text-only encoder $f$, with document text extracted by OCR. Given a query $q$ and a document page's text $d$, similarity is computed as the cosine similarity

$$s(q, d) = \frac{f(q) \cdot f(d)}{\lVert f(q) \rVert \, \lVert f(d) \rVert}.$$

All embeddings are indexed with FAISS, and pages are ranked by this score; a minimal implementation sketch follows.
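The sketch below uses one of the listed text-only baselines (BGE-M3, loaded via sentence-transformers) purely for concreteness; the page texts and query are placeholders, and the benchmark's exact encoder invocation may differ.

```python
# Minimal Task 1 sketch: embed OCR'd page text, index with FAISS, rank by
# cosine similarity (inner product over L2-normalized vectors).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")  # one of the text-only baselines
page_texts = ["...OCR text of page 1...", "...OCR text of page 2..."]  # one string per (file_id, page_id)

page_vecs = encoder.encode(page_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(page_vecs.shape[1])      # inner product == cosine after normalization
index.add(np.asarray(page_vecs, dtype="float32"))

query_vec = encoder.encode(["2022년 환경부 예산 규모는?"], normalize_embeddings=True)
scores, ranked_pages = index.search(np.asarray(query_vec, dtype="float32"), k=10)  # real corpus has 40,781 pages
```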
Task 2: Multimodal Retrieval
Each page is mapped to a joint embedding $e_p = g(p)$ that fuses the page's visual layout and textual content, and each query is projected into the same multimodal space as $e_q = g_q(q)$. Similarity is again the cosine

$$s(q, p) = \frac{e_q \cdot e_p}{\lVert e_q \rVert \, \lVert e_p \rVert},$$

and ranking is likewise performed with FAISS over the multimodal embeddings (see the sketch below).
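A comparable sketch for the multimodal task, using an off-the-shelf CLIP-style encoder from sentence-transformers as a stand-in for the benchmark's multimodal baselines. The model choice and file paths are assumptions, and CLIP's base text tower is English-only, so Korean queries would require a multilingual variant.

```python
# Minimal Task 2 sketch: embed rendered page images and the query into one
# joint space, then rank with FAISS exactly as in Task 1.
import faiss
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # stand-in; not one of the paper's baselines
page_images = [Image.open(f"pages/page_{i}.png") for i in range(3)]  # rendered PDF pages (placeholder paths)

page_vecs = model.encode(page_images, normalize_embeddings=True)  # image tower
index = faiss.IndexFlatIP(page_vecs.shape[1])
index.add(np.asarray(page_vecs, dtype="float32"))

query_vec = model.encode(["annual budget summary table"], normalize_embeddings=True)  # text tower
scores, ranked_pages = index.search(np.asarray(query_vec, dtype="float32"), k=3)
```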
Evaluation metrics are standard information retrieval measures. Writing $|Q|$ for the number of queries and $\mathrm{rank}_i$ for the position of query $i$'s gold page in the ranked list:
- Mean Reciprocal Rank: $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
- Recall@k: $\mathrm{Recall@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}\left[\mathrm{rank}_i \le k\right]$, i.e., the fraction of queries whose gold page appears in the top $k$.
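Both metrics reduce to a few lines given ranked page lists and gold labels; a minimal sketch, assuming one gold page per query as in this benchmark's single-page setting:

```python
# MRR and Recall@k over ranked results; each query has exactly one gold page.
def mrr(ranked_ids: list[list[int]], gold_ids: list[int]) -> float:
    """ranked_ids[i] is the ranked page list for query i; gold_ids[i] its gold page."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)  # ranks are 1-based
    return total / len(gold_ids)


def recall_at_k(ranked_ids: list[list[int]], gold_ids: list[int], k: int) -> float:
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Example: gold page ranked 2nd for query 0 and 1st for query 1.
print(mrr([[7, 3, 9], [4, 1]], [3, 4]))               # 0.75
print(recall_at_k([[7, 3, 9], [4, 1]], [3, 4], k=1))  # 0.5
```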
5. Baselines and Experimental Results
SDS KoPub VDR was used to benchmark several retrieval models, including:
- Four multilingual text-only encoders: BGE-M3, Kanana-Nano-2.1B, Qwen3-0.6B, OpenAI text-embedding-3-large.
- Three off-the-shelf multimodal encoders: DSE-Qwen2-2B-MRL, Nomic-Embed-Multimodal-7B, Jina-Embeddings-v4.
- A custom multimodal model: SDS-Multimodal-Embedding-7B (Qwen2.5-VL-7B, fine-tuned on Korean public sources).
Task 1: Text-only Retrieval (Recall@k, best model SDS-Multimodal-Embedding-7B)
- R@1 = 0.54
- R@3 = 0.77
- R@5 = 0.83
- R@10 = 0.89
Comparative performances:
- Jina-Embeddings-v4 R@3 = 0.71
- BGE-M3 R@3 = 0.68
- Kanana-Nano R@3 = 0.66
Task 2: Multimodal Retrieval (Recall@k, best model SDS-Multimodal-Embedding-7B)
- R@1 = 0.63
- R@3 = 0.86
- R@5 = 0.90
- R@10 = 0.95
Comparative multimodal model scores:
- Nomic-Embed-7B R@5 = 0.74
- Jina-v4 R@5 = 0.74
- DSE-Qwen2-2B R@5 = 0.46
Key empirical findings include:
- An 8.4 percentage-point absolute gain in Recall@5 when the same architecture additionally ingests page images.
- Text-only retrieval lags behind, particularly on visual and cross-modal queries (R@1 ≈ 0.38–0.46), whereas multimodal retrieval performs substantially better (visual queries: R@1 ≈ 0.50–0.58, R@3 ≈ 0.84–0.86).
- Visually dense domains (Finance, Diplomacy, Environment) exhibit the most pronounced improvements from multimodal indexing, underscoring the crucial semantics encoded in color, layout, and visual elements.
6. Applications and Research Significance
The resource includes a reproducible pipeline for PDF preprocessing, QA metadata construction, and open licensing, facilitating deployment within novel retrieval and RAG workflows. SDS KoPub VDR supports:
- Government transparency platforms for targeted passage retrieval from policy documents.
- Legal information systems focused on statutes, regulations, and case law.
- Public administrative assistants that synthesize heterogeneous data modalities in response to complex information requests.
Its non-English, visually complex focus extends the evaluation landscape for VDR systems and enables analysis under challenging, real-world scenarios.
7. Limitations and Prospective Expansion
Current limitations include:
- Dataset scale: 600 single-hop QA pairs across six domains.
- Query realism: reliance on machine-generated prompts may limit coverage of colloquial or user-driven query distributions.
- Retrieval complexity: evaluation is restricted to single-page retrieval with no multi-hop or cross-document reasoning tasks.
Future expansions are planned to:
- Increase QA pair coverage into the thousands, introducing new areas such as healthcare and defense.
- Develop multi-hop and cross-document retrieval under noisy OCR conditions.
- Launch a public leaderboard and transition toward an end-to-end multimodal RAG benchmark pipeline, encompassing the full answer generation lifecycle.
In summary, SDS KoPub VDR establishes a rigorous reference point for evaluating visual document retrieval in non-English, highly visual official documents, revealing significant headroom for cross-modal modeling and retrieval, and providing a roadmap for future research and system development in multimodal document intelligence.