NL-DIR: Natural Language Document Image Retrieval
- NL-DIR is a document retrieval paradigm that maps both images and natural language queries into a shared embedding space for fine-grained matching.
- The paradigm is evaluated with contrastive vision-language models and OCR-free visual document understanding models on a benchmark of 41,795 document images paired with 205,000 natural-language queries across 247 categories.
- It employs a two-stage pipeline with fast recall using FAISS and cross-attention re-ranking to enhance precision and efficiency in retrieval tasks.
Natural Language-based Document Image Retrieval (NL-DIR) is a paradigm in document retrieval that targets the retrieval of document images from a corpus using semantically rich natural language queries. Unlike classic DIR approaches, which often rely on image queries and retrieve only within coarse semantic categories, NL-DIR operates on real-world queries that express fine-grained details, intent, and visual elements, thereby supporting retrieval scenarios better aligned with user needs in contemporary information management systems (Guo et al., 23 Dec 2025, Osmulski et al., 16 May 2025).
1. Formal Problem Definition
NL-DIR is defined over a corpus of document images $\mathcal{D}$ and a query set $\mathcal{Q}$ of natural-language prompts. Two embedding functions are learned,

$$f_{\text{img}}: \mathcal{D} \rightarrow \mathbb{R}^{m}, \qquad f_{\text{q}}: \mathcal{Q} \rightarrow \mathbb{R}^{m},$$

which map both images and queries into a shared $m$-dimensional latent space. Retrieval for a query $q$ consists of finding

$$d^{*} = \arg\max_{d \in \mathcal{D}} \; s\big(f_{\text{q}}(q),\, f_{\text{img}}(d)\big),$$

where $s(\cdot,\cdot)$ is typically dot-product or cosine similarity (Guo et al., 23 Dec 2025). In multilingual settings, each query $q_i$ has a single ground-truth page $p_i^{+}$ in the page corpus $\mathcal{P}$; with $\sigma$ a scoring function, the retrieval task seeks

$$\hat{p}_i = \arg\max_{p \in \mathcal{P}} \; \sigma(q_i, p),$$

such that ideally $\hat{p}_i = p_i^{+}$ (Osmulski et al., 16 May 2025).
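In code, this shared-embedding formulation amounts to a nearest-neighbor search over precomputed vectors. The following is a minimal sketch (not the authors' implementation), assuming L2-normalized embeddings so that dot product coincides with cosine similarity:

```python
import numpy as np

def retrieve(query_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 10):
    """Rank gallery images for one query in the shared embedding space.

    query_emb:  (m,)   L2-normalized query embedding f_q(q).
    image_embs: (N, m) L2-normalized image embeddings f_img(d_i).
    Returns indices of the top_k images by cosine/dot-product similarity.
    """
    scores = image_embs @ query_emb        # s(f_q(q), f_img(d_i)) for all i
    return np.argsort(-scores)[:top_k]     # argmax generalized to top-k
```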
2. Datasets and Annotation Protocols
NL-DIR Dataset (Guo et al., 23 Dec 2025)
- Composition: 41,795 authentic scanned document images, each paired with five distinctive, fine-grained semantic queries, yielding 205,000 queries.
- Categories: 247 classes, e.g., letters, forms, reports, advertisements, charts (with 15 largest categories detailed).
- Splits: Train/val/test (8:1:1), with 4,180 test images serving as the benchmark.
- Query Generation: OCR text and layout cues are extracted automatically; ChatGPT synthesizes 10 candidate queries per image, which are scored by ChatGPT, Qwen-VL-Plus, CLIP, and BLIP with 3:3:2:2 weighted aggregation (sketched after this list), followed by manual inspection and curation.
- Semantic Richness: High query-OCR text overlap (≥3 words), but queries generalize to encompass layout, visuals, and intent beyond raw text.
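The 3:3:2:2 weighted aggregation in the query-generation step might look roughly like the following. The scorer interfaces, score normalization, and the `select_best_queries` helper are hypothetical and illustrate only the weighting scheme, not the published pipeline:

```python
# Hypothetical sketch of the 3:3:2:2 weighted scoring of LLM-generated query candidates.
# Per-scorer scores are assumed to be pre-normalized to [0, 1].
WEIGHTS = {"chatgpt": 3, "qwen_vl_plus": 3, "clip": 2, "blip": 2}

def aggregate_score(scores: dict) -> float:
    """Weighted mean of per-scorer scores for one candidate query."""
    total_w = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_w

def select_best_queries(candidates: list, per_scorer_scores: list, keep: int = 5):
    """Keep the top-`keep` of the 10 candidates per image before manual curation."""
    ranked = sorted(zip(candidates, per_scorer_scores),
                    key=lambda cs: aggregate_score(cs[1]), reverse=True)
    return [c for c, _ in ranked[:keep]]
```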
MIRACL-VISION Benchmark (Osmulski et al., 16 May 2025)
- Corpus: Multilingual, spanning 18 languages (English, Chinese, Arabic, Hindi, etc.), 7,900 queries, and 18,819 document page images per language.
- Source: First-page screenshots of Wikipedia articles; queries are human-generated (native speaker annotation).
- Negative Sampling: “easy” negatives are excluded via dense multilingual text-embedding ranks, keeping retrieval challenging while shrinking the candidate corpus (a pruning sketch follows this list).
- Annotation Process: Ensures one ground-truth page per query, with diverse queries not limited to paraphrases.
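The easy-negative filtering can be approximated as below. This is a schematic sketch that assumes a dense text-embedding model ranks all pages per query and only the hardest negatives plus the ground-truth page are retained; the benchmark's exact corpus-reduction procedure and cutoffs may differ:

```python
import numpy as np

def prune_corpus(query_embs, page_embs, positives, hard_per_query=100):
    """Keep ground-truth pages plus the hardest negatives per query.

    query_embs: (Q, m) normalized query embeddings from a text embedding model.
    page_embs:  (P, m) normalized page-text embeddings.
    positives:  list of ground-truth page indices, one per query.
    """
    keep = set(positives)
    scores = query_embs @ page_embs.T                 # (Q, P) similarity matrix
    for qi, pos in enumerate(positives):
        ranked = np.argsort(-scores[qi])              # pages from hardest to easiest
        hard = [p for p in ranked if p != pos][:hard_per_query]
        keep.update(hard)                             # easy negatives are dropped
    return sorted(keep)
```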
3. Retrieval Pipeline Architectures
Baseline Modalities
- Contrastive vision-language models (VLMs) (Guo et al., 23 Dec 2025): two-tower architectures (CLIP, BLIP, DFN, SigLIP, InternVL-14B).
  - Pretrained with InfoNCE or sigmoid contrastive loss, mainly on natural images plus some web/OCR data.
- OCR-free Visual Document Understanding (VDU) models (Donut, Nougat, Pix2Struct, DocOwl1.5, UReader, TextMonkey, Qwen2-VL, DSE, ColPali).
  - Generative, decoder-only models aimed at structured output or text extraction, without explicit contrastive alignment.
  - Adapted for zero-shot retrieval via mean-pooled embeddings and dot-product scoring, as sketched below.
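A minimal sketch of this mean-pooling adaptation, assuming access to a backbone's last hidden states and attention mask; the `mean_pooled_embedding` helper is illustrative and not taken from the cited models:

```python
import torch
import torch.nn.functional as F

def mean_pooled_embedding(hidden_states: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """Collapse (batch, seq_len, dim) hidden states into one vector per input.

    Padding positions are excluded via the attention mask, and the result is
    L2-normalized so that dot product equals cosine similarity.
    """
    mask = attention_mask.unsqueeze(-1).float()       # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)        # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-6)          # avoid divide-by-zero
    return F.normalize(summed / counts, dim=-1)

# Zero-shot scoring: dot product between pooled query and pooled page embeddings.
# scores = mean_pooled_embedding(q_h, q_mask) @ mean_pooled_embedding(d_h, d_mask).T
```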
Multilingual Pipeline (MIRACL-VISION) (Osmulski et al., 16 May 2025)
- Text-only: Dense text encoders (multilingual-e5-large, bge-m3, gte-multilingual-base, Llama-3 tuned for embeddings).
- Visual embeddings (VLM-based): a vision encoder (e.g., SigLIP) encodes page images and a text encoder encodes queries, with a projection MLP aligning both in a shared embedding space; models are trained on contrastive pairs.
- Precomputed Embeddings: FAISS performs fast top-$k$ candidate selection via cosine similarity, as sketched below.
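A minimal sketch of the precomputed-embedding recall step with FAISS, assuming L2-normalized vectors so that inner-product search equals cosine similarity; the exact index type and configuration used in the benchmark are not specified here:

```python
import faiss
import numpy as np

def build_index(page_embs: np.ndarray) -> faiss.Index:
    """Exact inner-product index over L2-normalized page embeddings."""
    embs = np.ascontiguousarray(page_embs, dtype=np.float32)
    faiss.normalize_L2(embs)                  # in-place normalization
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def recall_top_k(index: faiss.Index, query_embs: np.ndarray, k: int = 100):
    """Return (scores, indices) of the top-k pages per query."""
    q = np.ascontiguousarray(query_embs, dtype=np.float32)
    faiss.normalize_L2(q)
    return index.search(q, k)                 # cosine-similarity ranking
```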
4. Two-Stage Retrieval and Re-Ranking
Coarse Recall (Guo et al., 23 Dec 2025)
- Precompute $f_{\text{img}}(d)$ for all gallery images and store the vectors in a FAISS index (0.07 s per image to encode, roughly 4 KB per image to store; cf. Section 6).
- For a query $q$, encode $f_{\text{q}}(q)$ and retrieve the top-100 candidates by dot-product similarity.
Fine Re-Ranking
- For the top-100 candidates, apply cross-modal interaction models to compute a refined score $s(q, d)$.
- Models: BLIP-ITM and Pix2Struct+ITM with additional cross-attention, trained with hard-negative mining (top-10 hard negatives; a mix of pointwise and pairwise losses, sketched after the pseudocode below).
- Re-ranking takes about 0.2 s per query for 100 candidate images (processed in parallel).
Retrieval pseudocode:
```python
# Two-stage retrieval: FAISS coarse recall followed by cross-attention re-ranking.
results = {}
for q in test_queries:
    candidates = recall_top100(q)                       # stage 1: top-100 by dot product
    scores = {d: CrossAttentionITM(f_q(q), f_img(d))    # stage 2: ITM re-scoring
              for d in candidates}
    results[q] = sort_by_score(candidates, scores)      # rank by refined score
```
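The mixed pointwise/pairwise objective with top-10 hard negatives mentioned above could look roughly like the following; the margin value, loss weighting, and ITM-head interface are illustrative assumptions rather than the published training recipe:

```python
import torch
import torch.nn.functional as F

def reranker_loss(pos_logit: torch.Tensor,
                  neg_logits: torch.Tensor,
                  margin: float = 0.2,
                  pair_weight: float = 0.5) -> torch.Tensor:
    """Mix a pointwise ITM term with a pairwise ranking term.

    pos_logit:  (B,)   matching logit for the ground-truth image.
    neg_logits: (B, K) logits for the top-K mined hard negatives (K=10 here).
    """
    # Pointwise: binary cross-entropy, positive vs. hard negatives.
    pointwise = (F.binary_cross_entropy_with_logits(pos_logit, torch.ones_like(pos_logit))
                 + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits)))
    # Pairwise: the positive should beat every hard negative by a margin.
    pairwise = F.relu(margin - (pos_logit.unsqueeze(1) - neg_logits)).mean()
    return pointwise + pair_weight * pairwise
```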
5. Evaluation Metrics and Experimental Comparisons
Metrics
- Recall@$K$: the fraction of queries whose ground-truth document appears among the top-$K$ retrieved results.
- MRR@$K$: $\frac{1}{|\mathcal{Q}|}\sum_{i}\frac{1}{\mathrm{rank}_i}$, with the reciprocal rank truncated to zero when $\mathrm{rank}_i > K$.
- mAP and NDCG@$K$ (optional; standard IR metrics) (Guo et al., 23 Dec 2025, Osmulski et al., 16 May 2025).
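Because each query has a single relevant document in both benchmarks, these metrics reduce to simple functions of the ground-truth rank. A minimal sketch, assuming 1-based ranks and binary relevance:

```python
import math

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth document is ranked within top-k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks, k):
    """Mean reciprocal rank, truncated to zero beyond rank k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    """Binary-relevance NDCG with a single relevant document (IDCG = 1)."""
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)

# Example: ground-truth ranks for four queries.
ranks = [1, 3, 12, 2]
print(recall_at_k(ranks, 10), mrr_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```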
Results on NL-DIR (Guo et al., 23 Dec 2025)
- Zero-shot: SigLIP-So400m reaches Recall@1 = 36.17 and Recall@10 = 61.18 with MRR@10 = 43.78; InternVL-14B yields lower recall, and VDU models have near-zero recall.
- Fine-tuning: SigLIP-Text-LoRA reaches Recall@1 = 54.3 and Recall@10 = 79.4; SigLIP-Image-LoRA reaches Recall@1 = 69.3 and Recall@10 = 89.7, with the best reported recall reaching 97.52.
- Re-ranking: zero-shot ColPali reaches Recall@1 = 79.6 and Recall@10 = 91.6 with MRR = 83.8; a fine-tuned BLIP-ITM re-ranker matches ColPali with fourfold lower storage.
- Subset analysis: concrete queries yield higher recall (86.2, 97.2, and 99.4 at increasing cutoffs after re-ranking), while abstract queries are retrieved less effectively.
Multilingual Gap in MIRACL-VISION (Osmulski et al., 16 May 2025)
- Best text embedding: NDCG@10=0.7964.
- Best VLM: NDCG@10=0.5283, roughly 34% lower. The text-versus-vision gap is 12.1% on English-only evaluation and reaches 59.7% when averaged across languages.
- VLM performance degrades sharply in non-Latin scripts and resource-poor languages.
| Language | Text NDCG@10 | Vision NDCG@10 | Drop (%) |
|---|---|---|---|
| Arabic | 0.8883 | 0.4888 | 45.0 |
| Hindi | 0.7581 | 0.3127 | 58.8 |
| English | 0.7348 | 0.6784 | 7.7 |
| Telugu | 0.9090 | 0.0893 | 90.2 |
6. Efficiency and Practical Considerations
- Offline encoding: SigLIP 0.07s/image, OCR-IR (Tesseract+BGE) 2.56s, DSE 0.62s, ColPali 0.65s per image (Guo et al., 23 Dec 2025).
- Storage: SigLIP 4 KB/image; BGE 3 KB; DSE 6 KB; ColPali 256 KB (a back-of-envelope corpus total is sketched after this list).
- Online query/rerank: Dot-product selection is sub-ms; reranking 0.2 s for 100 images.
- MIRACL-VISION prunes easy negatives via dense text-embedding ranking to keep evaluation tractable yet challenging (Osmulski et al., 16 May 2025).
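To put the per-image storage figures above in context, a back-of-envelope calculation for the 41,795-image NL-DIR gallery (illustrative arithmetic only):

```python
# Approximate total index sizes for the 41,795-image NL-DIR gallery,
# using the per-image storage figures reported above.
GALLERY = 41_795
per_image_kb = {"SigLIP": 4, "BGE (OCR text)": 3, "DSE": 6, "ColPali": 256}

for name, kb in per_image_kb.items():
    total_mb = GALLERY * kb / 1024
    print(f"{name:>14}: {total_mb:8.1f} MB")   # SigLIP ~163 MB vs ColPali ~10,449 MB
```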
7. Challenges, Insights, and Future Directions
Key Technical Obstacles
- Fine-grained semantics require models robust to domain shift from natural images to complex layouts, error-prone OCR, and visually rich content.
- Multilingual VLMs lag text-only models, with retrieval efficiency and semantic alignment suffering in non-Latin scripts due to training data gaps and OCR limitations.
Observations
- Contrastive VLMs pretrained on image–OCR pairs (e.g., SigLIP) excel in the zero-shot setting; cross-attention layers in the re-ranker crucially improve fine-grained matching (Guo et al., 23 Dec 2025).
- OCR-free VDU models fare comparatively better when visual information dominates (e.g., charts, handwriting, cartoons).
Recommendations (Osmulski et al., 16 May 2025)
- Build benchmarks from human-annotated queries, ensuring linguistic and script diversity; couple image and text versions for cross-modal gap analysis.
- Incorporate hard-negative mining to maintain retrieval challenge and tractability.
- Improve VLM accuracy via domain-adaptive multilingual fine-tuning, multilingual page image augmentation, OCR-aware vision encoder training, and hybrid fusion of visual/text embeddings.
Future Research Directions (Guo et al., 23 Dec 2025)
- Develop specialized, lightweight cross-modal alignment architectures tuned for document genres.
- Integrate generative VDU and cross-attention in unified rerankers.
- Extend to multi-page retrieval and paragraph-level localization.
- Scale datasets and pre-training on document-centric corpora to bridge existing domain gaps.
- This suggests that addressing semantic and script diversity, while reducing reliance on synthetic LLM-generated queries, is essential for robust, real-world NL-DIR deployment.
Summary
Natural Language-based Document Image Retrieval represents a significant advancement for document access via expressive natural language, characterized by challenging datasets, precise evaluation protocols, and sophisticated cross-modal architectures. The NL-DIR and MIRACL-VISION benchmarks provide comprehensive testbeds, reflecting real-world complexity and multilingual reach. Experimental and efficiency analyses reveal contrastive VLMs and two-stage retrieval pipelines as current state-of-the-art, with clear opportunities for progress in multilingual modeling, fine-grained semantics, and scalable architecture design (Guo et al., 23 Dec 2025, Osmulski et al., 16 May 2025).