NL-DIR: Natural Language Document Image Retrieval
- NL-DIR is a document retrieval paradigm that maps both images and natural language queries into a shared embedding space for fine-grained matching.
- The paradigm is evaluated with contrastive vision-language models and OCR-free visual document understanding models on a benchmark of 41,795 document images paired with 205,000 natural-language queries across 247 categories.
- It employs a two-stage pipeline with fast recall using FAISS and cross-attention re-ranking to enhance precision and efficiency in retrieval tasks.
Natural Language-based Document Image Retrieval (NL-DIR) is a paradigm in document retrieval that targets the retrieval of document images from a corpus using semantically rich natural language queries. Unlike classic DIR approaches, which often rely on image queries and retrieve only within coarse semantic categories, NL-DIR operates on real-world queries that express fine-grained details, intent, and visual elements, thereby supporting retrieval scenarios better aligned with user needs in contemporary information management systems (Guo et al., 23 Dec 2025, Osmulski et al., 16 May 2025).
1. Formal Problem Definition
NL-DIR is defined over a corpus of document images $\mathcal{D}$ and a query set $\mathcal{Q}$ of natural-language prompts. Two embedding functions are learned,

$$f_{\text{img}}: \mathcal{D} \rightarrow \mathbb{R}^{m}, \qquad f_{\text{q}}: \mathcal{Q} \rightarrow \mathbb{R}^{m},$$

which map both images and queries into a shared $m$-dimensional latent space. Retrieval for a query $q$ consists of finding

$$d^{*} = \arg\max_{d \in \mathcal{D}} \; s\big(f_{\text{q}}(q),\, f_{\text{img}}(d)\big),$$

where $s(\cdot,\cdot)$ is typically dot-product or cosine similarity (Guo et al., 23 Dec 2025). In multilingual settings, each query $q_i$ has a single ground-truth page $p_i^{+}$ in the page corpus $\mathcal{P}$; with $\sigma$ a scoring function, the retrieval task seeks

$$\hat{p}_i = \arg\max_{p \in \mathcal{P}} \; \sigma(q_i, p),$$

such that ideally $\hat{p}_i = p_i^{+}$ (Osmulski et al., 16 May 2025).
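In code, this shared-embedding formulation amounts to a nearest-neighbor search over precomputed vectors. The following is a minimal sketch (not the authors' implementation), assuming L2-normalized embeddings so that dot product coincides with cosine similarity:

```python
import numpy as np

def retrieve(query_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 10):
    """Rank gallery images for one query in the shared embedding space.

    query_emb:  (m,)   L2-normalized query embedding f_q(q).
    image_embs: (N, m) L2-normalized image embeddings f_img(d_i).
    Returns indices of the top_k images by cosine/dot-product similarity.
    """
    scores = image_embs @ query_emb        # s(f_q(q), f_img(d_i)) for all i
    return np.argsort(-scores)[:top_k]     # argmax generalized to top-k
```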
2. Datasets and Annotation Protocols
NL-DIR Dataset (Guo et al., 23 Dec 2025)
- Composition: 41,795 authentic scanned document images, each paired with five distinctive, fine-grained semantic queries, yielding 205,000 queries.
- Categories: 247 classes, e.g., letters, forms, reports, advertisements, charts (with 15 largest categories detailed).
- Splits: Train/val/test (8:1:1), with 4,180 test images serving as the benchmark.
- Query Generation: OCR text and layout cues are extracted automatically; ChatGPT synthesizes 10 candidate queries per image, which are scored by ChatGPT, Qwen-VL-Plus, CLIP, and BLIP with 3:3:2:2 weighted aggregation (sketched after this list), followed by manual inspection and curation.
- Semantic Richness: High query-OCR text overlap (≥3 words), but queries generalize to encompass layout, visuals, and intent beyond raw text.
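The 3:3:2:2 weighted aggregation in the query-generation step might look roughly like the following. The scorer interfaces, score normalization, and the `select_best_queries` helper are hypothetical and illustrate only the weighting scheme, not the published pipeline:

```python
# Hypothetical sketch of the 3:3:2:2 weighted scoring of LLM-generated query candidates.
# Per-scorer scores are assumed to be pre-normalized to [0, 1].
WEIGHTS = {"chatgpt": 3, "qwen_vl_plus": 3, "clip": 2, "blip": 2}

def aggregate_score(scores: dict) -> float:
    """Weighted mean of per-scorer scores for one candidate query."""
    total_w = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_w

def select_best_queries(candidates: list, per_scorer_scores: list, keep: int = 5):
    """Keep the top-`keep` of the 10 candidates per image before manual curation."""
    ranked = sorted(zip(candidates, per_scorer_scores),
                    key=lambda cs: aggregate_score(cs[1]), reverse=True)
    return [c for c, _ in ranked[:keep]]
```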
MIRACL-VISION Benchmark (Osmulski et al., 16 May 2025)
- Corpus: Multilingual, spanning 18 languages (English, Chinese, Arabic, Hindi, etc.), 7,900 queries, and 18,819 document page images per language.
- Source: First-page screenshots of Wikipedia articles; queries are human-generated (native speaker annotation).
- Negative Sampling: “easy” negatives are excluded via dense multilingual text-embedding ranks, keeping retrieval challenging while shrinking the candidate corpus (a pruning sketch follows this list).
- Annotation Process: Ensures one ground-truth page per query, with diverse queries not limited to paraphrases.
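The easy-negative filtering can be approximated as below. This is a schematic sketch that assumes a dense text-embedding model ranks all pages per query and only the hardest negatives plus the ground-truth page are retained; the benchmark's exact corpus-reduction procedure and cutoffs may differ:

```python
import numpy as np

def prune_corpus(query_embs, page_embs, positives, hard_per_query=100):
    """Keep ground-truth pages plus the hardest negatives per query.

    query_embs: (Q, m) normalized query embeddings from a text embedding model.
    page_embs:  (P, m) normalized page-text embeddings.
    positives:  list of ground-truth page indices, one per query.
    """
    keep = set(positives)
    scores = query_embs @ page_embs.T                 # (Q, P) similarity matrix
    for qi, pos in enumerate(positives):
        ranked = np.argsort(-scores[qi])              # pages from hardest to easiest
        hard = [p for p in ranked if p != pos][:hard_per_query]
        keep.update(hard)                             # easy negatives are dropped
    return sorted(keep)
```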
3. Retrieval Pipeline Architectures
Baseline Modalities
- Contrastive vision-language models (VLMs) (Guo et al., 23 Dec 2025): two-tower architectures (CLIP, BLIP, DFN, SigLIP, InternVL-14B).
  - Pretrained with InfoNCE or sigmoid contrastive loss, mainly on natural images plus some web/OCR data.
- OCR-free Visual Document Understanding (VDU) models (Donut, Nougat, Pix2Struct, DocOwl1.5, UReader, TextMonkey, Qwen2-VL, DSE, ColPali).
  - Generative, decoder-only models aimed at structured output or text extraction, without explicit contrastive alignment.
  - Adapted for zero-shot retrieval via mean-pooled embeddings and dot-product scoring, as sketched below.
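A minimal sketch of this mean-pooling adaptation, assuming access to a backbone's last hidden states and attention mask; the `mean_pooled_embedding` helper is illustrative and not taken from the cited models:

```python
import torch
import torch.nn.functional as F

def mean_pooled_embedding(hidden_states: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """Collapse (batch, seq_len, dim) hidden states into one vector per input.

    Padding positions are excluded via the attention mask, and the result is
    L2-normalized so that dot product equals cosine similarity.
    """
    mask = attention_mask.unsqueeze(-1).float()       # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)        # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-6)          # avoid divide-by-zero
    return F.normalize(summed / counts, dim=-1)

# Zero-shot scoring: dot product between pooled query and pooled page embeddings.
# scores = mean_pooled_embedding(q_h, q_mask) @ mean_pooled_embedding(d_h, d_mask).T
```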
Multilingual Pipeline (MIRACL-VISION) (Osmulski et al., 16 May 2025)
- Text-only: Dense text encoders (multilingual-e5-large, bge-m3, gte-multilingual-base, Llama-3 tuned for embeddings).
- Visual embeddings (VLM-based): a vision encoder (e.g., SigLIP) encodes page images and a text encoder encodes queries, with a projection MLP aligning both in a shared embedding space; models are trained on contrastive pairs.
- Precomputed Embeddings: FAISS performs fast top-$k$ candidate selection via cosine similarity, as sketched below.
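A minimal sketch of the precomputed-embedding recall step with FAISS, assuming L2-normalized vectors so that inner-product search equals cosine similarity; the exact index type and configuration used in the benchmark are not specified here:

```python
import faiss
import numpy as np

def build_index(page_embs: np.ndarray) -> faiss.Index:
    """Exact inner-product index over L2-normalized page embeddings."""
    embs = np.ascontiguousarray(page_embs, dtype=np.float32)
    faiss.normalize_L2(embs)                  # in-place normalization
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def recall_top_k(index: faiss.Index, query_embs: np.ndarray, k: int = 100):
    """Return (scores, indices) of the top-k pages per query."""
    q = np.ascontiguousarray(query_embs, dtype=np.float32)
    faiss.normalize_L2(q)
    return index.search(q, k)                 # cosine-similarity ranking
```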
4. Two-Stage Retrieval and Re-Ranking
Coarse Recall (Guo et al., 23 Dec 2025)
- Precompute $f_{\text{img}}(d)$ for all gallery images and store the vectors in a FAISS index (0.07 s per image to encode, roughly 4 KB per image to store; cf. Section 6).
- For a query $q$, encode $f_{\text{q}}(q)$ and retrieve the top-100 candidates by dot-product similarity.
Fine Re-Ranking
- For the top-100 candidates, apply cross-modal interaction models to compute a refined score $s(q, d)$.
- Models: BLIP-ITM and Pix2Struct+ITM with additional cross-attention, trained with hard-negative mining (top-10 hard negatives; a mix of pointwise and pairwise losses, sketched after the pseudocode below).
- Re-ranking takes about 0.2 s per query for 100 candidate images (processed in parallel).
Retrieval pseudocode:
```python
# Two-stage retrieval: FAISS coarse recall followed by cross-attention re-ranking.
results = {}
for q in test_queries:
    candidates = recall_top100(q)                       # stage 1: top-100 by dot product
    scores = {d: CrossAttentionITM(f_q(q), f_img(d))    # stage 2: ITM re-scoring
              for d in candidates}
    results[q] = sort_by_score(candidates, scores)      # rank by refined score
```
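The mixed pointwise/pairwise objective with top-10 hard negatives mentioned above could look roughly like the following; the margin value, loss weighting, and ITM-head interface are illustrative assumptions rather than the published training recipe:

```python
import torch
import torch.nn.functional as F

def reranker_loss(pos_logit: torch.Tensor,
                  neg_logits: torch.Tensor,
                  margin: float = 0.2,
                  pair_weight: float = 0.5) -> torch.Tensor:
    """Mix a pointwise ITM term with a pairwise ranking term.

    pos_logit:  (B,)   matching logit for the ground-truth image.
    neg_logits: (B, K) logits for the top-K mined hard negatives (K=10 here).
    """
    # Pointwise: binary cross-entropy, positive vs. hard negatives.
    pointwise = (F.binary_cross_entropy_with_logits(pos_logit, torch.ones_like(pos_logit))
                 + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits)))
    # Pairwise: the positive should beat every hard negative by a margin.
    pairwise = F.relu(margin - (pos_logit.unsqueeze(1) - neg_logits)).mean()
    return pointwise + pair_weight * pairwise
```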
5. Evaluation Metrics and Experimental Comparisons
Metrics
- Recall@$K$: the fraction of queries whose ground-truth document appears among the top-$K$ retrieved results.
- MRR@$K$: $\frac{1}{|\mathcal{Q}|}\sum_{i}\frac{1}{\mathrm{rank}_i}$, with the reciprocal rank truncated to zero when $\mathrm{rank}_i > K$.
- mAP and NDCG@$K$ (optional; standard IR metrics) (Guo et al., 23 Dec 2025, Osmulski et al., 16 May 2025).
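Because each query has a single relevant document in both benchmarks, these metrics reduce to simple functions of the ground-truth rank. A minimal sketch, assuming 1-based ranks and binary relevance:

```python
import math

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth document is ranked within top-k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks, k):
    """Mean reciprocal rank, truncated to zero beyond rank k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    """Binary-relevance NDCG with a single relevant document (IDCG = 1)."""
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)

# Example: ground-truth ranks for four queries.
ranks = [1, 3, 12, 2]
print(recall_at_k(ranks, 10), mrr_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```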
Results on NL-DIR (Guo et al., 23 Dec 2025)
- Zero-shot: SigLIP-So400m reaches Recall@1 = 36.17 and Recall@10 = 61.18 with MRR@10 = 43.78; InternVL-14B yields lower recall, and VDU models have near-zero recall.
- Fine-tuning: SigLIP-Text-LoRA reaches Recall@1 = 54.3 and Recall@10 = 79.4; SigLIP-Image-LoRA reaches Recall@1 = 69.3 and Recall@10 = 89.7, with the best reported recall reaching 97.52.
- Re-ranking: zero-shot ColPali reaches Recall@1 = 79.6 and Recall@10 = 91.6 with MRR = 83.8; a fine-tuned BLIP-ITM re-ranker matches ColPali with fourfold lower storage.
- Subset analysis: concrete queries yield higher recall (86.2, 97.2, and 99.4 at increasing cutoffs after re-ranking), while abstract queries are retrieved less effectively.
Multilingual Gap in MIRACL-VISION (Osmulski et al., 16 May 2025)
- Best text embedding: NDCG@10=0.7964.
- Best VLM: NDCG@10=0.5283, roughly 34% lower. The text-versus-vision gap is 12.1% on English-only evaluation and reaches 59.7% when averaged across languages.
- VLM performance degrades sharply in non-Latin scripts and resource-poor languages.
| Language | Text NDCG@10 | Vision NDCG@10 | Drop (%) |
|---|---|---|---|
| Arabic | 0.8883 | 0.4888 | 45.0 |
| Hindi | 0.7581 | 0.3127 | 58.8 |
| English | 0.7348 | 0.6784 | 7.7 |
| Telugu | 0.9090 | 0.0893 | 90.2 |
6. Efficiency and Practical Considerations
- Offline encoding: SigLIP 0.07s/image, OCR-IR (Tesseract+BGE) 2.56s, DSE 0.62s, ColPali 0.65s per image (Guo et al., 23 Dec 2025).
- Storage: SigLIP 4 KB/image; BGE 3 KB; DSE 6 KB; ColPali 256 KB (a back-of-envelope corpus total is sketched after this list).
- Online query/rerank: Dot-product selection is sub-ms; reranking 0.2 s for 100 images.
- MIRACL-VISION prunes easy negatives via dense text-embedding ranking to keep evaluation tractable yet challenging (Osmulski et al., 16 May 2025).
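To put the per-image storage figures above in context, a back-of-envelope calculation for the 41,795-image NL-DIR gallery (illustrative arithmetic only):

```python
# Approximate total index sizes for the 41,795-image NL-DIR gallery,
# using the per-image storage figures reported above.
GALLERY = 41_795
per_image_kb = {"SigLIP": 4, "BGE (OCR text)": 3, "DSE": 6, "ColPali": 256}

for name, kb in per_image_kb.items():
    total_mb = GALLERY * kb / 1024
    print(f"{name:>14}: {total_mb:8.1f} MB")   # SigLIP ~163 MB vs ColPali ~10,449 MB
```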
7. Challenges, Insights, and Future Directions
Key Technical Obstacles
- Fine-grained semantics require models robust to domain shift from natural images to complex layouts, error-prone OCR, and visually rich content.
- Multilingual VLMs lag text-only models, with retrieval efficiency and semantic alignment suffering in non-Latin scripts due to training data gaps and OCR limitations.
Observations
- Contrastive VLMs pretrained on image–OCR pairs (e.g., SigLIP) excel in the zero-shot setting; cross-attention layers in the re-ranker crucially improve fine-grained matching (Guo et al., 23 Dec 2025).
- OCR-free VDU models fare comparatively better when visual information dominates (e.g., charts, handwriting, cartoons).
Recommendations (Osmulski et al., 16 May 2025)
- Build benchmarks from human-annotated queries, ensuring linguistic and script diversity; couple image and text versions for cross-modal gap analysis.
- Incorporate hard-negative mining to maintain retrieval challenge and tractability.
- Improve VLM accuracy via domain-adaptive multilingual fine-tuning, multilingual page image augmentation, OCR-aware vision encoder training, and hybrid fusion of visual/text embeddings.
Future Research Directions (Guo et al., 23 Dec 2025)
- Develop specialized, lightweight cross-modal alignment architectures tuned for document genres.
- Integrate generative VDU and cross-attention in unified rerankers.
- Extend to multi-page retrieval and paragraph-level localization.
- Scale datasets and pre-training on document-centric corpora to bridge existing domain gaps.
- This suggests that addressing semantic and script diversity, while reducing reliance on synthetic LLM-generated queries, is essential for robust, real-world NL-DIR deployment.
Summary
Natural Language-based Document Image Retrieval represents a significant advancement for document access via expressive natural language, characterized by challenging datasets, precise evaluation protocols, and sophisticated cross-modal architectures. The NL-DIR and MIRACL-VISION benchmarks provide comprehensive testbeds, reflecting real-world complexity and multilingual reach. Experimental and efficiency analyses reveal contrastive VLMs and two-stage retrieval pipelines as current state-of-the-art, with clear opportunities for progress in multilingual modeling, fine-grained semantics, and scalable architecture design (Guo et al., 23 Dec 2025, Osmulski et al., 16 May 2025).