- The paper presents MMDocIR, a benchmark that defines dual retrieval tasks to overcome quality and granularity limitations in multi-modal document retrieval.
- It shows that visual-driven retrievers, especially using VLMs, outperform text-based methods in both page-level and layout-level tasks.
- Annotation quality is high (F1 scores of 95.2% for page-level and 87.1% for layout-level labels), and the experiments underscore the practical benefits of prioritizing visual information in retrieval.
The paper introduces the Multi-Modal Document Information Retrieval (MMDocIR) benchmark for evaluating multi-modal document retrieval systems. The authors highlight the limitations of existing benchmarks in terms of question quality, document quality, and retrieval granularity. To address these limitations, MMDocIR is structured around two tasks: page-level retrieval and layout-level retrieval. The page-level retrieval task aims to identify the most relevant pages within a document in response to a user query, while the layout-level retrieval task focuses on retrieving specific layouts, such as paragraphs, equations, figures, tables, and charts.
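To make the two task definitions concrete, they can be written with the notation listed at the end of this summary ($P$, $L$, $Q$, and the similarity function $\mathrm{Sim}$); the arg-max form below is a standard reading of the retrieval objective rather than the paper's verbatim formulation:

```latex
% Page-level retrieval: return the page(s) of P most similar to the query Q
p^{*} = \arg\max_{p_i \in P} \mathrm{Sim}(Q, p_i)

% Layout-level retrieval: return the layout(s) of L most similar to Q
l^{*} = \arg\max_{l_i \in L} \mathrm{Sim}(Q, l_i)
```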
The MMDocIR benchmark includes an evaluation set comprising 313 documents with expert-annotated labels for 1,658 question-answer (QA) pairs, and a training set comprising 6,878 documents and labels for 73,843 QA pairs. The evaluation set is derived from the MMLongBench-Doc and DocBench datasets, with questions filtered and revised to suit document retrieval tasks. The annotation process involves page-level and layout-level labeling, with rigorous quality control measures in place, achieving an F1 score of 95.2% for page-level annotations and 87.1% for layout-level annotations.
The authors conduct experiments to evaluate existing multi-modal document retrieval baselines, which are categorized into visual-driven and text-driven retrievers. Visual-driven retrievers use vision-language models (VLMs) to generate embeddings for queries and documents, while text-driven retrievers rely on optical character recognition (OCR) or VLMs to convert multi-modal content into text before employing language models for retrieval. The experimental results demonstrate that visual-driven retrievers outperform text-driven retrievers. The authors also train two visual retrievers, DPR-Phi3 and Col-Phi3, based on Phi3-Vision, on the MMDocIR training set and evaluate their effectiveness.
The methodology involves an offline indexing phase, where each page and layout is transformed into a vector representation, and an online querying phase, where a query is converted into a vector and compared against the indexed vectors using similarity scores. The query-document similarity is computed with cosine similarity for DPR-Phi3 and with a late-interaction MaxSim score for Col-Phi3, which sums, over query tokens, the maximum dot product against document token embeddings.
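As a concrete illustration of this indexing-and-querying pipeline, the NumPy sketch below contrasts the two scoring schemes: single-vector cosine similarity (DPR-style) and token-level MaxSim (ColBERT-style). The random arrays stand in for the actual Phi3-Vision embeddings; all names and dimensions are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the two scoring schemes, using NumPy.
# Random "embeddings" stand in for encoder outputs (Phi3-Vision / projection);
# shapes and names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def dpr_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """DPR-style score: cosine similarity between single-vector embeddings."""
    return float(np.dot(query_vec, page_vec) /
                 (np.linalg.norm(query_vec) * np.linalg.norm(page_vec)))

def col_score(query_toks: np.ndarray, page_toks: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the maximum
    dot product over document tokens, then sum over query tokens (MaxSim)."""
    sim_matrix = query_toks @ page_toks.T          # (N_q, N_d) dot products
    return float(sim_matrix.max(axis=1).sum())

# --- Offline indexing: one vector (DPR) or a token matrix (ColBERT) per page ---
num_pages, dim, n_doc_tokens = 5, 128, 32
dpr_index = [rng.standard_normal(dim) for _ in range(num_pages)]
col_index = [rng.standard_normal((n_doc_tokens, dim)) for _ in range(num_pages)]

# --- Online querying: embed the query, score every indexed page, rank ---
q_vec = rng.standard_normal(dim)          # single query vector (DPR)
q_toks = rng.standard_normal((8, dim))    # N_q query token vectors (ColBERT)

dpr_ranking = sorted(range(num_pages),
                     key=lambda i: dpr_score(q_vec, dpr_index[i]), reverse=True)
col_ranking = sorted(range(num_pages),
                     key=lambda i: col_score(q_toks, col_index[i]), reverse=True)
print("DPR ranking:", dpr_ranking)
print("ColBERT ranking:", col_ranking)
```

The trade-off noted in the findings below follows directly from the shapes involved: the multi-vector index stores an (N_d x dim) matrix per page instead of a single vector, which is what drives the higher storage overhead.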
Key findings from the experiments include:
- Visual retrievers outperform text retrievers in page-level retrieval, highlighting the importance of visual elements.
- VLM-text approaches, while underperforming visual retrievers, perform better than OCR-text methods.
- Token-level (multi-vector) retrievers achieve better Recall@1 than their page-level (single-vector) counterparts but incur higher storage overhead.
- In layout-level retrieval, visual retrievers show performance advantages over text retrievers using OCR-text.
- VLM-text approaches achieve comparable performance to visual retrievers in layout-level retrieval.
- Hybrid image-text sequences in visual retrievers perform less effectively than pure image sequences.
The authors also analyze the differences between OCR and VLM text, noting that VLM-text is longer and more comprehensive, although it comes with higher computational overhead.
In summary, the paper presents the MMDocIR benchmark as a resource for advancing multi-modal document retrieval, with a dual-task retrieval framework and a comprehensive evaluation of existing retrieval systems. The results emphasize the importance of visual information in multi-modal document retrieval and highlight the potential benefits of using VLMs.
Variables used in the LaTeX formulas:
- $D$: Document corpora
- $P$: Set of document pages, $P = \{p_1, p_2, \ldots, p_n\}$
- $p_i$: Individual page
- $n$: Total number of pages
- $L$: Set of layouts, $L = \{l_1, l_2, \ldots, l_m\}$
- $l_i$: Individual layout
- $m$: Total number of layouts
- $Q$: Query
- $\mathrm{Sim}(Q, p)$: Similarity score between query $Q$ and page $p$
- $\mathrm{Sim}(Q, l)$: Similarity score between query $Q$ and layout $l$
- $\mathbf{E}_{d}^{\mathrm{dpr}}$: DPR embedding of the document
- $\mathbf{E}_{q}^{\mathrm{dpr}}$: DPR embedding of the query
- $\mathbf{M}_{\mathrm{phi3v}}$: Phi3-Vision model
- $\mathbf{M}_{\mathrm{vit}}$: ViT model
- $d$: Document
- $q$: Query
- $D_1$: Dimension of the last hidden state of $\mathbf{M}_{\mathrm{phi3v}}$
- $\mathbf{E}_{d}^{\mathrm{col}}$: ColBERT embedding of the document
- $\mathbf{E}_{q}^{\mathrm{col}}$: ColBERT embedding of the query
- $\mathbf{M}_{\mathrm{proj}}$: Projection layer
- $D_2$: Reduced dimension after projection
- $N_d$: Number of document tokens
- $N_q$: Number of query tokens
- $\mathrm{Sim}(q, d)_{\mathrm{dpr}}$: Similarity between query $q$ and document $d$ using DPR
- $\langle \cdot \mid \cdot \rangle$: Dot product
- $\lVert \cdot \rVert$: Norm
- $\mathrm{Sim}(q, d)_{\mathrm{col}}$: Similarity between query $q$ and document $d$ using ColBERT
- $\mathbf{E}_{q}^{\mathrm{col}\,(i)}$: $i$-th query vector
- $\mathbf{E}_{d}^{\mathrm{col}\,(j)}$: $j$-th document embedding vector
- $d^{+}$: Positive document
- $d^{-}$: Negative document
- $\mathcal{L}(q, d^{+}, d^{-})_{\mathrm{dpr}}$: Loss for DPR-Phi3
- $\tau$: Temperature parameter
- $\mathcal{L}(q, d^{+}, d^{-})_{\mathrm{col}}$: Loss for Col-Phi3
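Putting these variables together, the scoring and training objectives follow the standard DPR and ColBERT formulations; the equations below are reconstructions consistent with the definitions above (the contrastive loss in particular is an assumed InfoNCE-style form), not quotations from the paper.

```latex
% DPR-style scoring: cosine similarity between single-vector embeddings
\mathrm{Sim}(q, d)_{\mathrm{dpr}}
  = \frac{\big\langle \mathbf{E}_{q}^{\mathrm{dpr}} \mid \mathbf{E}_{d}^{\mathrm{dpr}} \big\rangle}
         {\lVert \mathbf{E}_{q}^{\mathrm{dpr}} \rVert \, \lVert \mathbf{E}_{d}^{\mathrm{dpr}} \rVert}

% ColBERT-style late interaction: sum over query tokens of the maximum dot product
\mathrm{Sim}(q, d)_{\mathrm{col}}
  = \sum_{i=1}^{N_q} \max_{1 \le j \le N_d}
    \big\langle \mathbf{E}_{q}^{\mathrm{col}\,(i)} \mid \mathbf{E}_{d}^{\mathrm{col}\,(j)} \big\rangle

% Assumed contrastive (InfoNCE-style) loss with temperature \tau; the Col-Phi3
% loss takes the same form with Sim(q,d)_col in place of Sim(q,d)_dpr
\mathcal{L}(q, d^{+}, d^{-})_{\mathrm{dpr}}
  = -\log \frac{\exp\!\big(\mathrm{Sim}(q, d^{+})_{\mathrm{dpr}} / \tau\big)}
               {\exp\!\big(\mathrm{Sim}(q, d^{+})_{\mathrm{dpr}} / \tau\big)
                + \sum\limits_{d^{-}} \exp\!\big(\mathrm{Sim}(q, d^{-})_{\mathrm{dpr}} / \tau\big)}
```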