MMLongBench-Doc Benchmark
- MMLongBench-Doc is a specialized benchmark for long-context, multi-modal document understanding that stresses LVLMs with lengthy, information-dense PDFs.
- It comprises 130 curated documents with expert-annotated, cross-page queries, balancing single-page, cross-page, and unanswerable challenges.
- Empirical results reveal key gaps in evidence localization and hallucination resistance, underscoring the need for improved multi-modal reasoning methods.
MMLongBench-Doc is a specialized benchmark designed to rigorously evaluate the long-context, multi-modal document understanding capabilities of large vision-language models (LVLMs). Unlike prior document understanding datasets, which are typically limited to short sequences or single pages, MMLongBench-Doc systematically stresses models with lengthy, information-dense documents composed of complex layouts and diverse modalities—including text, tables, charts, and images. The benchmark provides a unified, high-density evaluation suite that exposes fundamental bottlenecks in evidence localization, cross-page reasoning, and hallucination resistance for contemporary LVLMs (Ma et al., 2024).
1. Benchmark Motivation and Distinguishing Features
MMLongBench-Doc was created to address deficiencies in existing document understanding (DU) benchmarks, which predominantly measure performance on single-page or short multi-page documents using simple, localized queries. Real-world documents, by contrast, often span dozens of pages and require the integration of evidence from disparate sources and non-contiguous formats. Key challenges targeted by the benchmark include:
- Localization: Efficiently retrieving relevant information from among thousands of tokens and mixed visual components.
- Cross-page Reasoning: Integrating information spread across multiple, often distant, pages to synthesize correct answers.
- Hallucination Resistance: Distinguishing answerable from unanswerable queries, particularly for questions where required evidence is absent.
Prior benchmarks (e.g., DocVQA, ChartQA, InfoVQA, TAT-DQA, DUDE, SlideVQA, MP-DocVQA) either focus on single-page tasks or fail to provide dense, multi-modal, cross-page scenarios. MMLongBench-Doc overcomes these limitations by explicitly constructing long, high-density PDF contexts, curating cross-page and multi-modal queries, and incorporating a substantial proportion of unanswerable questions for hallucination detection (Ma et al., 2024).
2. Dataset Design and Construction
Document and Question Composition
MMLongBench-Doc contains 130 curated PDF documents spanning seven domains (research reports, financial reports, academic papers, brochures, guidelines, administrative/industrial files, tutorials/workshops), with an average length of 49.4 pages and 20,971.9 tokens per document. Source material includes both publicly available datasets (DUDE, SlideVQA, ChartQA, FinanceBench) and newly acquired long-form materials such as arXiv papers and manuals.
A total of 1,062 expert-annotated questions were crafted following an intensive multi-stage review process:
- Retain/Revise/Remove editing for quality, non-triviality, and document dependence, addressing issues of ambiguity, document irrelevance, and answer redundancy.
- Evidence and answer metadata were systematically recorded, specifying for each question: the answer (string, int, float, or list), the evidence modality (text, layout, table, chart, image), and the supporting page indices (see the sketch below).
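The recorded metadata can be pictured as a simple per-question record. The field names and format labels below are illustrative assumptions rather than the benchmark's actual schema; only the attributes themselves (answer, answer type, evidence modalities, supporting pages) come from the description above.

```python
# Illustrative per-question annotation record; field names and format labels
# are hypothetical, but the recorded attributes follow the description above.
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class QuestionRecord:
    question: str
    answer: Optional[Union[str, int, float, List[str]]]  # None for unanswerable questions
    answer_format: str                                    # e.g. "Str", "Int", "Float", "List", "None"
    evidence_modalities: List[str] = field(default_factory=list)  # e.g. ["Table", "Chart"]
    evidence_pages: List[int] = field(default_factory=list)       # 1-based page indices
```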
The question set is intentionally balanced:
- Single-page: 467 (44.0%)
- Cross-page: 353 (33.2%)
- Unanswerable: 242 (22.8%)
Evidence distribution is broad: text (36.1%), layout (14.5%), table (25.9%), chart (20.5%), and image (34.4%). These percentages are computed over the 820 answerable questions and sum to more than 100% because a single question can draw on multiple modalities (Ma et al., 2024).
Example Table: Dataset Breakdown
| Subset | Count (%) | Notes |
|---|---|---|
| Single-page | 467 (44.0%) | Localized evidence |
| Cross-page | 353 (33.2%) | Scattered across ≥2 pages |
| Unanswerable | 242 (22.8%) | Critical for hallucination quantification |
| Text evidence | 296 (36.1%) | Includes pure text, sometimes with offsets |
| Table evidence | 212 (25.9%) | Structured tables |
| Chart evidence | 168 (20.5%) | Figures, plots, charts |
| Image evidence | 282 (34.4%) | Natural or synthetic visual content |
3. Annotation Process and Evidence Representation
The annotation protocol combines expert human judgment with LLM-assisted quality control:
- Annotator Training: Ten PhD-level experts are instructed to balance evidence modalities and page coverage and to avoid trivial or easily bypassed queries.
- Iterative Quality Control: three layers are applied:
- Document-relevance filtering by GPT-4o,
- LLM-based “reflection” with disputes arbitrated by annotators,
- Cross-annotator adjudication and review.
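A compressed view of this three-layer flow is sketched below. All helper callables (`is_relevant`, `llm_answer`, `annotators`) are hypothetical stand-ins for the GPT-4o prompts and human review steps; the actual protocol is described only at the level of the list above.

```python
# Hypothetical sketch of the three-layer quality-control flow. The callables
# passed in stand in for GPT-4o prompting and human adjudication steps.
def quality_control(record, document, is_relevant, llm_answer, annotators) -> str:
    # Layer 1: document-relevance filtering (performed by GPT-4o in the benchmark).
    if not is_relevant(record.question, document):
        return "remove"
    # Layer 2: LLM "reflection" answers the question independently; any
    # disagreement with the annotated answer is arbitrated by annotators.
    reflected = llm_answer(record.question, document)
    if reflected != record.answer and not annotators.arbitrate(record, reflected):
        return "revise"
    # Layer 3: cross-annotator adjudication and final review.
    return "retain" if annotators.cross_review(record) else "revise"
```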
Evidence spans are annotated primarily at the page level, with text spans occasionally labeled at the offset or line-index level. For layout-dependent queries, annotators tag whether layout structure (e.g., heading sequence) is necessary. Visual elements—charts, tables, figures—are referenced via screenshots and page indices.
A typical cross-modal, cross-page query: “According to Chart 5.1 on pages 12 and 13, which two regions saw the largest absolute increase in population between 2010 and 2020?”—which demands integrating information from multiple visual sources and textual contexts (Ma et al., 2024).
4. Evaluation Protocol and Metric Suite
A three-stage evaluation pipeline is employed:
- Free-Form Generation: Candidate answers are generated by the LVLM under test.
- LLM-Based Answer Extraction: Standardized GPT-4o prompting is used to extract and normalize answers for robust comparison.
- Rule-Based Scoring: Metrics are adapted to the answer format (short string: SubEM or ROUGE-F1; integer/float: exact or relative match; list: Greedy List Match); see the sketch below.
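The following is a minimal sketch of the format-adaptive, rule-based scoring step. The helper names and the 1% relative tolerance are assumptions, not the benchmark's exact settings, and the official scorer additionally falls back to ROUGE-F1 for longer strings.

```python
# Minimal sketch of format-adaptive rule-based scoring (helper names and the
# relative tolerance are assumptions; ROUGE-F1 for long strings is omitted).
from typing import List, Union

def sub_em(pred: str, gold: str) -> float:
    """Substring exact match: 1.0 if the gold string appears in the prediction."""
    return float(gold.strip().lower() in pred.strip().lower())

def relative_match(pred: float, gold: float, rel_tol: float = 0.01) -> float:
    """Relative-error match for floats (tolerance value is an assumption)."""
    if gold == 0:
        return float(abs(pred) <= rel_tol)
    return float(abs(pred - gold) / abs(gold) <= rel_tol)

def greedy_list_match(pred: List[str], gold: List[str]) -> float:
    """Greedily align predicted items to gold items and return item-level F1."""
    matched, remaining = 0, list(gold)
    for p in pred:
        for g in remaining:
            if sub_em(p, g):
                matched += 1
                remaining.remove(g)
                break
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def score(pred, gold: Union[str, int, float, List[str]]) -> float:
    """Dispatch on the annotated answer format."""
    if isinstance(gold, list):
        return greedy_list_match(pred, gold)
    if isinstance(gold, bool):                     # guard: bool is a subclass of int
        return float(pred == gold)
    if isinstance(gold, int):
        return float(float(pred) == gold)          # exact match for integers
    if isinstance(gold, float):
        return relative_match(float(pred), gold)   # tolerance match for floats
    return sub_em(str(pred), str(gold))            # SubEM for short strings
```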
Inputs for LVLMs comprise page-level screenshots (120 DPI PNGs), truncated according to each model's limits (up to 120 images for proprietary models; ≤5 for open-source models). Text-only LLMs are evaluated on Tesseract OCR–parsed text truncated to the context-window bound.
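A sketch of this input preparation under the stated setup (120 DPI PNG screenshots, per-model image caps) is shown below. PyMuPDF is an assumed rendering tool, not necessarily what the benchmark authors used.

```python
# Sketch of input preparation: render pages to 120 DPI PNG screenshots and
# truncate to the model's image budget. PyMuPDF is an assumed choice here.
import fitz  # PyMuPDF

def render_pages(pdf_path: str, max_images: int, dpi: int = 120) -> list:
    """Render pages to PNG screenshots, truncated to the model's image budget."""
    doc = fitz.open(pdf_path)
    paths = []
    for i, page in enumerate(doc):
        if i >= max_images:
            break  # stop once the model's image limit is reached
        pix = page.get_pixmap(dpi=dpi)
        out_path = f"page_{i:03d}.png"
        pix.save(out_path)
        paths.append(out_path)
    return paths

# Proprietary LVLMs: up to 120 page images; many open-source LVLMs: at most 5.
screenshots = render_pages("report.pdf", max_images=120)
```

For the text-only baseline, the same screenshots can be passed through an OCR engine such as Tesseract (e.g., pytesseract.image_to_string) and the resulting text truncated to the model's context window.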
5. Empirical Results and Diagnostic Analyses
MMLongBench-Doc exposes severe performance and reliability gaps in current LVLMs:
- Overall F1 Performance (best at 128k context window):
- GPT-4o: 42.7%
- GPT-4V: 31.4%
- Gemini-1.5-Pro: 20.9%
- Claude-3 Opus: 18.6%
- Leading open-source LVLMs: <15%
- Mixtral-8×22B (OCR+LLM): 25.0% (best open-source LLM pipeline)
- Disaggregated Scores (GPT-4o):
- Single-page: 54.0%
- Cross-page: 37.5%
- Unanswerable: 19.8%
A critical finding is that many LVLMs underperform simple OCR+LLM pipelines, particularly on queries whose evidence is textual or tabular; a plausible implication is that LVLMs' multi-modal reasoning remains fragile and that OCR integration is a rate-limiting step.
Oracle-Page Study
When only the annotated “oracle” evidence pages are supplied, F1 for representative LVLMs rises by 10–30 points, affirming that evidence localization—not just reasoning or perception—is rate-limiting.
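In code, the oracle setting amounts to restricting the model's input to the annotated evidence pages before scoring. The sketch below reuses the hypothetical `render_pages`, `score`, and `QuestionRecord` helpers from the earlier sketches, and `model.answer` is likewise an assumed interface.

```python
# Oracle-page evaluation sketch: supply only the annotated evidence pages and
# score as usual. All helpers referenced here are the hypothetical ones above.
def evaluate_oracle(model, record, pdf_path: str) -> float:
    pages = render_pages(pdf_path, max_images=10_000)           # render all pages
    oracle = [pages[p - 1] for p in record.evidence_pages]      # keep evidence pages only
    prediction = model.answer(record.question, images=oracle)   # hypothetical model API
    return score(prediction, record.answer)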
Error Analysis
For GPT-4o, leading error categories include:
- Hallucination (27%)
- Perceptual errors (22%)
- Incomplete evidence extraction (18%)
- Reasoning/logic (12%)
- Knowledge gaps (10%)
- Irrelevance (11%)
Perceptual and hallucination failures are especially pronounced on cross-page and visual queries.
6. Modeling Recommendations and Research Directions
The results of MMLongBench-Doc motivate several directions:
- Hierarchical or Sparse Attention: Needed to enable efficient access across tens or hundreds of pages (Ma et al., 2024).
- Evidence Pre-selection: Incorporating learned or retrieval-based evidence selectors before reasoning could address the localization bottleneck (see the retrieval sketch after this list).
- Hallucination Control: Unanswerability detection, faithful generation objectives, and adjudication pipelines (e.g., as in DocLens (Zhu et al., 14 Nov 2025)) offer avenues to reduce false-positive rates.
- Multi-modal Pre-training: Pre-training on large, multi-page, multi-modal corpora is likely a prerequisite for robust performance.
- Structural/Visual Parsing Integration: Augmenting models with document parsers—table extractors, chart/figure analyzers—could address complex layout and visual reasoning deficiencies.
- Fine-grained Oracle Analysis: Studying upper bounds and breakdowns by evidence modality may help isolate failure modes.
- Open-Sourcing: As an openly released, well-annotated benchmark, MMLongBench-Doc provides an accessible baseline for comparative evaluation and ablation studies in vision-language modeling.
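As a concrete (and deliberately simple) illustration of the evidence pre-selection idea above, the following sketch ranks pages by TF-IDF similarity to the query and keeps the top k before multi-modal reasoning. TF-IDF over OCR text is one assumed choice among many; learned dense retrievers would follow the same pattern.

```python
# Hedged sketch of a retrieval-based evidence pre-selector: rank pages by
# TF-IDF similarity to the question and keep the top-k pages.
from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_evidence_pages(question: str, page_texts: List[str], k: int = 5) -> List[int]:
    """Return indices of the k pages whose OCR text best matches the question."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(page_texts + [question])   # pages plus the query
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()    # query vs. each page
    return sims.argsort()[::-1][:k].tolist()
```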
7. Benchmark Significance and Impact
MMLongBench-Doc defines the state of the art for benchmarking long-context, multi-modal document understanding. By offering high-density, evidence-balanced queries over long, richly formatted PDFs, it constitutes a critical testbed for models addressing realistic enterprise, research, and industrial use cases. The benchmark’s detailed annotation, rigorous evaluation, and transparent error analyses provide actionable diagnostics for model improvement. Results demonstrate that even top-tier LVLMs have substantial ground to cover—particularly in evidence localization, cross-modal integration, and hallucination suppression—before they can be reliably deployed in demanding document processing scenarios (Ma et al., 2024, Zhu et al., 14 Nov 2025).