Long-Document Retrieval Overview
- Long-document retrieval is a specialized IR task targeting lengthy texts in which relevant information is dispersed and non-contiguous and internal structure is complex.
- Recent methods use passage aggregation, hierarchical encoding, and efficient attention mechanisms to effectively capture both local and global document features.
- LLM-driven techniques enhance ranking and relevance scoring while addressing challenges in efficiency, faithfulness, and multimodal integration.
Long-document retrieval (LDR) refers to the task in information retrieval (IR) where candidate documents are sufficiently long—often thousands to tens of thousands of tokens—that evidence is dispersed, relevant passages are embedded deep in the text, and document structure is complex. Compared to passage-level or short-document retrieval, LDR presents distinct algorithmic, efficiency, and evaluation challenges. The most recent survey systematizes LDR methods in the context of pre-trained language models (PLMs) and LLMs, highlighting the era's trends, model architectures, and ongoing challenges (Li et al., 9 Sep 2025).
1. Problem Definition and Distinguishing Challenges
Long-document retrieval is defined by several unique properties distinguishing it from classic passage retrieval:
- Evidence Dispersion: Relevant information is not localized to a single contiguous span but distributed throughout the document.
- Complex Internal Structure: Documents may have hierarchical, multimodal, or cross-referenced content that standard sequence encodings do not capture.
- Length-related Efficiency: Neural approaches that work efficiently on passages (hundreds of tokens) often face prohibitive compute/memory costs for whole documents.
- Modeling Gaps: Many retrievers collapse long documents to single or fixed-size embeddings, losing granularity and positional details critical for relevance.
- Faithfulness and Interpretability: Summaries or answers based on retrieval require span-based attribution, which is challenging as document length increases.
Typical applications include legal contracts, scientific papers, clinical reports, patents, and multilingual/multimodal organizational documents, all of which display high evidence dispersion and demand both precision and coverage in retrieval (Li et al., 9 Sep 2025).
2. Taxonomy of Methods: Classical to LLM-driven Paradigms
The evolutionary trajectory of LDR methods is organized into several core paradigms:
| Paradigm | Core Idea | Typical Equation/Mechanism |
|---|---|---|
| Passage Aggregation | Chunk documents, aggregate scores | MaxP: $S(q,d)=\max_i s(q,p_i)$; SumP: $S(q,d)=\sum_i s(q,p_i)$ |
| Hierarchical Encoding | Multi-level aggregation, attention | Encode $(q, p_i)$, aggregate via attention or transformer layers |
| Efficient Attention | Scale self-attention to long input | Local/global/blockwise/LSH attention (e.g., Longformer, BigBird) |
| LLM-driven Retrieval | Use/orchestrate LLMs for ranking | Prompt-based listwise ranking, fine-tuned LLM retrievers, hybrid cascades |
Passage aggregation methods (e.g., BERT-MaxP/SumP) segment documents into passages, score each separately, and aggregate. Hierarchical models (e.g., PARADE, DRSCM) learn to jointly synthesize local evidence and inter-passage dependencies, using neural attention or additional transformer layers.
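The MaxP/SumP aggregation pattern can be sketched as follows. This is an illustrative toy, not the actual BERT-based implementation: the `score` function here is a simple term-overlap stand-in for a neural relevance model, and the chunking parameters are arbitrary.

```python
def chunk(tokens, size=4, stride=2):
    """Split a token list into overlapping passages."""
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - size, 0) + 1, stride)]

def score(query, passage):
    """Toy relevance score: fraction of query terms present in the passage.
    A real system would use a neural cross-encoder here."""
    q = set(query)
    return len(q & set(passage)) / len(q)

def rank_document(query, doc_tokens, mode="max"):
    """Score each passage independently, then aggregate:
    MaxP takes the best passage, SumP sums evidence across passages."""
    scores = [score(query, p) for p in chunk(doc_tokens)]
    return max(scores) if mode == "max" else sum(scores)

doc = "federated learning protects privacy by keeping data on device".split()
q = "privacy preserving learning".split()
```

MaxP rewards a single strongly relevant passage, while SumP accumulates dispersed evidence—the core trade-off the paradigm exposes.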
Efficient attention mechanisms (e.g., Longformer, BigBird) tackle the quadratic cost by adopting sparse or sliding-window attention, allowing the encoder to process thousands of tokens with linear or blockwise scaling.
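The sparsity these mechanisms exploit is easy to visualize as an attention mask. The sketch below builds a Longformer-style boolean mask (local sliding window plus a few global tokens); window radius and the choice of global tokens are illustrative assumptions, not values from any specific model.

```python
import numpy as np

def sliding_window_mask(n, w, global_tokens=()):
    """Boolean attention mask: each token attends to a local window of
    radius w, plus designated global tokens (Longformer-style)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - w):min(n, i + w + 1)] = True
    for g in global_tokens:
        mask[g, :] = True   # global token attends everywhere
        mask[:, g] = True   # every token attends to the global token
    return mask

# 1000 tokens, window radius 8, one global [CLS]-like token at position 0
m = sliding_window_mask(1000, w=8, global_tokens=(0,))
```

The mask has roughly $n(2w+1)$ true entries instead of $n^2$, which is what lets the encoder scale to thousands of tokens.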
LLM-driven techniques include both instruction-based listwise reranking—where LLMs like GPT-4 generate explicit rankings for a shortlist of candidates—and fine-tuned dense retrievers based on LLaMA or GPT architectures, which can handle longer contexts and produce embeddings or direct scalar scores (Li et al., 9 Sep 2025).
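A minimal sketch of the listwise-reranking plumbing: building the prompt and parsing the model's ranked answer back into passage indices. The prompt wording and the `[2] > [1] > [3]` answer format are illustrative assumptions (loosely RankGPT-style), not a fixed API; the LLM call itself is omitted.

```python
import re

def listwise_prompt(query, passages):
    """Assemble an illustrative listwise-ranking prompt for an LLM."""
    lines = [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    return ("Rank the passages by relevance to the query.\n"
            f"Query: {query}\n" + "\n".join(lines) +
            "\nAnswer with identifiers, e.g.: [2] > [1] > [3]")

def parse_ranking(answer, k):
    """Parse an answer like '[2] > [1] > [3]' into zero-based indices,
    dropping duplicates and out-of-range identifiers."""
    ids = [int(x) - 1 for x in re.findall(r"\[(\d+)\]", answer)]
    seen, order = set(), []
    for i in ids:
        if 0 <= i < k and i not in seen:
            seen.add(i)
            order.append(i)
    return order
```

Defensive parsing matters in practice: LLM outputs can repeat or invent identifiers, so the reranker must degrade gracefully rather than crash.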
3. Critical Modeling Issues and Algorithms
Coverage and Aggregation
- Coverage-based Scoring: Split the query $q$ into sentences $q_1,\dots,q_m$ and the document $d$ into sentences $d_1,\dots,d_n$, and compute pairwise cosine similarities $s_{ij} = \cos(e(q_i), e(d_j))$ over sentence embeddings. Coverage in each direction is the mean best-match similarity:

$$\mathrm{Cov}(q \to d) = \frac{1}{m}\sum_{i=1}^{m}\max_{j} s_{ij}, \qquad \mathrm{Cov}(d \to q) = \frac{1}{n}\sum_{j=1}^{n}\max_{i} s_{ij}.$$

The proportional relevance score combines the two directions, e.g.

$$\mathrm{PRS}(q,d) = \tfrac{1}{2}\big(\mathrm{Cov}(q \to d) + \mathrm{Cov}(d \to q)\big),$$

which promotes retrieval of documents that cover as much of the query as possible, and vice versa (Askari et al., 2023, Li et al., 9 Sep 2025).
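Coverage-based scoring can be sketched in a few lines, assuming sentence embeddings are provided as matrix rows. The symmetric average used to combine the two directions is one plausible choice, not necessarily the exact formulation of the cited work.

```python
import numpy as np

def coverage(sim):
    """Directional coverage: mean over rows of the best-matching column
    (rows = query sentences, columns = document sentences)."""
    return sim.max(axis=1).mean()

def prs(q_emb, d_emb):
    """Proportional relevance sketch combining coverage in both directions.
    q_emb: (m, k) query-sentence embeddings; d_emb: (n, k) doc embeddings."""
    qn = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    dn = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = qn @ dn.T                       # s_ij = cos(q_i, d_j)
    return 0.5 * (coverage(sim) + coverage(sim.T))
```

A document matching only half the query sentences scores below one that covers them all, which is exactly the behavior coverage-based scoring is meant to enforce.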
Hierarchical and Context-Aware Aggregation
- Hierarchical aggregation employs learned weights or attention to synthesize passage representations, as in PARADE (Greaves et al., 2022).
- Discourse-aware retrieval leverages document rhetorical structure or content-aware chunking to preserve semantic units during segmentation, facilitating retrieval of coherent evidence spans (Chen et al., 26 May 2025, Dong et al., 23 Apr 2024).
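A minimal sketch of content-aware chunking in the spirit described above: sentences are packed greedily into chunks under a token budget so that no semantic unit is split across a boundary. The greedy strategy and whitespace token counting are simplifying assumptions; real systems may also respect section headings or rhetorical structure.

```python
def structure_aware_chunks(sentences, max_tokens=64):
    """Greedy content-aware chunking sketch: pack whole sentences into
    chunks without splitting any sentence across a chunk boundary."""
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())              # crude whitespace token count
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Compared with fixed-stride windows, boundary-respecting chunks keep each retrieved unit coherent, which matters for span-level attribution downstream.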
Efficient Long Sequence Modeling
- Efficient self-attention variants reduce complexity from $O(n^2)$ to $O(n \log n)$ or $O(n)$ by imposing windowed, blockwise, or random attention patterns (Longformer, BigBird).
- Monarch Mixer and state-space models provide subquadratic scaling for long context representation (as in M2-BERT for up to 32K tokens) (Saad-Falcon et al., 12 Feb 2024).
4. Domain-Specific Applications and Benchmarks
LDR methods are critical in verticals demanding deep document reasoning:
- Legal retrieval: Emphasizes hierarchical structure, cross-version alignment, and clause matching in statutes and contracts; Lawformer and LLM-driven systems like LawGPT exemplify adaptations (Nguyen et al., 26 Mar 2024, Li et al., 9 Sep 2025).
- Scholarly document search: Requires capturing section headings and scientific discourse (methods, results, conclusions); approaches leverage citation and aspect decomposition (Li et al., 9 Sep 2025).
- Biomedical retrieval: Must process highly technical, long clinical records or scientific articles, often using domain-adapted encoders with structure preservation (Li et al., 9 Sep 2025).
- Cross-lingual and multimodal retrieval: Demands robust multilingual encoders and handling of structural artifacts (figures, tables, images) (Chen et al., 10 Aug 2025).
Benchmarks such as TREC DL, Robust04, MS MARCO Document, QASPER, and domain-specific datasets (e.g., MLDR, VisR-Bench) capture these challenges in experimental settings.
5. Evaluation Resources and Metrics
Traditional IR metrics (nDCG, MAP, Recall, MRR) remain standard, but the survey notes intrinsic limitations in capturing LDR-specific aspects such as:
- Evidence dispersion: Retrieval metrics may not penalize missing a relevant passage buried deep in the document.
- Span-level attribution: Faithfulness requires not only correct ranking but that retrieved spans truly support downstream answers or summaries.
- Multimodality and structure: Evaluating retrieval that must span text, tables, and figures or resolve references is nontrivial.
The survey calls for span-sensitive and structure-aware metrics that better reflect model faithfulness and recall in the LDR context.
6. Open Challenges and Future Research
Key challenges identified for advancing LDR include:
- Efficiency–effectiveness trade-offs: Even blockwise or local self-attention models face scaling bottlenecks when applied to million-document corpora; sparse attention can miss dispersed signals or impair global aggregation.
- Multimodal and cross-lingual alignment: Real-world long documents may be multimodal, requiring integrated text–table–figure reasoning and cross-lingual mapping; few systems robustly handle these settings (Chen et al., 10 Aug 2025).
- Faithfulness and explainability: LLM-driven retrievers may hallucinate or produce ungrounded rationales; models must provide auditable span-level justifications and minimize "hallucination."
- Data scarcity and annotation cost: Large-scale, fine-grained supervised datasets are rare due to annotation costs for long documents; self-supervised, transfer, or domain-adaptive strategies are called for.
- Unified retrieval–generation integration: The field is trending toward hybrid architectures that combine efficient indexing, structure-preserving chunking, and LLM-guided reasoning in iterative retrieval-generation cycles, but cost–benefit tradeoffs remain unresolved.
The survey argues that hybrid methods—integrating structure-aware segmentation, efficient indexing, and LLM-powered reasoning—are likely necessary for robust, scalable LDR (Li et al., 9 Sep 2025). New benchmarks and metrics, as well as stronger domain adaptation techniques and multimodal processing capabilities, remain critical research frontiers.
In summary, long-document retrieval is characterized by challenges of content length, dispersed evidence, and complex or multimodal structure, and has evolved through passage aggregation, hierarchical and efficient attention methods, and recent LLM-driven ranking architectures. Current research continues to address efficiency, structure, faithfulness, and scalability, with hybrid approaches and multimodal alignment seen as key trajectories for the next generation of systems (Li et al., 9 Sep 2025).