Long-Document Retrieval
- Long-document retrieval is the task of extracting and ranking relevant evidence from documents that span thousands of tokens and contain dispersed, hierarchically organized content.
- It employs methods like passage aggregation, hierarchical encoding, and efficient attention (e.g., local self-attention, segment-interaction) to overcome limitations of standard IR models.
- Recent advances focus on dynamic segmentation, multimodal indexing, and unified retrieval-generation pipelines to enhance effectiveness and computational efficiency.
Long-document retrieval (LDR) concerns ad-hoc or question-driven search over corpora of documents whose lengths, internal structure, and evidence dispersion fundamentally challenge standard information retrieval (IR) techniques developed for short texts. LDR encompasses retrieving, ranking, and (in some cases) providing evidence within documents typically spanning thousands to tens of thousands of tokens, often organized via rich, nested structural and semantic signals. Robust LDR solutions are essential across domains such as scientific literature, legal search, web-scale QA, enterprise document management, and multimodal archives.
1. Fundamental Challenges and Problem Scope
LDR departs from standard IR both in problem complexity and solution constraints. Key challenges include:
- Length explosion: As document length grows to thousands of tokens, bag-of-words models suffer from severe signal dilution and drift. Classical TF–IDF and BM25 treat the entire document as a flat term set, so query-centric relevance signals are overwhelmed by background content (Li et al., 9 Sep 2025); a toy calculation after this list illustrates the effect.
- Dispersed evidence: Queries may require synthesizing evidence located in distant parts of a document (sometimes thousands of tokens apart), making simple truncation or fixed-windowed processing ineffective for high recall in open-ended QA (Saad-Falcon et al., 12 Feb 2024).
- Hierarchical and multi-modal structure: Long documents encompass headings, sections, tables, and figures, each with their own rhetorical and pragmatic roles. Flat aggregation or chunking approaches cannot model cross-section dependencies or utilize layout cues.
- Resource constraints: The computational and memory cost of standard self-attention mechanisms in Transformer architectures renders end-to-end encoding of long documents intractable for most practical scenarios unless special adaptations are made (Chen et al., 2022).
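The length-normalization effect behind the first challenge can be made concrete with a toy calculation. The sketch below (not drawn from the cited work; the corpus statistics and parameter values are illustrative assumptions) scores the same three query-term occurrences once inside a focused 100-token passage and once inside a 10,000-token document, showing how BM25 down-weights the latter:

```python
# Toy illustration of BM25 signal dilution on long documents.
# All numbers (tf, document lengths, df, corpus size) are illustrative assumptions.
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Score a single query term against one document with the standard BM25 formula."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Three query-term occurrences in a focused 100-token passage ...
short_score = bm25_term_score(tf=3, doc_len=100, avg_doc_len=500, df=50, n_docs=10_000)
# ... versus the same three occurrences buried in a 10,000-token document.
long_score = bm25_term_score(tf=3, doc_len=10_000, avg_doc_len=500, df=50, n_docs=10_000)

print(f"passage-level score:  {short_score:.2f}")  # noticeably higher
print(f"full-document score:  {long_score:.2f}")   # heavily down-weighted by length normalization
```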
LDR additionally covers several variants:
| LDR Task | Core Output | Example Domain |
|---|---|---|
| Document-level ranking | Full documents | Web search, legal search |
| Passage/section retrieval | Localized text spans | QA, scientific search |
| Evidence extraction | Spans or sentences | Fact verification |
| Structure-aware retrieval | Nodes in hierarchy | HTML, PDF, outlines |
| Multimodal retrieval | Text + tables/figures | Patent, news, reports |
2. Model Paradigms, Architectures, and Indexing Techniques
2.1 Passage Aggregation and Divide-and-Conquer
Transformer-based models with context windows of 512–4096 tokens necessitated the widespread adoption of passage-level representations. Typical strategies segment each document into fixed or variable-length chunks, index or encode the chunks independently, and aggregate per-chunk scores by maximum (“MaxP”), sum (“SumP”), or learned aggregation (PARADE) (Li et al., 9 Sep 2025, Li et al., 2021).
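As a minimal illustration of the MaxP pattern, the sketch below segments a document with a sliding window and scores it by the maximum query–chunk similarity. The `encode` function is a hashed bag-of-words placeholder standing in for any dense or learned sparse encoder; it and the window sizes are assumptions for illustration, not any specific cited model:

```python
# Minimal MaxP-style sketch: chunk the document, score each chunk, take the max.
import numpy as np

def encode(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding: hashed bag-of-words, L2-normalized (illustrative only)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def chunk(doc: str, size: int = 128, stride: int = 64) -> list[str]:
    """Fixed-size sliding-window chunking over whitespace tokens."""
    toks = doc.split()
    return [" ".join(toks[i:i + size]) for i in range(0, len(toks), stride)]

def score_maxp(query: str, doc: str) -> float:
    """MaxP aggregation: document score = max over per-chunk query-chunk similarities."""
    q = encode(query)
    chunks = chunk(doc) or [doc]
    return max(float(q @ encode(c)) for c in chunks)
```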
However, fixed-size chunking exhibits significant trade-offs. As shown by (Bhat et al., 27 May 2025), smaller (64–128 tokens) chunks optimize for fine-grained fact extraction (e.g., SQuAD), while larger chunks (512–1024 tokens) better capture distributed or high-level context necessary for tasks such as NarrativeQA. Embedding-model-specific biases further complicate chunk-size selection: long-context pretrained decoders (e.g., Stella) benefit from large chunks, while encoder-based models (e.g., Snowflake) perform better with smaller, entity-dense chunks.
Recent advances (e.g., (Dong et al., 23 Apr 2024)) move beyond fixed-length chunking by using content-structure to define boundaries and representing each chunk in multiple semantic views (raw text, summary, keywords). Multi-view content-aware indexing dramatically boosts recall with no model retraining, outperforming all previous chunking baselines by 16–43 absolute points at standard recall@k (WikiWeb2M dataset).
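The following sketch illustrates the multi-view idea in the spirit of MC-indexing: each structure-derived chunk is indexed under several views (raw text, summary, keywords) that all resolve back to the same chunk, and retrieval keeps the best score per chunk across views. The `summarize` and `extract_keywords` stubs stand in for LLM calls; the data structure and scoring are assumptions for illustration, not the published implementation:

```python
# Sketch of multi-view chunk indexing: several views per chunk, max-over-views scoring.
from dataclasses import dataclass

@dataclass
class View:
    chunk_id: int
    kind: str      # "raw" | "summary" | "keywords"
    text: str

def summarize(text: str) -> str:          # placeholder for an LLM-generated summary
    return " ".join(text.split()[:30])

def extract_keywords(text: str) -> str:   # placeholder for LLM/statistical keyword extraction
    return " ".join(sorted(set(text.lower().split()), key=len, reverse=True)[:10])

def build_views(chunks: list[str]) -> list[View]:
    views = []
    for cid, c in enumerate(chunks):
        views += [View(cid, "raw", c),
                  View(cid, "summary", summarize(c)),
                  View(cid, "keywords", extract_keywords(c))]
    return views

def retrieve(query: str, views: list[View], sim, k: int = 5) -> list[int]:
    """Score every view with a user-supplied sim(query, text), keep the best score per chunk."""
    best: dict[int, float] = {}
    for v in views:
        s = sim(query, v.text)
        best[v.chunk_id] = max(best.get(v.chunk_id, float("-inf")), s)
    return sorted(best, key=best.get, reverse=True)[:k]
```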
2.2 Hierarchical and Discourse-Aware Models
Hierarchical encoding architectures exploit the document’s natural tree or section structure. In methods such as PARADE and DISRetrieval (Chen et al., 26 May 2025), local chunk representations are aggregated with lightweight transformers (or other neural combiners) that attend to structural relations, discourse coherence, and multi-level context. For instance, DISRetrieval demonstrates that rhetorical structure theory (RST) trees—supplemented by LLM-driven hierarchical summarization at internal nodes—preserve both semantic relationships and natural document flow, yielding substantially improved retrieval as measured by token-level F1 and downstream QA accuracy on QASPER and QuALITY benchmarks.
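A simplified sketch of the hierarchical pattern follows: leaves hold text spans, internal nodes hold summaries, and retrieval scores every node before expanding the best-scoring ones to their leaf spans. The tree construction, node contents, and `sim` function are assumptions chosen for clarity; this is not the DISRetrieval codebase:

```python
# Illustrative hierarchical retrieval over a discourse-style tree.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                       # leaf span, or an (assumed) LLM summary at internal nodes
    children: list["Node"] = field(default_factory=list)

def collect_leaves(node: Node) -> list[str]:
    if not node.children:
        return [node.text]
    return [leaf for c in node.children for leaf in collect_leaves(c)]

def retrieve(query: str, root: Node, sim, k: int = 3) -> list[str]:
    # Flatten the tree and score every node (leaves and summaries alike) ...
    stack, scored = [root], []
    while stack:
        n = stack.pop()
        scored.append((sim(query, n.text), n))
        stack.extend(n.children)
    # ... then return the leaf spans under the top-k nodes, deduplicated in order.
    scored.sort(key=lambda t: t[0], reverse=True)
    out, seen = [], set()
    for _, n in scored[:k]:
        for leaf in collect_leaves(n):
            if leaf not in seen:
                seen.add(leaf)
                out.append(leaf)
    return out
```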
Element- and structure-aware contrastive frameworks (SEAL (Huang et al., 28 Aug 2025)) further incorporate explicit HTML/markup tags, capturing fine-grained structural semantics and boosting both offline NDCG and production CTR without retriever redesign.
3. Efficient Attention, Sparse Methods, and Model Scaling
The prohibitive scaling of self-attention in vanilla Transformers catalyzed multiple efficient encoding strategies for LDR:
- Local self-attention: Instead of full-sequence attention, sliding-window local attention (TKL (Hofstätter et al., 2020)) restricts context to fixed-width windows with overlap, allowing models to inspect thousands of tokens per document within practical resource budgets. Kernel pooling and adaptive evidence pooling help surface highly relevant regions efficiently (a toy mask construction is sketched after this list).
- Segment-interaction representations: SeDR (Chen et al., 2022) introduces segment-aware document embeddings by allowing inter-segment [CLS]-token communication at each transformer layer, yielding segment-sensitive yet globally contextualized document vectors. Retrieval is then cast as a max-pool over all segment similarities.
- State-space and mixer architectures: M2-BERT (Saad-Falcon et al., 12 Feb 2024) replaces quadratic attention with efficient state-space layers, scaling to sequences of tens of thousands of tokens and outperforming parameter-matched Transformers by over 23 nDCG@10 points on LoCoV1, which covers retrieval tasks where chunking is ineffective. Orthogonal projection losses and half-precision training further enable long-sequence learning under GPU memory constraints.
- Learned sparse retrieval adaptations: Sparse lexical neural models (LSR, e.g. SPLADE) segment documents and aggregate representations or scores via schemes like Rep-max, Score-max, or, critically, sequential dependence models (ExactSDM, SoftSDM) (Nguyen et al., 2023, Lionis et al., 31 Mar 2025). ExactSDM, which models unordered and ordered n-gram proximity within chunks, robustly outperforms prior aggregation heuristics by up to +2 nDCG@10 points on long documents, with negligible benefits from soft expansion for long contexts.
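To make the local-attention idea concrete, the sketch below builds the banded mask used by sliding-window attention and reports how small a fraction of the full O(n²) attention matrix is actually computed. The window size is an illustrative assumption, not the TKL configuration:

```python
# Banded mask behind sliding-window local self-attention.
import numpy as np

def local_attention_mask(seq_len: int, window: int = 64) -> np.ndarray:
    """Boolean mask where mask[i, j] is True iff token i may attend to token j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = local_attention_mask(seq_len=4096, window=64)
full_cost = mask.size            # number of entries a full-attention model would score
local_cost = int(mask.sum())     # only the in-window entries are actually computed
print(f"fraction of full attention computed: {local_cost / full_cost:.3%}")
```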
4. Knowledge Distillation, Fine-Grained Representation, and Training Dynamics
Standard knowledge distillation (KD), in which dense retrievers learn from cross-encoder teachers at the document level, fails for LDR because of a granularity mismatch rooted in the scope hypothesis (Zhou et al., 2022): cross-encoders can focus on relevant subspans, whereas dense bi-encoders must compress the whole document into a single vector. Fine-grained distillation (FGD) addresses this by generating globally consistent multi-granular embeddings—spanning full-document, passage, and sentence levels—using fixed, attention-based token weightings shared across granularities, so that all representations live in the same latent space. A multi-granular, aligned distillation loss enforces the match to cross-encoder teachers at each granularity, yielding state-of-the-art gains (e.g., MRR@100 = 0.440, +1.8 points over COSTA, on MS MARCO Document). Each granularity's contribution is independently critical; document-level KD alone is ineffective. At inference, a single dot product per document suffices, as all extra complexity is confined to training time.
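A conceptual sketch of the multi-granular alignment idea follows. The pooling scheme, loss form, and temperature are assumptions chosen for clarity rather than a reproduction of the published FGD training recipe:

```python
# Conceptual sketch of multi-granular distillation: shared attention-weighted pooling
# per span, with a per-granularity KL alignment to cross-encoder teacher scores.
import torch
import torch.nn.functional as F

def pooled_embedding(token_emb: torch.Tensor, attn_w: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Attention-weighted pooling over one span.
    token_emb: (L, d), attn_w: (L,), mask: (L,) with 1 for tokens inside the span."""
    w = torch.softmax(attn_w.masked_fill(mask == 0, float("-inf")), dim=0)
    return (w.unsqueeze(-1) * token_emb).sum(dim=0)

def distill_loss(query_emb: torch.Tensor, span_embs: torch.Tensor,
                 teacher_scores: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over candidate spans at one granularity.
    query_emb: (d,), span_embs: (n, d), teacher_scores: (n,)."""
    student_logits = span_embs @ query_emb / temperature
    return F.kl_div(F.log_softmax(student_logits, dim=0),
                    F.softmax(teacher_scores / temperature, dim=0),
                    reduction="sum")

# Total loss (illustrative): sum the alignment losses over document-, passage-, and
# sentence-level spans produced by the same pooling weights.
# loss = distill_loss(q, doc_embs, t_doc) + distill_loss(q, psg_embs, t_psg) + distill_loss(q, sent_embs, t_sent)
```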
Models leveraging sentence- or block-level dynamic segmentation (BReps (Li et al., 28 Jan 2025)) further support the paradigm that fine-grained block representations, paired with weighted score aggregation, yield measurable gains (e.g., NDCG@10 +3–19% vs. single-vector LLM baselines) while substantially reducing inference latency.
5. Robustness, Label Noise, and End-to-End Pipelines
Sparse and incomplete relevance annotations, as are endemic to large-scale LDR corpora (e.g., MS MARCO Document), demand methods robust to label noise. Bag sampling and group-wise localized contrastive estimation (LCE) (Li et al., 2022) partition negative candidate sets into bags, enabling convolutional and pooling-based aggregation that smooths annotation noise and exploits the first-stage retrieval ordering. This "bag" paradigm achieves MRR@100 = 0.489 (single model) on the MS MARCO Document leaderboard, outperforming traditional dual-encoder and ensemble methods.
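The sketch below captures the bag-sampling and group-wise contrastive pattern under simplifying assumptions (contiguous bags over the first-stage ranking, a single positive per group); it is illustrative rather than the published training code:

```python
# Sketch of bag sampling plus a localized contrastive (LCE-style) loss per query group.
import random
import torch
import torch.nn.functional as F

def sample_negative_bags(ranked_doc_ids: list[int], bag_size: int = 8, n_bags: int = 2) -> list[int]:
    """Partition a first-stage ranking into contiguous bags and sample whole bags as negatives."""
    bags = [ranked_doc_ids[i:i + bag_size] for i in range(0, len(ranked_doc_ids), bag_size)]
    chosen = random.sample(bags, min(n_bags, len(bags)))
    return [doc_id for bag in chosen for doc_id in bag]

def lce_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Localized contrastive loss for one query group: the positive sits at index 0."""
    logits = torch.cat([pos_score.view(1), neg_scores.view(-1)])
    target = torch.zeros(1, dtype=torch.long)   # class 0 = the positive document
    return F.cross_entropy(logits.unsqueeze(0), target)
```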
Block- or bag-level pre-ranking for local evidence selection (key-block selection (Li et al., 2021)) provides a computationally efficient alternative to passage sliding windows or full-transformer attention, preserving cross-block dependencies at a cost comparable to a single BERT pass and without custom CUDA kernels.
6. Applications: Structured, Multimodal, and Interactive LDR
Modern LDR systems increasingly address multimodal (text+figure+table), structured, and even video-to-document scenarios:
- HTML/structured document search: SEAL (Huang et al., 28 Aug 2025) and MC-indexing (Dong et al., 23 Apr 2024) incorporate explicit markup signals, increasing recall by up to +43% and improving downstream QA fidelity.
- Multimodal retrieval: Contextualized late-interaction models in MLLMs (e.g., SV-RAG (Chen et al., 2 Nov 2024), URaG (Shi et al., 13 Nov 2025)) treat each page or view as a separate evidence unit, using frozen vision-language LLMs with small retrieval and QA adapters to achieve top-1/5 page accuracy of 83.7%/97.6% (ColQwen2 on VisR-Bench (Chen et al., 10 Aug 2025)), with efficiency advantages (up to ~56% lower FLOPs for URaG). Performance remains bottlenecked by structured tables and low-resource languages.
- Long video understanding: DrVideo (Ma et al., 18 Jun 2024) recasts long-video QA as iterative, frame-centric text LDR over agent-augmented document representations, using standard RAG loops for stepwise evidence accumulation and chain-of-thought final answer production.
- Document-aware passage retrieval: DAPR (Wang et al., 2023) demonstrates that 53.5% of SoTA passage-retriever failures are due to missing document context (coreference, topic, multi-hop). Contextualized passage representations (e.g., prepending titles, keyphrases, coreference annotations) substantially improve retrieval on hard queries, though joint segment-context encoders are needed for true robustness.
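A minimal sketch of the context-augmentation recipe suggested by these findings: each passage is indexed with its document title and optional keyphrases prepended as plain text before encoding or lexical indexing. The separator token, field order, and example values are assumptions, not the DAPR implementation:

```python
# Sketch of document-context augmentation for passage retrieval.
from dataclasses import dataclass

@dataclass
class Passage:
    doc_title: str
    keyphrases: list[str]
    text: str

def contextualize(p: Passage) -> str:
    """Text actually fed to the passage encoder (or indexed by BM25)."""
    ctx = p.doc_title
    if p.keyphrases:
        ctx += " | " + ", ".join(p.keyphrases)
    return f"{ctx} [SEP] {p.text}"

p = Passage("Annual Report 2023",
            ["revenue growth", "supply chain"],
            "The services segment grew by double digits despite logistics constraints.")
print(contextualize(p))
```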
7. Open Directions and Future Challenges
Current LDR research faces numerous unresolved issues:
- Efficiency–effectiveness trade-off: Chunking, sparse attention, and local pooling mitigate compute/memory cost but can lose fine-grained signals, limiting nDCG on nuanced QA (Li et al., 9 Sep 2025, Bhat et al., 27 May 2025).
- Dynamic and semantic chunking: Adapting chunk boundaries dynamically (topic segmentation, discourse parsing, query-aware windowing) and leveraging LLM-driven multi-view representations are open research avenues (Chen et al., 26 May 2025, Dong et al., 23 Apr 2024).
- Structure and multimodality: Integrating layout/markup, table structure, and non-textual elements into long-context encoders, as well as defining appropriate multimodal relevance metrics, is largely unsolved (Huang et al., 28 Aug 2025, Chen et al., 10 Aug 2025).
- Faithfulness, attribution, and end-to-end generation: Span-grounded objectives and attribution-aware training are necessary to guarantee that LDR-augmented RAG/generation models do not hallucinate and always cite retrieved, gold-evidence spans.
- Unified retrieval–generation pipelines: The next frontier is joint training of retriever, reader, and generator in a single, end-to-end (potentially LLM-based) architecture that adaptively retrieves further document regions conditioned on downstream uncertainty (Li et al., 9 Sep 2025, Shi et al., 13 Nov 2025).
LDR sits at the intersection of efficient indexing, hierarchical modeling, and reasoning-capable LLMs. Breakthroughs will likely stem from architectures that dynamically combine sparse long-context encoding, structure-aware chunk selection, multi-vector retrieval, and tightly coupled retrieval-generation loops, all rigorously evaluated under span- and structure-aware benchmarks.