Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

Published 8 May 2026 in cs.CV | (2605.08421v1)

Abstract: Visual Document Retrieval (VDR) models mostly rely on late-interaction architectures, in which a document is represented by a set of local patch embeddings that are matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of documents when estimating the relevance of a document to a query. In practice, this leads to errors, because relevance can originate from the layout structure of documents whose heterogeneous layouts combine figures, tables, and text. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions encoding document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.

Authors (2)

Summary

The paper introduces a descriptor-guided global token that integrates spatial and semantic layout for enhanced document retrieval.
It utilizes self-attention and contrastive InfoNCE loss to align visual patches with textual structural cues without increasing inference complexity.
Experiments on the ViDoRe-v2 benchmark reveal significant gains in nDCG@5 and MAP@5, underscoring the synergy of local and global representations.

Descriptor-Guided Global Layout Modeling for Visual Document Retrieval

Problem Formulation and Motivation

Visual Document Retrieval (VDR) systems increasingly power knowledge-intensive applications such as retrieval-augmented generation (RAG), where accurate retrieval of heterogeneous, visually structured documents is essential for factual grounding. Conventional late-interaction retrievers, which typically apply MaxSim over local patch embeddings, treat documents as unordered “bags of patches” and are optimized for local visual-textual matches. This inductive bias prioritizes fine-grained evidence (individual words, cells, or icons) while underrepresenting global document layout. As a result, these systems may fail to recognize relevance rooted in cross-region relationships, especially where meaning depends on the holistic arrangement of text, tables, and images. The failure is structural: late-interaction models lack an explicit mechanism to encode and use global layout signals, which matter most for complex, visually diverse documents.
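
The MaxSim scoring mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the function names, the cosine normalization, and the toy two-dimensional vectors are all assumptions made for the example. It shows why the score behaves like a bag of patches: reordering the patch list does not change the result.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_patches):
    """Late-interaction score: each query token keeps only its best-matching
    patch (max cosine similarity), and the per-token maxima are summed."""
    q = [l2_normalize(t) for t in query_tokens]
    d = [l2_normalize(p) for p in doc_patches]
    return sum(max(dot(t, p) for p in d) for t in q)


# Toy embeddings (hypothetical values, 2 query tokens and 3 page patches).
query = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]

score = maxsim_score(query, page)
# Same patches in a different order: the scoring function itself is order-invariant.
score_shuffled = maxsim_score(query, list(reversed(page)))
```

In a real encoder each patch embedding already carries some positional information, so the invariance shown here is a property of the scoring step only, not of the whole model.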

Methodology: Structured Global Representation via Textual Descriptors

The paper addresses this architectural constraint by injecting a trainable global layout token, which acts as a dedicated aggregation variable, into the multi-vector vision encoder. This token is contextualized over all patches using self-attention, so it can encode spatial and semantic relationships that are lost when patches are matched independently. Crucially, during training, the global token is optimized using automatically generated textual descriptors that summarize the page’s spatial and organizational structure, abstracting away local content details. The key components of the approach are:

Descriptor-Guided Global Token: Appending a [CLS]-style global token to the patch sequence allows joint modeling of local and global layout information within the transformer layers.
Textual Descriptor Supervision: High-level, automatically generated textual summaries of layout (not content) serve as supervisory signals, guiding the global token to encode true page-level structure.
InfoNCE Alignment: A contrastive loss (InfoNCE) aligns the global visual token with its textual structural descriptor counterpart, ensuring that structural semantics are embedded in the retrieval representation.
Auxiliary Patch-Descriptor Alignment: Local patch embeddings are also aligned with relevant tokens from the structural descriptor, grounding fine-grained evidence in global context, and ensuring local-global synergy.
No Descriptor Dependence at Inference: At retrieval time, the model operates solely on images, maintaining computational efficiency (critical for real-time or large-scale deployment).
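
The first three components can be illustrated with a small sketch. Everything here is an assumption made for illustration: the readout is a single attention step with no learned projections, the loss is one-directional (global token to descriptor), and the temperature of 0.07 and all vectors are hypothetical. The source does not specify these details.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_token_readout(global_query, patches):
    """One scaled dot-product attention step: the global token attends over
    all patches and returns their attention-weighted average."""
    scale = math.sqrt(len(global_query))
    weights = softmax([dot(global_query, p) / scale for p in patches])
    dim = len(global_query)
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]


def info_nce(global_tokens, descriptor_embs, temperature=0.07):
    """InfoNCE over a batch: pair i is the positive, every other descriptor
    in the batch is a negative. Returns the mean loss."""
    g = [normalize(v) for v in global_tokens]
    t = [normalize(v) for v in descriptor_embs]
    losses = []
    for i, g_i in enumerate(g):
        logits = [dot(g_i, t_j) / temperature for t_j in t]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two pages with two patches each (hypothetical embeddings).
pages = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
global_query = [0.0, 0.0]  # hypothetical learned query; zeros give uniform attention
global_tokens = [global_token_readout(global_query, p) for p in pages]
descriptors = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for encoded layout descriptions

loss_matched = info_nce(global_tokens, descriptors)
loss_swapped = info_nce(global_tokens, descriptors[::-1])
```

With matched pairs the loss is close to zero, and swapping the descriptors makes it large, which is the signal that pushes the global token toward its own page's layout description. At retrieval time only the image-side computation would run, consistent with the last component in the list.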

The joint objective combines the standard late-interaction retrieval loss and the two structural alignment losses, implemented efficiently using LoRA adapters with frozen base encoder weights.
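
One plausible way to write this joint objective is shown below. The weighting coefficients \lambda_g and \lambda_p, the temperature \tau, the batch size B, and the similarity function sim are notation assumed here for illustration; the summary does not state the actual weights or the exact form of each term. Here g_i denotes the global token of page i and t_i the embedding of its layout descriptor.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{ret}}
  + \lambda_{g}\,\mathcal{L}_{\text{global}}
  + \lambda_{p}\,\mathcal{L}_{\text{patch}},
\qquad
\mathcal{L}_{\text{global}}
  = -\frac{1}{B}\sum_{i=1}^{B}
    \log\frac{\exp\bigl(\mathrm{sim}(g_i, t_i)/\tau\bigr)}
             {\sum_{j=1}^{B}\exp\bigl(\mathrm{sim}(g_i, t_j)/\tau\bigr)}
```

Under this reading, \mathcal{L}_{\text{ret}} is the standard late-interaction retrieval loss, \mathcal{L}_{\text{global}} is the InfoNCE alignment of the global token, and \mathcal{L}_{\text{patch}} is the auxiliary patch-descriptor alignment.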

Experimental Results and Numerical Performance

Evaluation on the ViDoRe-v2 benchmark, spanning four datasets (economics reports, biomedical, ESG human-tagged, and ESG restaurant), demonstrates strong empirical gains for the proposed model, referred to as Gbl-Desc-FT. Compared to the strongest architecturally comparable baseline, ColQwen2.5, Gbl-Desc-FT achieves:

+2.4 nDCG@5 and +2.3 MAP@5 average improvements across datasets.
Statistically significant gains on layout-rich datasets, including ESGH (ESG human-tagged) and ESGR (ESG restaurant), which require holistic spatial reasoning.
Parameter efficiency: The 3B-parameter Gbl-Desc-FT model matches or surpasses the performance of much larger systems (e.g., ColNomic-7B), demonstrating the effect of structural inductive bias over brute-force scaling.
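
For readers unfamiliar with the two metrics, the sketch below computes nDCG@5 and average precision at 5 for a single query; MAP@5 is the mean of the latter over all queries. It is a generic illustration, not the benchmark's evaluation code. Two assumptions are worth flagging: the ranked list is taken to contain every relevant document for the query, and average precision is normalized by min(k, number of relevant documents), which is one of several conventions in use.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision_at_k(ranked_relevances, k=5):
    """Mean of precision values at each rank (within top k) that holds a
    relevant document. Assumed normalizer: min(k, number of relevant docs)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevances[:k], start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / rank
    total_relevant = sum(1 for rel in ranked_relevances if rel > 0)
    normalizer = min(k, total_relevant)
    return precision_sum / normalizer if normalizer > 0 else 0.0


# Hypothetical ranking for one query: relevant pages at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0, 0]
ndcg = ndcg_at_k(ranking, k=5)
ap = average_precision_at_k(ranking, k=5)
```

Both metrics reward placing relevant pages near the top of the list, so a retriever that misjudges layout-dependent relevance loses score even when the right page is somewhere in its top five.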

Ablation studies confirm that neither local nor global cues alone are sufficient: high retrieval accuracy emerges from their combination. Removal of the global token or its alignment loss consistently degrades nDCG and MAP scores. Moreover, naive pooling strategies (mean, max, median over patches) for global representation are substantially inferior, highlighting the necessity of a dedicated, explicitly supervised global token.
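
The naive pooling baselines named above are easy to state in code. The sketch below is illustrative only (the function name and toy values are assumptions): each baseline reduces the patch embeddings dimension by dimension, with no learned parameters and no supervision.

```python
from statistics import mean, median


def pool_patches(patches, how="mean"):
    """Collapse a list of patch embeddings into one vector by applying a
    fixed reducer (mean, max, or median) to each embedding dimension."""
    reducers = {"mean": mean, "max": max, "median": median}
    reduce = reducers[how]
    return [reduce(column) for column in zip(*patches)]


# Three hypothetical 2-dimensional patch embeddings.
patches = [[1.0, 4.0], [2.0, 0.0], [6.0, 2.0]]

pooled_mean = pool_patches(patches, "mean")
pooled_max = pool_patches(patches, "max")
pooled_median = pool_patches(patches, "median")
# Reordering the patches leaves every pooled vector unchanged.
pooled_mean_reversed = pool_patches(list(reversed(patches)), "mean")
```

Because these reducers are fixed and ignore patch order, nothing in them can be trained to reflect page structure, which is one way to read the ablation result that a dedicated, supervised global token works better.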

Implications and Theoretical Perspective

This research identifies and addresses a structural shortcoming of standard late-interaction VDR models: they cannot explicitly model, or retrieve on the basis of, document-level layout. The methodology shows that global layout priors can be internalized directly into the retrieval embedding space using auxiliary textual supervision, without requiring descriptors at inference time. This enables compact models to rival and sometimes exceed the empirical performance of much larger, less structured multimodal retrievers, particularly on tasks where spatial and semantic heterogeneity governs relevance.

Implications include:

RAG Pipelines: Upstream retrieval quality for multi-modal RAG in complex documents is directly improved via structured global reasoning.
Model Compression and Deployment: Explicit structural modeling allows for smaller retrieval models with competitive or superior performance compared to parameter-scaled baselines, reducing memory and inference costs.
Generalization Across Domains: The approach demonstrates robust transfer across substantially different document domains, provided structural cues are salient and coupled with local evidence.

Theoretically, the work suggests that retrieval priors can and should be instantiated as parameterized, task-specific structural representations, rather than left to implicit or ad hoc aggregation mechanisms. It also provides an empirical argument for using auxiliary semantic supervision, beyond standard image-caption pairs, to encode the inductive biases required for advanced document search.

Prospects for Future Research in AI

Potential directions prompted by this framework include:

Multi-page and Cross-Document Reasoning: Extending global layout modeling across sequences of pages, enabling structured retrieval in long-form, multi-page documents with complex inter-page references.
Generalized Structured Supervision: Applying similar descriptor-driven global tokens to other modalities (e.g., 3D scene retrieval, multi-modal graph search) or to tasks involving hierarchical structure (such as scientific literature, legal documents).
Joint Global-Local Adaptation: Exploring adaptive mechanisms that dynamically adjust the balance of global versus local retrieval signals, based on query intent or document type.

Conclusion

The descriptor-guided global token paradigm offers an effective, efficient mechanism for encoding structural layout priors directly into late-interaction visual document retrievers. The approach addresses the limitation of “bag-of-patches” models by explicitly training the retrieval representation to internalize cross-region and layout-level semantics, yielding average gains of +2.4 nDCG@5 and +2.3 MAP@5 over ColQwen2.5 on ViDoRe-v2. These findings support a class of retrieval models that combine fine-grained and holistic evidence for visually diverse document corpora, and they suggest further work at the intersection of multimodal representation learning and structure-aware AI.
