Adaptive Visual In-document Retrieval (AVIR)
- The paper introduces AVIR, a retrieval-augmented pipeline that reduces processed pages by about 70% while maintaining or improving state-of-the-art VQA accuracy.
- AVIR employs a lightweight page retrieval model with self-attention scoring and adaptive clustering to efficiently select relevant pages from multi-page documents.
- Empirical results on MP-DocVQA, SlideVQA, and DUDE benchmarks demonstrate improved efficiency and competitive performance in complex visual question answering tasks.
Adaptive Visual In-document Retrieval (AVIR) is a retrieval-augmented pipeline for efficient and accurate multi-page document question answering. It addresses the computational and attention challenges inherent in long-form visual question answering (VQA) by scoring and adaptively selecting the document pages most relevant to a given question, then leveraging a frozen large vision-language model (LVLM) for answer generation. AVIR reduces the average number of processed pages by approximately 70% compared to standard end-to-end approaches, while maintaining or surpassing state-of-the-art accuracy across multiple VQA benchmarks (Li et al., 17 Jan 2026).
1. Framework Overview
AVIR operates through a multi-stage retrieval and selection design. The system begins with a lightweight page retrieval model that assigns relevance probabilities to each page with respect to the question. Pages are then selected using an adaptive clustering and thresholding routine: for longer documents, k-means clustering with $k = 2$ partitions the pages by score; for shorter documents, a simple thresholding strategy is used. The final subset of relevant pages is fed into a frozen LVLM, which generates answers without additional fine-tuning.
A plausible implication is that AVIR’s modularity allows easy integration with new LVLM architectures, where only the retrieval head requires adaptation per dataset or domain.
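The staged design above can be sketched as a thin orchestration layer. Here `score_pages`, `select_pages`, and `generate_answer` are hypothetical stand-ins for the retrieval head, the adaptive selector, and the frozen LVLM; this is a structural sketch, not the paper's implementation.

```python
def avir_answer(pages, question, score_pages, select_pages, generate_answer):
    """Hypothetical AVIR pipeline: score pages, select a subset, answer."""
    scores = score_pages(pages, question)       # per-page relevance in [0, 1]
    selected = select_pages(pages, scores)      # adaptive clustering/threshold
    return generate_answer(selected, question)  # frozen LVLM, no fine-tuning
```

Because each stage is passed in as a callable, swapping in a different LVLM or retrieval head requires no change to the orchestration itself.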
2. Lightweight Page Retrieval and Scoring
Architecture
- Encoder: The visual backbone uses a frozen Pix2Struct image-and-text encoder (1.5B parameters), processing each document page rendered at 512×512 pixels and concatenated with the question text.
- Hidden State Extraction: The encoder output is a sequence of hidden states, including a special [CLS] token embedding $h_i$ for each page $i$.
- Scoring Head: On top of the [CLS] embeddings $h_i$, two self-attention layers (hidden dimension $d$), followed by a linear layer and sigmoid activation, yield the relevance probability $r_i$:

$$r_i = \sigma\!\left(w^\top \mathrm{SelfAttn}(h_i) + b\right)$$

where $\sigma$ denotes the sigmoid function. The scoring head adds approximately 100M trainable parameters (∼4% of the LVLM).
Training
Binary labels $y_i \in \{0, 1\}$ indicate whether page $i$ contains the answer span. The loss function sums binary cross-entropy over pages:

$$\mathcal{L} = -\sum_{i=1}^{n}\left[\, y_i \log r_i + (1 - y_i) \log (1 - r_i) \,\right]$$
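The summed binary cross-entropy can be computed directly; `page_bce_loss` is a hypothetical helper name, and the clipping constant `eps` is an assumption for numerical stability.

```python
import math

def page_bce_loss(r, y, eps=1e-7):
    """Binary cross-entropy summed over pages.

    r: predicted per-page relevance probabilities,
    y: 1 if the page contains the answer span, else 0.
    """
    total = 0.0
    for ri, yi in zip(r, y):
        ri = min(max(ri, eps), 1 - eps)  # clip to avoid log(0)
        total -= yi * math.log(ri) + (1 - yi) * math.log(1 - ri)
    return total
```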
3. Adaptive Page Selection
AVIR’s selection algorithm adapts to document length and score distribution.
Long Documents ($n \ge 4$)
- Clustering: Apply k-means with $k = 2$ to the scores $\{r_i\}$, yielding centroids $\mu_0$ and $\mu_1$.
- Relevant Cluster Formation: Pages assigned to the cluster with the higher centroid form the candidate set $P_{\mathrm{rel}}$.
- Top-$K$ Filtering: If $|P_{\mathrm{rel}}| > K_{\max}$ (with $K_{\max} = 8$), select the top $K_{\max}$ pages by descending $r_i$.
Short Documents ($n < 4$)
A fixed relevance-probability threshold $T = 0.6$ selects pages:
- If $\max_i r_i \ge T$, choose all pages with $r_i \ge T$.
- Otherwise, select all pages.
Pseudocode
```python
def AdaptivePageSelector(P, R, T=0.6, K_max=8):
    """Select pages P given per-page relevance scores R."""
    n = len(P)
    if n < 4:
        # short document: fixed relevance threshold
        if max(R) >= T:
            return [p for p, r in zip(P, R) if r >= T]
        return P
    # long document: k-means (k=2) on the scores; kmeans returns per-page
    # cluster assignments and the two centroids
    c, mu_0, mu_1 = kmeans(R, k=2)
    high = 1 if mu_1 > mu_0 else 0
    rel = [(p, r) for (p, r), a in zip(zip(P, R), c) if a == high]
    if len(rel) > K_max:
        # keep the top-K_max pages by descending score
        rel = sorted(rel, key=lambda pr: pr[1], reverse=True)[:K_max]
    return [p for p, _ in rel]
```
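The `kmeans` helper the pseudocode assumes can be sketched as a minimal one-dimensional two-means over the scores; this is an illustrative stand-in, not necessarily the paper's exact clustering routine.

```python
def kmeans(R, k=2, iters=20):
    """Minimal 1-D two-means over relevance scores R.

    Returns per-score cluster assignments and the two centroids.
    Centroids are initialized at the min and max score.
    """
    assert k == 2
    mu = [min(R), max(R)]
    assign = [0] * len(R)
    for _ in range(iters):
        # assign each score to its nearest centroid
        assign = [0 if abs(r - mu[0]) <= abs(r - mu[1]) else 1 for r in R]
        # recompute centroids as cluster means
        for c in (0, 1):
            pts = [r for r, a in zip(R, assign) if a == c]
            if pts:
                mu[c] = sum(pts) / len(pts)
    return assign, mu[0], mu[1]
```

On a toy score vector such as `[0.05, 0.1, 0.92, 0.88, 0.12, 0.81]`, the low and high scores separate into two clusters, and the higher-centroid cluster is the candidate page set.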
4. LVLM Integration for Answer Generation
Selected pages (typically 2–4 per query) are rendered and packed into the input stream of a frozen Qwen2.5-VL-3B model, quantized with AWQ. The final input consists of the question, prefixed with the instruction (“Answer the question based on the following pages. Question: …”), followed by the pixel grids of each selected page in original reading order. The LVLM receives no parameter updates; all adaptation occurs in the retrieval component. Cross-modal attention in the LVLM enables contextual reasoning over the compressed document subset, and answer generation proceeds via token-wise autoregressive decoding.
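Input packing can be sketched as a chat-style payload of the kind vision-language models commonly consume; the exact Qwen2.5-VL message schema may differ, and `build_lvlm_input` is a hypothetical helper.

```python
def build_lvlm_input(question, selected_pages):
    """Assemble a hypothetical chat-style payload: instruction + question
    text first, then one image entry per selected page, preserving the
    original reading order."""
    content = [{"type": "text",
                "text": "Answer the question based on the following pages. "
                        f"Question: {question}"}]
    content += [{"type": "image", "image": page} for page in selected_pages]
    return [{"role": "user", "content": content}]
```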
This suggests that the separation between retrieval and answer models can decouple computational costs for context expansion from model parameter adaptation.
5. Empirical Results and Benchmark Performance
AVIR’s effectiveness has been evaluated on MP-DocVQA, SlideVQA, and DUDE datasets, with notable reductions in processed pages and improvements in prediction accuracy metrics.
- MP-DocVQA
- Average processed pages per query: 2.5 (vs. 8.3 for end-to-end).
- Top-1 Page Prediction Accuracy: 81.55%.
- Average Normalized Levenshtein Similarity (ANLS): 0.8458.
- SlideVQA
- Average processed pages per query: 2.9 (vs. 20.0 for end-to-end).
- Exact Match (EM): 60.3%.
- F1: 68.9%.
- DUDE
- Overall ANLS: 0.4905 (Extractive: 0.6754, Abstractive: 0.6404).
Comparative Tables
MP-DocVQA Results
| Method | Params | Page-Pred (%) | ANLS |
|---|---|---|---|
| SelfAttnScoring | 273 M | 81.55 | 0.6199 |
| M3DOCRAG (8 B) | — | 81.05 | 0.8444 |
| Qwen2.5-3B-AWQ | — | — | 0.8405 |
| AVIR (ours) | 3 B | 81.55 | 0.8458 |
SlideVQA Results
| Model | Params | EM | F1 |
|---|---|---|---|
| Qwen2.5-3B-AWQ | 3 B | 56.6 | 65.8 |
| AVIR (ours) | 3 B | 60.3 | 68.9 |
Top-K Ablation (SlideVQA)
| Strategy | Avg Pages | EM | F1 |
|---|---|---|---|
| No Retrieval (20 pg) | 20.0 | 56.6 | 65.8 |
| Top-K=1 | 1.0 | 52.7 | 60.2 |
| Top-K=2 | 2.0 | 56.5 | 64.6 |
| Top-K=4 | 4.0 | 58.3 | 66.8 |
| Top-K=8 | 8.0 | 57.2 | 65.7 |
| APS (ours) | 2.9 | 60.3 | 68.9 |
A plausible implication is that adaptive selection strategies allow for more efficient context compression than naive top-$K$ methods while retaining or improving predictive accuracy.
6. Significance and Practical Considerations
AVIR’s design mitigates computational bottlenecks and challenges posed by long documents in visual question answering, notably reducing the cost of LVLM inference without model fine-tuning. The retrieval module’s light footprint (∼100M parameters over a frozen LVLM) means adding AVIR does not require retraining large-scale multimodal models. Empirical performance demonstrates that adaptively selected context windows outperform fixed-length alternatives, delivering robust gains on multi-page scientific and presentation-style document benchmarks.
The modularity and dataset-specific retrieval adaptation suggest that AVIR is applicable to other visual document analysis tasks requiring selective context curation for downstream language reasoning.