Adaptive Visual In-document Retrieval (AVIR)
- The paper introduces AVIR, a retrieval-augmented pipeline that reduces processed pages by about 70% while maintaining or improving state-of-the-art VQA accuracy.
- AVIR employs a lightweight page retrieval model with self-attention scoring and adaptive clustering to efficiently select relevant pages from multi-page documents.
- Empirical results on MP-DocVQA, SlideVQA, and DUDE benchmarks demonstrate improved efficiency and competitive performance in complex visual question answering tasks.
Adaptive Visual In-document Retrieval (AVIR) is a retrieval-augmented pipeline for efficient and accurate multi-page document question answering. It addresses the computational and attention challenges inherent in long-form visual question answering (VQA) by scoring and adaptively selecting the document pages most relevant to a given question, then leveraging a frozen large vision-language model (LVLM) for answer generation. AVIR reduces the average number of processed pages by approximately 70% compared to standard end-to-end approaches, while maintaining or surpassing state-of-the-art accuracy across multiple VQA benchmarks (Li et al., 17 Jan 2026).
1. Framework Overview
AVIR operates through a multi-stage retrieval and selection design. The system begins with a lightweight page retrieval model that assigns relevance probabilities to each page with respect to the question. Pages are then selected using an adaptive clustering and thresholding routine: for longer documents, k-means clustering with $k = 2$ partitions the pages by score; for shorter documents, a simple thresholding strategy is used. The final subset of relevant pages is fed into a frozen LVLM, which generates answers without additional fine-tuning.
A plausible implication is that AVIR’s modularity allows easy integration with new LVLM architectures, where only the retrieval head requires adaptation per dataset or domain.
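The staged design above can be sketched as a thin orchestration layer. Here `score_pages`, `select_pages`, and `generate_answer` are hypothetical stand-ins for the retrieval head, the adaptive selector, and the frozen LVLM; this is a structural sketch, not the paper's implementation.

```python
def avir_answer(pages, question, score_pages, select_pages, generate_answer):
    """Hypothetical AVIR pipeline: score pages, select a subset, answer."""
    scores = score_pages(pages, question)       # per-page relevance in [0, 1]
    selected = select_pages(pages, scores)      # adaptive clustering/threshold
    return generate_answer(selected, question)  # frozen LVLM, no fine-tuning
```

Because each stage is passed in as a callable, swapping in a different LVLM or retrieval head requires no change to the orchestration itself.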
2. Lightweight Page Retrieval and Scoring
Architecture
- Encoder: The visual backbone uses a frozen Pix2Struct image-and-text encoder (1.5B parameters), processing each document page rendered at 512×512 pixels and concatenated with the question text.
- Hidden State Extraction: The encoder output is a sequence of hidden states, including a special [CLS] token embedding $h_i$ for each page $i$.
- Scoring Head: On top of the [CLS] embeddings $h_i$, two self-attention layers (hidden dimension $d$), followed by a linear layer and sigmoid activation, yield the relevance probability $r_i$:

$$r_i = \sigma\!\left(w^\top \mathrm{SelfAttn}(h_i) + b\right)$$

where $\sigma$ denotes the sigmoid function. The scoring head adds approximately 100M trainable parameters (∼4% of the LVLM).
Training
Binary labels $y_i \in \{0, 1\}$ indicate whether page $i$ contains the answer span. The loss function sums binary cross-entropy over pages:

$$\mathcal{L} = -\sum_{i=1}^{n}\left[\, y_i \log r_i + (1 - y_i) \log (1 - r_i) \,\right]$$
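The summed binary cross-entropy can be computed directly; `page_bce_loss` is a hypothetical helper name, and the clipping constant `eps` is an assumption for numerical stability.

```python
import math

def page_bce_loss(r, y, eps=1e-7):
    """Binary cross-entropy summed over pages.

    r: predicted per-page relevance probabilities,
    y: 1 if the page contains the answer span, else 0.
    """
    total = 0.0
    for ri, yi in zip(r, y):
        ri = min(max(ri, eps), 1 - eps)  # clip to avoid log(0)
        total -= yi * math.log(ri) + (1 - yi) * math.log(1 - ri)
    return total
```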
3. Adaptive Page Selection
AVIR’s selection algorithm adapts to document length and score distribution.
Long Documents ($n \ge 4$)
- Clustering: Apply k-means with $k = 2$ to the scores $\{r_i\}$, yielding centroids $\mu_0$ and $\mu_1$.
- Relevant Cluster Formation: Pages assigned to the cluster with the higher centroid form the candidate set $P_{\mathrm{rel}}$.
- Top-$K$ Filtering: If $|P_{\mathrm{rel}}| > K_{\max}$ (with $K_{\max} = 8$), select the top $K_{\max}$ pages by descending $r_i$.
Short Documents ($n < 4$)
A fixed relevance-probability threshold $T = 0.6$ selects pages:
- If $\max_i r_i \ge T$, choose all pages with $r_i \ge T$.
- Otherwise, select all pages.
Pseudocode
```python
def AdaptivePageSelector(P, R, T=0.6, K_max=8):
    """Select pages P given per-page relevance scores R."""
    n = len(P)
    if n < 4:
        # short document: fixed relevance threshold
        if max(R) >= T:
            return [p for p, r in zip(P, R) if r >= T]
        return P
    # long document: k-means (k=2) on the scores; kmeans returns per-page
    # cluster assignments and the two centroids
    c, mu_0, mu_1 = kmeans(R, k=2)
    high = 1 if mu_1 > mu_0 else 0
    rel = [(p, r) for (p, r), a in zip(zip(P, R), c) if a == high]
    if len(rel) > K_max:
        # keep the top-K_max pages by descending score
        rel = sorted(rel, key=lambda pr: pr[1], reverse=True)[:K_max]
    return [p for p, _ in rel]
```
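The `kmeans` helper the pseudocode assumes can be sketched as a minimal one-dimensional two-means over the scores; this is an illustrative stand-in, not necessarily the paper's exact clustering routine.

```python
def kmeans(R, k=2, iters=20):
    """Minimal 1-D two-means over relevance scores R.

    Returns per-score cluster assignments and the two centroids.
    Centroids are initialized at the min and max score.
    """
    assert k == 2
    mu = [min(R), max(R)]
    assign = [0] * len(R)
    for _ in range(iters):
        # assign each score to its nearest centroid
        assign = [0 if abs(r - mu[0]) <= abs(r - mu[1]) else 1 for r in R]
        # recompute centroids as cluster means
        for c in (0, 1):
            pts = [r for r, a in zip(R, assign) if a == c]
            if pts:
                mu[c] = sum(pts) / len(pts)
    return assign, mu[0], mu[1]
```

On a toy score vector such as `[0.05, 0.1, 0.92, 0.88, 0.12, 0.81]`, the low and high scores separate into two clusters, and the higher-centroid cluster is the candidate page set.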
4. LVLM Integration for Answer Generation
Selected pages (typically 2–4 per query) are rendered and packed into the input stream of a frozen Qwen2.5-VL-3B model, quantized with AWQ. The final input consists of the question, prefixed with the instruction (“Answer the question based on the following pages. Question: …”), followed by the pixel grids of each selected page in original reading order. The LVLM receives no parameter updates; all adaptation occurs in the retrieval component. Cross-modal attention in the LVLM enables contextual reasoning over the compressed document subset, and answer generation proceeds via token-wise autoregressive decoding.
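Input packing can be sketched as a chat-style payload of the kind vision-language models commonly consume; the exact Qwen2.5-VL message schema may differ, and `build_lvlm_input` is a hypothetical helper.

```python
def build_lvlm_input(question, selected_pages):
    """Assemble a hypothetical chat-style payload: instruction + question
    text first, then one image entry per selected page, preserving the
    original reading order."""
    content = [{"type": "text",
                "text": "Answer the question based on the following pages. "
                        f"Question: {question}"}]
    content += [{"type": "image", "image": page} for page in selected_pages]
    return [{"role": "user", "content": content}]
```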
This suggests that the separation between retrieval and answer models can decouple computational costs for context expansion from model parameter adaptation.
5. Empirical Results and Benchmark Performance
AVIR’s effectiveness has been evaluated on MP-DocVQA, SlideVQA, and DUDE datasets, with notable reductions in processed pages and improvements in prediction accuracy metrics.
- MP-DocVQA
- Average processed pages per query: 2.5 (vs. 8.3 for end-to-end).
- Top-1 Page Prediction Accuracy: 81.55%.
- Average Normalized Levenshtein Similarity (ANLS): 0.8458.
- SlideVQA
- Average processed pages per query: 2.9 (vs. 20.0 for end-to-end).
- Exact Match (EM): 60.3%.
- F1: 68.9%.
- DUDE
- Overall ANLS: 0.4905 (Extractive: 0.6754, Abstractive: 0.6404).
Comparative Tables
MP-DocVQA Results
| Method | Params | Page-Pred (%) | ANLS |
|---|---|---|---|
| SelfAttnScoring | 273 M | 81.55 | 0.6199 |
| M3DOCRAG (8 B) | — | 81.05 | 0.8444 |
| Qwen2.5-3B-AWQ | — | — | 0.8405 |
| AVIR (ours) | 3 B | 81.55 | 0.8458 |
SlideVQA Results
| Model | Params | EM | F1 |
|---|---|---|---|
| Qwen2.5-3B-AWQ | 3 B | 56.6 | 65.8 |
| AVIR (ours) | 3 B | 60.3 | 68.9 |
Top-K Ablation (SlideVQA)
| Strategy | Avg Pages | EM | F1 |
|---|---|---|---|
| No Retrieval (20 pg) | 20.0 | 56.6 | 65.8 |
| Top-K=1 | 1.0 | 52.7 | 60.2 |
| Top-K=2 | 2.0 | 56.5 | 64.6 |
| Top-K=4 | 4.0 | 58.3 | 66.8 |
| Top-K=8 | 8.0 | 57.2 | 65.7 |
| APS (ours) | 2.9 | 60.3 | 68.9 |
A plausible implication is that adaptive selection strategies allow for more efficient context compression than naive top-$K$ methods while retaining or improving predictive accuracy.
6. Significance and Practical Considerations
AVIR’s design mitigates computational bottlenecks and challenges posed by long documents in visual question answering, notably reducing the cost of LVLM inference without model fine-tuning. The retrieval module’s light footprint (∼100M parameters over a frozen LVLM) means adding AVIR does not require retraining large-scale multimodal models. Empirical performance demonstrates that adaptively selected context windows outperform fixed-length alternatives, delivering robust gains on multi-page scientific and presentation-style document benchmarks.
The modularity and dataset-specific retrieval adaptation suggest that AVIR is applicable to other visual document analysis tasks requiring selective context curation for downstream language reasoning.