SlideVQA: Visual Question Answering
- SlideVQA is a multi-modal visual question answering task requiring holistic reasoning over slide images, OCR content, and layout cues, across presentation decks and gigapixel pathology images.
- Approaches leverage models such as Faster R-CNN, LayoutLMv2, and vision transformers to extract and fuse visual and textual features for robust document and medical image analysis.
- Empirical results demonstrate improved evidence selection and answer accuracy, despite challenges in fine-grained OCR and complex multi-hop reasoning scenarios.
SlideVQA refers to a family of tasks, datasets, and models dedicated to visual question answering (VQA) on slide-based visual documents, including both standard presentation slides and gigapixel whole-slide pathology images. Across its usages, SlideVQA benchmarks target holistic reasoning over multiple modalities—highly structured visual layouts, text (both in natural language and as document elements), charts, figures, and sometimes even intricate biomedical features—requiring information extraction, logical reasoning, and occasionally numerical computation. Recent years have seen the introduction of prominent datasets and modeling approaches under the SlideVQA umbrella for both document understanding and medical imaging domains.
1. Definitions and Scope
SlideVQA, in its canonical form, is a VQA task where the system receives a slide image (or a set of slides/WSIs) and a natural language question related to that content, and must produce a discrete (often short) textual answer. The core challenge is the joint interpretation of visual layout, document OCR content, and the semantics of the question, sometimes involving multi-step or multi-image reasoning. Two principal domains are prevalent:
- Document SlideVQA: Operating on slide decks (e.g. PowerPoint, PDF) for information extraction, cross-slide evidence gathering, and question answering spanning textual, visual, and layout cues (Tanaka et al., 2023).
- Pathology SlideVQA (WSI-VQA, SlideBench-VQA, etc.): Tackling VQA on massive histopathology images, where the context size and visual diversity create unique information fusion and scaling problems (Chen et al., 2024; Chen et al., 2024).
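Across both domains, a single instance pairs one or more images with a question and a short answer. A minimal sketch of such a container follows; the class and field names are hypothetical and do not follow any released dataset's schema:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one SlideVQA instance; field names are
# illustrative, not taken from any released dataset.
@dataclass
class SlideVQAInstance:
    images: List[str]            # slide renders or WSI tiles (file paths)
    ocr_tokens: List[List[str]]  # OCR text per image (may be empty for pathology)
    question: str                # natural-language question
    answer: str                  # discrete, usually short, textual answer

inst = SlideVQAInstance(
    images=["deck/slide_03.png", "deck/slide_07.png"],
    ocr_tokens=[["Revenue", "2019"], ["Revenue", "2020"]],
    question="In which year was revenue higher?",
    answer="2020",
)
assert len(inst.images) == len(inst.ocr_tokens) == 2
```

The same structure accommodates multi-image reasoning (several entries in `images`) and pathology inputs (tile paths with empty OCR lists).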
2. Datasets and Annotation Protocols
2.1. Document SlideVQA Dataset
The SlideVQA dataset (Tanaka et al., 2023) aggregates 2.6k+ slide decks from SlideShare, comprising 52k+ individual slides and 14.5k questions with multi-level annotations. It covers 39 high-level categories, spanning business, science, and education topics. Each slide is annotated with dense bounding boxes across nine semantically meaningful classes (title, page text, diagram, table, caption, object text, figure, image, other text). The questions are divided into:
- Single-hop: Answerable from a single slide.
- Multi-hop: Requiring inter-slide evidence aggregation.
- Numerical reasoning: Requiring arithmetic computation over extracted entities, with answers paired to explicit arithmetic expressions.
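Illustrative records for the three question types might look as follows. The keys and values are hypothetical, not the dataset's actual JSON schema; only the pairing of numerical answers with explicit arithmetic expressions reflects the annotation protocol described above:

```python
# Hypothetical annotation records for the three SlideVQA question types.
single_hop = {
    "question": "What is the title of slide 4?",
    "answer": "Quarterly Results",
    "evidence_slides": [4],          # answerable from one slide
}
multi_hop = {
    "question": "Which product on the roadmap slide appears in the revenue table?",
    "answer": "Widget X",
    "evidence_slides": [2, 9],       # requires aggregating two slides
}
numerical = {
    "question": "How many more units were sold in Q2 than in Q1?",
    "answer": "150",
    "arithmetic_expression": "450 - 300",  # answer paired with explicit expression
    "evidence_slides": [6],
}
# The expression, when evaluated, must reproduce the gold answer.
assert eval(numerical["arithmetic_expression"]) == int(numerical["answer"])
```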
2.2. Pathology SlideVQA Datasets
WSI-VQA Dataset
The WSI-VQA corpus (Chen et al., 2024) comprises 8,672 question–answer pairs over 977 whole-slide images from TCGA-BRCA. Questions are distributed between multiple-choice (close-ended; 4,535) and open-ended (4,137). Annotations cover clinical and diagnostic entities: receptor status (ER, PR, HER2), histological subtype, margin status, grade, stage, survival, etc.
SlideInstruction and SlideBench-VQA
The SlideInstruction dataset (Chen et al., 2024) includes 4,181 WSIs (4,028 patients), 4,915 pathology reports, and 175,753 VQA pairs spanning 13 subcategories in microscopy, diagnosis, and clinical questions. SlideBench-VQA leverages these for comprehensive benchmarking, with explicit multi-class labeling and both closed- and open-ended answers.
| Dataset | Source | Images | QA Pairs | Domains/Categories |
|---|---|---|---|---|
| SlideVQA | SlideShare | 52k+ slides | 14.5k | 39 categories, business/science |
| WSI-VQA | TCGA | 977 WSIs | 8,672 | Pathology (cancer, clinical) |
| SlideInstruction | TCGA Reports | 4,181 WSIs | 175,753 | Microscopy, Diagnosis, Clinical |
3. Model Architectures and Training Methodologies
3.1. Document SlideVQA Modeling
The primary SlideVQA model (Tanaka et al., 2023) unifies evidence selection and question answering in a sequence-to-sequence framework, heavily leveraging modern layout-aware document models:
- Visual encoding: Faster R-CNN (ResNet-101 backbone) extracts object-level region features per slide.
- Textual encoding: OCR-derived tokens are embedded via WordPiece or similar.
- Layout modeling: Normalized 2D coordinates provide geometric context, fused with text and visual features à la LayoutLMv2.
- Evidence selection: A binary classifier determines if a slide contains answer-relevant content.
- Document-level Transformer (H-LayoutLMv2): Aggregates slide [CLS] vectors for inter-image contextualization.
- Seq2seq answer generation: Standard Transformer decoder attends over selected evidence to output (a) textual answers, (b) LaTeX arithmetic expressions for numerical questions.
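The pipeline above can be sketched schematically as follows. The encoder, selector, and generator are stand-in callables, not the actual H-LayoutLMv2/FiD implementation; only the overall data flow (encode, select evidence, generate, evaluate arithmetic) mirrors the description:

```python
import re

# Schematic of unified evidence selection + answer generation.
# All three components are placeholder callables supplied by the caller.
def answer_deck(slides, question, encode_slide, is_evidence, generate):
    """slides: list of (image, ocr_tokens, layout_boxes) triples."""
    # 1. Encode each slide jointly over visual, textual, and layout features.
    vecs = [encode_slide(*s) for s in slides]
    # 2. Binary evidence selection keeps slides predicted to hold the answer.
    evidence = [v for v in vecs if is_evidence(v, question)]
    # 3. The seq2seq decoder attends over selected evidence. Numerical
    #    questions yield an explicit arithmetic expression, evaluated here.
    output = generate(evidence, question)
    if re.fullmatch(r"[\d\s()+\-*/.]+", output):
        return str(eval(output))
    return output

slides = [("img1", ["Q1", "300"], [(0, 0, 1, 1)]),
          ("img2", ["Q2", "450"], [(0, 0, 1, 1)])]
enc = lambda img, toks, boxes: toks
sel = lambda vec, q: True
assert answer_deck(slides, "How many more in Q2?", enc, sel,
                   lambda ev, q: "450 - 300") == "150"
```

Textual answers pass through unchanged; only outputs that are pure arithmetic expressions are evaluated, matching the dataset's pairing of numerical answers with explicit expressions.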
3.2. Pathology SlideVQA (WSI-VQA, SlideChat)
WSI-VQA
The Wsi2Text Transformer (W2T) (Chen et al., 2024) adapts the encoder–decoder Transformer paradigm for giga-pixel WSI input:
- Visual extractor: A frozen CNN or ViT (e.g., ResNet-50, ViT-S, DINO ViT-S, HIPT) produces patch embeddings.
- Text encoder: PubMedBert/BioClinicalBert or end-to-end learned embeddings.
- Contextual Transformer layers: Multi-head self-attention over all patches followed by decoder-side cross-attention fusing question and WSI features.
- Answer generation: Word-by-word generative output; co-attention maps available for interpretability.
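The decoder-side fusion of question tokens with WSI patch embeddings can be illustrated with a minimal single-head cross-attention, written in pure Python for clarity. This is a simplification, not the actual multi-head W2T implementation; the attention weights correspond to the co-attention maps used for interpretability:

```python
import math

# Minimal single-head cross-attention: each question token attends over
# all WSI patch embeddings (scaled dot-product + softmax).
def cross_attention(question_emb, patch_emb):
    """question_emb: list of d-dim vectors; patch_emb: list of d-dim vectors."""
    d = len(question_emb[0])
    fused = []
    for q in question_emb:
        # Scaled dot-product scores of this question token against every patch.
        scores = [sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(d)
                  for p in patch_emb]
        m = max(scores)                          # numerically stable softmax
        exp = [math.exp(s - m) for s in scores]
        z = sum(exp)
        weights = [e / z for e in exp]           # co-attention map for this token
        fused.append([sum(w * p[i] for w, p in zip(weights, patch_emb))
                      for i in range(d)])
    return fused

q = [[0.1] * 8, [0.2] * 8]                   # 2 question tokens, d = 8
p = [[float(i % 3)] * 8 for i in range(50)]  # 50 patch embeddings
out = cross_attention(q, p)
assert len(out) == 2 and len(out[0]) == 8
```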
SlideChat
SlideChat (Chen et al., 2024) employs a two-stage architecture:
- Patch-level encoder: Frozen ViT-style image encoder (CONCH) on 224×224 patches.
- Slide-level encoder: LongNet with sparse attention over up to ∼10⁴ tokens.
- Multimodal projector: Linear mapping of visual tokens to LLM input space.
- LLM: Qwen2.5-7B-Instruct.
- Training: Two-stage; first for captioning (visual tokens only), then full VQA (visual tokens + question).
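The forward pass of this architecture can be sketched as a simple composition; every component below is a placeholder callable, not the real CONCH, LongNet, or Qwen2.5 model:

```python
# Schematic SlideChat-style forward pass: patch encoding -> slide-level
# contextualization -> linear projection -> LLM generation.
def slidechat_forward(patches, question, patch_encoder, slide_encoder,
                      projector, llm):
    # Frozen patch-level encoder over 224x224 tiles.
    patch_tokens = [patch_encoder(p) for p in patches]   # up to ~10^4 tokens
    # Slide-level encoder (sparse attention) contextualizes all patch tokens.
    slide_tokens = slide_encoder(patch_tokens)
    # Linear projector maps visual tokens into the LLM's input embedding space.
    visual_inputs = [projector(t) for t in slide_tokens]
    # The LLM consumes projected visual tokens plus the question text.
    return llm(visual_inputs, question)

# Toy components to show the data flow only:
ans = slidechat_forward(
    ["tile"] * 4, "What subtype is shown?",
    patch_encoder=lambda p: [1.0],
    slide_encoder=lambda toks: toks,
    projector=lambda t: t,
    llm=lambda v, q: f"{len(v)} visual tokens + question received",
)
assert ans == "4 visual tokens + question received"
```

The two-stage training regime corresponds to freezing or unfreezing components of this composition: captioning first optimizes the projector on visual tokens alone, then full VQA training adds the question input.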
| Framework | Domain | Visual Encoder | Context Modeling | Answering |
|---|---|---|---|---|
| SlideVQA | Document | Faster R-CNN | H-LayoutLMv2, FiD | Seq2seq, LaTeX |
| WSI-VQA | Pathology | CNN/ViT, frozen | Transformer, co-attn | Generative |
| SlideChat | Pathology | CONCH (ViT) | LongNet slide encoder, LLM | LLM output |
4. Evaluation Protocols and Empirical Results
4.1. Metrics
- Exact Match (EM): Binary accuracy requiring an exact string match with the gold answer.
- F1: Token overlap between predicted and gold answers.
- BLEU, ROUGE-L, METEOR: Fluency and overlap for generative outputs.
- Factₑₙₜ: Clinical entity-level match (WSI-VQA).
- QA accuracy: Proportion of answers correct (closed-set VQA).
- Object Detection AP: For document datasets by region type.
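EM and token-level F1 can be made concrete with short reference implementations. The normalization below (lowercasing, stripping punctuation) is one common convention in short-answer VQA evaluation, not necessarily the exact scheme used by each benchmark:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, split into tokens (one common convention)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(pred, gold):
    """1.0 iff normalized prediction and gold are identical token sequences."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall via multiset overlap."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

assert exact_match("The answer", "the answer!") == 1.0
assert abs(token_f1("revenue growth 12%", "12% growth") - 0.8) < 1e-6
```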
4.2. Document SlideVQA (2023)
On SlideVQA (Tanaka et al., 2023), best models achieve:
- EM ≈ 45% and F1 ≈ 52% (H-LayoutLMv2 + FiD), exceeding single-image baselines by 4–6 points.
- Evidence selection F1: ≈87.7% with robust OCR; substantial drop with poorer OCR (e.g., Tesseract).
- Object detection AP: Highest for “Title” (87%), “Page-text” (77%); lowest for “Obj-text” (30%).
A pronounced human–model gap persists: annotators reach EM ≈ 85%, F1 ≈ 90%, indicating model performance is ~35–40 points behind expert-level capability.
4.3. Pathology SlideVQA
WSI-VQA
- Closed-ended ACC: 43.3% (ResNet+Scratch), up to 50.8% with DINO ViT-S pretraining.
- Factₑₙₜ: 91.1–92.0% (clinical entity match).
- Task-specific F1: Subtyping 85.2%, PR-status 86.5%, survival regression c-index 59.7% (cf. best discriminative MIL baseline 80.6%).
SlideChat / SlideBench
- TCGA overall accuracy: 81.17% (SlideChat) (Chen et al., 2024).
- BCNB (zero-shot) accuracy: 54.15%.
- MedDr and GPT-4 baselines: 67.7% and 37.3% on TCGA, respectively.
- SlideChat achieves 68–88% on diagnostic sub-tasks, 74–91% in clinical categories; accuracy on ER/PR/HER2 in BCNB remains lower (e.g. HER2: 25%).
| Model | TCGA Overall ACC | BCNB ACC | Clinical F1 (WSI-VQA) |
|---|---|---|---|
| SlideChat | 81.2% | 54.1% | – |
| W2T (DINO+Scratch) | – | – | PR: 86.5%, subtyping: 85.2% |
| MedDr (baseline) | 67.7% | 33.7% | – |
5. Model Ablations, Limitations, and Failure Modes
5.1. Ablation Effects
Ablating the slide-level encoder in SlideChat results in a drop of more than 10 percentage points on BCNB, emphasizing the necessity of global context aggregation in medical SlideVQA. The DCI plug-in for LLaVA-NeXT-Interleave (Cuong et al., 13 Jun 2025), while beneficial for structured semantic benchmarks, reduces SlideVQA accuracy from 65.25% (fine-tuned) to 51.50% (fine-tuned with DCI), hypothesized to stem from the loss of fine-grained OCR/textual cues vital for slide parsing.
5.2. Error Analysis
- Document SlideVQA: Numeric OCR confusions (“I0” vs. “10”), failure with >2-hop reasoning, ambiguous or poorly formatted tables, very small region text (Obj-text AP ~30%).
- Pathology SlideVQA: Lowered performance on BCNB HER2/PR-related questions; IHC specificity limitations; deployment constraints due to memory/throughput; generative models may hallucinate or produce plausible but incorrect answers.
- Overfitting risk: High-dimensional projection layers with limited training (as with DCI) may overfit on small SlideVQA datasets (Cuong et al., 13 Jun 2025).
A plausible implication is that models fusing mid-level vision features indiscriminately (rather than selectively attending to character/text-level details) may trade off structural/semantic coherence for fine-grained recognition, which is particularly damaging for numerically and textually intensive document SlideVQA.
6. Broader Implications and Future Directions
SlideVQA research demonstrates the feasibility and emerging success of multi-modal, multi-level question answering over slide-based documents and images for both general document analysis and digital pathology. The integration of OCR, vision transformers, LLMs, and attention-based context fusion underpins empirical advances across SlideVQA benchmarks. Notable limitations remain:
- Symbolic reasoning for explicit arithmetic and complex multi-hop chains is imperfect.
- Visual decoding granularity (trade-off between global context and fine-detail) may not be optimally balanced by current architectures.
- Scale and diversity of datasets remain a bottleneck for generalization, especially across institutions and rare clinical scenarios.
- Model explainability, interpretability, and regulatory robustness are essential for clinical adoption; attention maps and co-attention overlays are first steps but do not fully resolve black-box concerns.
Ongoing work is exploring:
- More robust multimodal pretraining across multiple slide types.
- Hybrid approaches combining sparse textual connectors with dense visual fusion.
- End-to-end trainable visual encoders and scaling to sequences beyond 10⁹ tokens.
- Lightweight or memory-optimized architectures for deployment beyond high-end computing environments.
SlideVQA's intersection across document intelligence, computer vision, natural language processing, and medical AI continues to drive advances in large-scale information extraction and user-facing VQA systems (Tanaka et al., 2023; Chen et al., 2024; Chen et al., 2024; Cuong et al., 13 Jun 2025).