MicroVQA++: Advanced VQA for Microscopy Imaging
- The paper introduces MicroVQA++, a novel corpus built via expert-validated figure-caption pairs, graph-based quality filtering, and MLLM-driven MCQ generation to enhance scientific reasoning.
- MicroVQA++ employs a heterogeneous HiCQA-Graph that fuses CLIP alignment, NLI entailment, and agent-derived signals, retaining the top 75% of QA pairs to reduce noise.
- Experimental results demonstrate that supervised fine-tuning on MicroVQA++ enables open-source models to match closed-source systems, establishing a new open-model state of the art.
MicroVQA++ is a large-scale, high-quality visual question answering (VQA) corpus targeting advanced scientific reasoning over microscopy images. Developed to address the paucity of challenging benchmarks for multimodal LLMs (MLLMs) in biomedical imaging, MicroVQA++ introduces new methodologies for dataset construction, graph-based quality filtering, and rigorous evaluation, yielding a corpus that surpasses prior resources such as MicroVQA in both size and difficulty (Li et al., 14 Nov 2025).
1. Dataset Construction Pipeline
MicroVQA++ is built through a three-stage process designed to maximize scale, relevance, and scientific rigor.
Stage 1: Bootstrapping from Expert-Validated Figure–Caption Pairs.
The corpus originates from BIOMEDICA (Lozano et al. 2025), a PubMed-derived archive containing approximately 24 million image–caption pairs, with ∼2.5M classified as microscopy images accompanied by expert-written captions. An MLLM agent extracts factual answer spans from each caption and generates questions that reference the paired image and fall into one of three reasoning "capacities": Expert Visual Understanding (EU), Hypothesis Generation (HG), or Experiment Proposal (EP). This stage produces approximately 30,000 QA pairs associated with ∼13,000 unique microscopy images.
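As a concrete illustration of this stage, the sketch below shows one way such an agent loop could be wired up; the prompt wording, JSON schema, and the `call_mllm` helper are assumptions for illustration, not the authors' implementation.

```python
# Illustrative Stage 1 QA bootstrapping (hypothetical prompt and helper, not the authors' code).
import json
from dataclasses import dataclass

CAPACITIES = ("EU", "HG", "EP")  # Expert Visual Understanding, Hypothesis Generation, Experiment Proposal

@dataclass
class QAPair:
    image_id: str
    question: str
    answer: str
    capacity: str

PROMPT = (
    "You are given a microscopy figure and its expert-written caption.\n"
    "1. Extract a short factual answer span from the caption.\n"
    "2. Write a question about the image whose answer is that span.\n"
    "3. Label the question with one capacity: EU, HG, or EP.\n"
    "Caption: {caption}\n"
    'Respond as JSON: {{"question": ..., "answer": ..., "capacity": ...}}'
)

def bootstrap_qa(image_id: str, image_bytes: bytes, caption: str, call_mllm) -> QAPair | None:
    """Generate one QA pair from a figure-caption pair; returns None on malformed output."""
    raw = call_mllm(image=image_bytes, prompt=PROMPT.format(caption=caption))
    try:
        parsed = json.loads(raw)
        if parsed["capacity"] not in CAPACITIES:
            return None
        return QAPair(image_id, parsed["question"], parsed["answer"], parsed["capacity"])
    except (json.JSONDecodeError, KeyError):
        return None
```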
Stage 2: HiCQA-Graph Filtering.
A heterogeneous graph is constructed containing nodes for images (I), captions (C), and QA pairs (Q), with edges encoding their relationships (e.g., description, support, similarity). The HiCQA-Graph architecture ranks QA pairs for cross-modal consistency through learned scoring, using signals from CLIP-based vision-language alignment, natural language inference (NLI) entailment, and agent-derived annotations. The top 75% of QA samples are retained (∼23,000 pairs over ∼10,000 images), eliminating a significant fraction of hallucinated or misaligned data.
Stage 3: MLLM-Driven MCQ Generation and Human Screening.
Each filtered (question, answer) pair is transformed into a four-option multiple-choice question (MCQ) by a second MLLM agent, which generates three distractors and a chain-of-thought (CoT) rationale. The dataset is split into a 20,000-example training set and a 6,000-example human-checked test set. All test items are manually screened, and the test set has no overlap with MicroVQA.
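A minimal sketch of this MCQ assembly step follows; the distractor prompt and the `call_mllm` helper are again illustrative assumptions.

```python
# Illustrative Stage 3 MCQ assembly: a filtered (question, answer) pair becomes a
# four-option MCQ with a chain-of-thought rationale (prompt and helper are hypothetical).
import json
import random

def to_mcq(question: str, answer: str, image_bytes: bytes, call_mllm, seed: int = 0) -> dict:
    prompt = (
        f"Question: {question}\nCorrect answer: {answer}\n"
        "Write three plausible but incorrect distractors and a brief chain-of-thought "
        "rationale for the correct answer. Respond as JSON with keys "
        "'distractors' (list of 3 strings) and 'rationale' (string)."
    )
    out = json.loads(call_mllm(image=image_bytes, prompt=prompt))
    options = [answer] + list(out["distractors"])[:3]
    random.Random(seed).shuffle(options)           # fixed seed -> reproducible option order
    return {
        "question": question,
        "options": options,
        "answer_index": options.index(answer),     # gold option after shuffling
        "rationale": out["rationale"],
    }
```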
2. HiCQA-Graph: Architecture and Filtering Mechanism
HiCQA-Graph is a heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with disjoint node types $\mathcal{V} = \mathcal{V}_I \cup \mathcal{V}_C \cup \mathcal{V}_Q$, representing images, captions, and QA items, respectively.
Node Features:
- Image nodes: CLIP visual embedding $v_I$; similarity with the paired caption $s_{IC}$, normalized as $\tilde{s}_{IC}$; feature vector $x_I = [\, v_I \,\|\, \tilde{s}_{IC} \,]$.
- Caption nodes: CLIP text embedding $t_C$ (long captions optionally summarized before encoding).
- QA nodes: text embedding $t_Q$ of the concatenated question and answer; similarity with the paired image $s_{IQ}$, normalized as $\tilde{s}_{IQ}$; feature $x_Q = [\, t_Q \,\|\, \tilde{s}_{IQ} \,]$.
Edge Types (assembled into a single heterogeneous graph; see the construction sketch after this list):
- (I→C) “describe_by” (unweighted)
- (I→Q) “asked_about” (unweighted)
- (C→Q) “supports”, weighted by the NLI entailment probability of the answer given the caption
- (Q→Q) “similar”, weighted by a QA–QA similarity attribute
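The node and edge inventory above can be assembled, for example, with PyTorch Geometric's `HeteroData`; the sketch below follows the node and relation names from the paper's description, while tensor shapes and the precomputed features are placeholders.

```python
# Minimal sketch of assembling the HiCQA-Graph as a PyTorch Geometric HeteroData object.
# The precomputed feature and edge tensors are assumed inputs.
import torch
from torch_geometric.data import HeteroData

def build_hicqa_graph(img_feat, cap_feat, qa_feat,
                      ic_edges, iq_edges, cq_edges, cq_nli, qq_edges, qq_sim):
    data = HeteroData()

    # Node features: CLIP embeddings (plus normalized similarity scores for I and Q nodes).
    data["image"].x = img_feat          # [num_images, d_img]
    data["caption"].x = cap_feat        # [num_captions, d_txt]
    data["qa"].x = qa_feat              # [num_qa, d_qa]

    # Unweighted structural edges.
    data["image", "describe_by", "caption"].edge_index = ic_edges   # [2, E_ic]
    data["image", "asked_about", "qa"].edge_index = iq_edges        # [2, E_iq]

    # Weighted semantic edges.
    data["caption", "supports", "qa"].edge_index = cq_edges         # [2, E_cq]
    data["caption", "supports", "qa"].edge_attr = cq_nli            # NLI entailment prob, [E_cq, 1]
    data["qa", "similar", "qa"].edge_index = qq_edges               # [2, E_qq]
    data["qa", "similar", "qa"].edge_attr = qq_sim                  # QA-QA similarity, [E_qq, 1]
    return data
```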
Supervision:
- Each QA node has a capacity label (EU, HG, or EP).
- Each QA node also carries a soft "keep" score in $[0, 1]$, derived from the alignment and entailment signals described above, which serves as the filtering target.
Graph Neural Network (a minimal sketch follows this list):
- Feature projection into a common hidden space.
- Two heterogeneous convolution layers:
  - On (I→C) and (I→Q) edges, GraphSAGE updates: $h_v^{(\ell+1)} = \sigma\left(W_1 h_v^{(\ell)} + W_2 \cdot \mathrm{mean}_{u \in \mathcal{N}(v)} h_u^{(\ell)}\right)$.
  - On (C→Q) and (Q→Q) edges, GATv2: attention weights computed with a softmax over neighbors; outputs combined with LayerNorm and residual connections.
- Two MLP heads per QA node outputting logits for "keep" and "capacity" (cross-entropy loss).
- Filtering is performed by retaining the top 75% of QA nodes ranked by “keep” probability.
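A minimal sketch of such a two-layer heterogeneous GNN and the top-75% filtering step is shown below, assuming PyTorch Geometric; hidden sizes, LayerNorm placement, head widths, and the omission of edge-weight handling are simplifying assumptions rather than the reported configuration.

```python
# Hedged sketch of the two-layer heterogeneous GNN and top-75% QA filtering.
# Layer/edge assignments follow the description above (GraphSAGE on I->C / I->Q,
# GATv2 on C->Q / Q->Q); everything else is an assumption.
import torch
import torch.nn as nn
from torch_geometric.nn import HeteroConv, SAGEConv, GATv2Conv

class HiCQAGNN(nn.Module):
    def __init__(self, hidden: int = 256, num_capacities: int = 3):
        super().__init__()
        self.proj = nn.ModuleDict({        # project each node type into a shared hidden space
            "image": nn.LazyLinear(hidden),
            "caption": nn.LazyLinear(hidden),
            "qa": nn.LazyLinear(hidden),
        })
        def layer():
            return HeteroConv({
                ("image", "describe_by", "caption"): SAGEConv((-1, -1), hidden),
                ("image", "asked_about", "qa"): SAGEConv((-1, -1), hidden),
                ("caption", "supports", "qa"): GATv2Conv((-1, -1), hidden, add_self_loops=False),
                ("qa", "similar", "qa"): GATv2Conv((-1, -1), hidden, add_self_loops=False),
            }, aggr="sum")
        self.conv1, self.conv2 = layer(), layer()
        self.norm = nn.ModuleDict({t: nn.LayerNorm(hidden) for t in ("caption", "qa")})
        self.keep_head = nn.Linear(hidden, 2)                    # "keep" vs "discard" logits
        self.capacity_head = nn.Linear(hidden, num_capacities)   # EU / HG / EP logits

    def forward(self, data):
        x = {t: self.proj[t](data[t].x) for t in ("image", "caption", "qa")}
        for conv in (self.conv1, self.conv2):
            out = conv(x, data.edge_index_dict)
            # Residual + LayerNorm on updated node types; image nodes pass through unchanged.
            x = {**x, **{t: self.norm[t](torch.relu(h) + x[t]) for t, h in out.items()}}
        return self.keep_head(x["qa"]), self.capacity_head(x["qa"])

def filter_top_fraction(keep_logits: torch.Tensor, frac: float = 0.75) -> torch.Tensor:
    """Return indices of QA nodes ranked in the top `frac` by keep probability."""
    keep_prob = keep_logits.softmax(dim=-1)[:, 1]
    k = int(frac * keep_prob.numel())
    return keep_prob.topk(k).indices
```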
3. Corpus Statistics and Quality Characteristics
MicroVQA++ exceeds existing microscopy-centric VQA datasets in both scale and reasoning diversity.
| Split | # MCQs | # Unique Images | Human-checked | MCQ Format | Chain of Thought |
|---|---|---|---|---|---|
| Training | 20,000 | 8,594 | No | Yes | Yes |
| Test | 6,000 | 5,198 | Yes | Yes | Yes |
Bloom’s Taxonomy Distribution:
- Level 1 (Remember): 10%
- Level 2 (Understand/Apply): 45%
- Level 3 (Analyze): 25%
- Level 4 (Evaluate): 12%
- Level 5 (Create): 8%
Compared to MicroVQA (1,042 Qs over 255 images, less than 15% at levels 3–5), MicroVQA++ exhibits increased size and complexity, with a larger proportion of items requiring higher-order scientific reasoning.
4. Experimental Protocols and Performance Evaluation
Models Evaluated:
- Closed-source: GPT-4o, GPT-4o-mini, o1-series, Claude Sonnet 4.5, o4-mini, o3, GPT-5.
- Open-source: LLaVA-Med-Mistral-7B, Qwen-2-VL-7B, InternVL3.5-2B-Instruct, InternVL3.5-4B-Instruct.
Fine-tuning (SFT) Details:
- LoRA rank 16, applied to all attention and MLP layers.
- AdamW optimizer: learning rate 1e-4, weight decay 0.1, cosine schedule, 5% warm-up, 3 epochs, bfloat16 precision, gradient clipping at 1.0, context window 4096, batch size 256 (expressed as a configuration sketch below).
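For concreteness, the reported hyperparameters can be written as Hugging Face `peft`/`transformers` configs as sketched below; the target-module names, `lora_alpha`, and the per-device batch/accumulation split are assumptions, while the quoted numbers come from the list above.

```python
# Hedged sketch of the reported SFT hyperparameters as peft / transformers configs.
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16,                                # LoRA rank 16
    lora_alpha=32,                       # assumption: not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # all attention layers
                    "gate_proj", "up_proj", "down_proj"],     # all MLP layers (naming is model-dependent)
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="microvqa_pp_sft",
    learning_rate=1e-4,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                   # 5% warm-up
    num_train_epochs=3,
    bf16=True,
    max_grad_norm=1.0,                   # gradient clipping at 1.0
    per_device_train_batch_size=8,       # assumption: 8 x 32 accumulation = effective batch 256
    gradient_accumulation_steps=32,
)
# The 4096-token context window is enforced on the tokenizer/model side (max_length=4096).
```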
Alternative Protocol:
Group Relative Policy Optimization (GRPO) on the MCQ data: learning rate 5e-6, KL coefficient 0.05, entropy bonus 0.001, value loss scale 0.5, 1 epoch.
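The core of GRPO is a group-relative advantage obtained by normalizing rewards within each group of sampled answers to the same question; the generic sketch below (not the authors' training code) uses a 0/1 MCQ correctness reward.

```python
# Generic illustration of GRPO's group-relative advantage for MCQ rewards.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_questions, G], e.g. 1.0 if a sampled answer picks the gold option, else 0.0."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # per-group normalization used in the policy update
```

The reported settings (learning rate 5e-6, KL coefficient 0.05, entropy bonus 0.001, value loss scale 0.5, 1 epoch) then parameterize the surrounding RL loop.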
Evaluation Metric:
Average MCQ accuracy, per capacity and aggregated.
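A small sketch of this metric, computing per-capacity and aggregated accuracy over predicted option indices (field names are illustrative):

```python
# Per-capacity and overall MCQ accuracy.
from collections import defaultdict

def mcq_accuracy(predictions, labels, capacities):
    """predictions/labels: option indices; capacities: 'EU'/'HG'/'EP' per item."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, cap in zip(predictions, labels, capacities):
        correct[cap] += int(pred == gold)
        total[cap] += 1
    per_capacity = {cap: correct[cap] / total[cap] for cap in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_capacity, overall
```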
Key Results:
| Model | Eval set | Accuracy (%) |
|---|---|---|
| Random | MicroVQA | 22.0 |
| Human | MicroVQA | 50.3 |
| GPT-5 | MicroVQA | 59.4 |
| o3 | MicroVQA | 59.3 |
| InternVL3.5-4B (zero-shot) | MicroVQA | 46.6 |
| InternVL3.5-4B (SFT) | MicroVQA | 59.4 |
| InternVL3.5-2B (SFT) | MicroVQA | 54.5 |
| GPT-4o-mini | MicroVQA++ | 37.3 |
| LLaVA-Med-7B (SFT) | MicroVQA++ | 45.3 |
| InternVL3.5-4B (SFT) | MicroVQA++ | 41.3 |
Filtering Ablation (InternVL3.5-4B, MicroVQA):
- No filter: 58.2%
- NLI only: ~57.6%
- CLIP only: ~57.6%
- NCLIP: 58.2%
- HiCQA-Graph: 59.4% (+1.2% absolute)
This suggests that HiCQA-Graph filtering confers measurable gains in downstream model performance compared to single-signal or naively combined filters.
5. Principal Findings and Methodological Insights
- Literature-grounded QA generation with graph-based filtering yields a microscopy VQA corpus (26,000 MCQs) that substantially surpasses prior datasets in both scale and cognitive challenge.
- HiCQA-Graph fuses NLI entailment, CLIP-based alignment, and agent signals, effectively removing ∼25% of noisy or hallucinated data and delivering 1–4% absolute performance improvement for state-of-the-art open-source MLLMs.
- After supervised fine-tuning on MicroVQA++, a 4B-parameter MLLM (InternVL3.5-4B) matches GPT-5 performance on MicroVQA, establishing a new open-model state of the art.
- MCQ-format SFT is demonstrated to benefit model calibration and performance more reliably than free-form QA fine-tuning; further gains are possible using GRPO.
- The dataset is explicitly designed to address higher-order scientific reasoning, evidenced by an increased share of questions in Bloom’s “Analyze/Create” categories.
6. Limitations and Directions for Future Work
MicroVQA++ remains dependent on the ability of an initial MLLM agent to extract QA pairs, risking the incorporation of agent-induced biases or shortcuts from its training data. Summarization is used to process overlong captions, which may truncate essential experimental details. The current design focuses on a static MCQ format; future iterations could extend to interactive, multimodal chain-of-thought reasoning, richer graph-based cross-sample filtering, and broader application to other biomedical imaging domains (Li et al., 14 Nov 2025).