MicroVQA++: Advanced VQA for Microscopy Imaging
- The paper introduces MicroVQA++, a novel corpus built via expert-validated figure-caption pairs, graph-based quality filtering, and MLLM-driven MCQ generation to enhance scientific reasoning.
- MicroVQA++ employs a heterogeneous HiCQA-Graph that fuses CLIP alignment, NLI entailment, and agent-derived signals, retaining the top 75% of QA pairs to reduce noise.
- Experimental results demonstrate that supervised fine-tuning on MicroVQA++ enables open-source models to match closed-source systems, establishing a new open-model state of the art.
MicroVQA++ is a large-scale, high-quality visual question answering (VQA) corpus targeting advanced scientific reasoning over microscopy images. Developed to address the paucity of challenging benchmarks for multimodal LLMs (MLLMs) in biomedical imaging, MicroVQA++ introduces new methodologies for dataset construction, graph-based quality filtering, and rigorous evaluation, yielding a corpus that surpasses prior resources such as MicroVQA in both size and difficulty (Li et al., 14 Nov 2025).
1. Dataset Construction Pipeline
MicroVQA++ is built through a three-stage process designed to maximize scale, relevance, and scientific rigor.
Stage 1: Bootstrapping from Expert-Validated Figure–Caption Pairs.
The corpus originates from BIOMEDICA (Lozano et al. 2025), a PubMed-derived archive containing approximately 24 million image–caption pairs, with ∼2.5M classified as microscopy images accompanied by expert-written captions. An MLLM agent extracts factual answer spans from each caption and generates questions that reference the paired image and fall into one of three reasoning "capacities": Expert Visual Understanding (EU), Hypothesis Generation (HG), or Experiment Proposal (EP). This stage produces approximately 30,000 QA pairs associated with ∼13,000 unique microscopy images.
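As a concrete illustration of this stage, the sketch below shows one way such an agent loop could be wired up; the prompt wording, JSON schema, and the `call_mllm` helper are assumptions for illustration, not the authors' implementation.

```python
# Illustrative Stage 1 QA bootstrapping (hypothetical prompt and helper, not the authors' code).
import json
from dataclasses import dataclass

CAPACITIES = ("EU", "HG", "EP")  # Expert Visual Understanding, Hypothesis Generation, Experiment Proposal

@dataclass
class QAPair:
    image_id: str
    question: str
    answer: str
    capacity: str

PROMPT = (
    "You are given a microscopy figure and its expert-written caption.\n"
    "1. Extract a short factual answer span from the caption.\n"
    "2. Write a question about the image whose answer is that span.\n"
    "3. Label the question with one capacity: EU, HG, or EP.\n"
    "Caption: {caption}\n"
    'Respond as JSON: {{"question": ..., "answer": ..., "capacity": ...}}'
)

def bootstrap_qa(image_id: str, image_bytes: bytes, caption: str, call_mllm) -> QAPair | None:
    """Generate one QA pair from a figure-caption pair; returns None on malformed output."""
    raw = call_mllm(image=image_bytes, prompt=PROMPT.format(caption=caption))
    try:
        parsed = json.loads(raw)
        if parsed["capacity"] not in CAPACITIES:
            return None
        return QAPair(image_id, parsed["question"], parsed["answer"], parsed["capacity"])
    except (json.JSONDecodeError, KeyError):
        return None
```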
Stage 2: HiCQA-Graph Filtering.
A heterogeneous graph is constructed containing nodes for images (I), captions (C), and QA pairs (Q), with edges encoding their relationships (e.g., description, support, similarity). The HiCQA-Graph architecture ranks QA pairs for cross-modal consistency through learned scoring, using signals from CLIP-based vision-language alignment, natural language inference (NLI) entailment, and agent-derived annotations. The top 75% of QA samples are retained (∼23,000 pairs over ∼10,000 images), eliminating a significant fraction of hallucinated or misaligned data.
Stage 3: MLLM-Driven MCQ Generation and Human Screening.
Each filtered (question, answer) pair is transformed into a four-option multiple-choice question (MCQ) by a second MLLM agent, which generates three distractors and a chain-of-thought (CoT) rationale. The dataset is split into a 20,000-example training set and a 6,000-example human-checked test set. All test items are manually screened, and the test set has no overlap with MicroVQA.
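A minimal sketch of this MCQ assembly step follows; the distractor prompt and the `call_mllm` helper are again illustrative assumptions.

```python
# Illustrative Stage 3 MCQ assembly: a filtered (question, answer) pair becomes a
# four-option MCQ with a chain-of-thought rationale (prompt and helper are hypothetical).
import json
import random

def to_mcq(question: str, answer: str, image_bytes: bytes, call_mllm, seed: int = 0) -> dict:
    prompt = (
        f"Question: {question}\nCorrect answer: {answer}\n"
        "Write three plausible but incorrect distractors and a brief chain-of-thought "
        "rationale for the correct answer. Respond as JSON with keys "
        "'distractors' (list of 3 strings) and 'rationale' (string)."
    )
    out = json.loads(call_mllm(image=image_bytes, prompt=prompt))
    options = [answer] + list(out["distractors"])[:3]
    random.Random(seed).shuffle(options)           # fixed seed -> reproducible option order
    return {
        "question": question,
        "options": options,
        "answer_index": options.index(answer),     # gold option after shuffling
        "rationale": out["rationale"],
    }
```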
2. HiCQA-Graph: Architecture and Filtering Mechanism
HiCQA-Graph is a heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with disjoint node types $\mathcal{V} = \mathcal{V}_I \cup \mathcal{V}_C \cup \mathcal{V}_Q$, representing images, captions, and QA items, respectively.
Node Features:
- Image nodes: CLIP visual embedding $v_I$; similarity with the paired caption $s_{IC}$, normalized as $\tilde{s}_{IC}$; feature vector $x_I = [\, v_I \,\|\, \tilde{s}_{IC} \,]$.
- Caption nodes: CLIP text embedding $t_C$ (long captions optionally summarized before encoding).
- QA nodes: text embedding $t_Q$ of the concatenated question and answer; similarity with the paired image $s_{IQ}$, normalized as $\tilde{s}_{IQ}$; feature $x_Q = [\, t_Q \,\|\, \tilde{s}_{IQ} \,]$.
Edge Types (assembled into a single heterogeneous graph; see the construction sketch after this list):
- (I→C) “describe_by” (unweighted)
- (I→Q) “asked_about” (unweighted)
- (C→Q) “supports”, weighted by the NLI entailment probability of the answer given the caption
- (Q→Q) “similar”, weighted by a QA–QA similarity attribute
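The node and edge inventory above can be assembled, for example, with PyTorch Geometric's `HeteroData`; the sketch below follows the node and relation names from the paper's description, while tensor shapes and the precomputed features are placeholders.

```python
# Minimal sketch of assembling the HiCQA-Graph as a PyTorch Geometric HeteroData object.
# The precomputed feature and edge tensors are assumed inputs.
import torch
from torch_geometric.data import HeteroData

def build_hicqa_graph(img_feat, cap_feat, qa_feat,
                      ic_edges, iq_edges, cq_edges, cq_nli, qq_edges, qq_sim):
    data = HeteroData()

    # Node features: CLIP embeddings (plus normalized similarity scores for I and Q nodes).
    data["image"].x = img_feat          # [num_images, d_img]
    data["caption"].x = cap_feat        # [num_captions, d_txt]
    data["qa"].x = qa_feat              # [num_qa, d_qa]

    # Unweighted structural edges.
    data["image", "describe_by", "caption"].edge_index = ic_edges   # [2, E_ic]
    data["image", "asked_about", "qa"].edge_index = iq_edges        # [2, E_iq]

    # Weighted semantic edges.
    data["caption", "supports", "qa"].edge_index = cq_edges         # [2, E_cq]
    data["caption", "supports", "qa"].edge_attr = cq_nli            # NLI entailment prob, [E_cq, 1]
    data["qa", "similar", "qa"].edge_index = qq_edges               # [2, E_qq]
    data["qa", "similar", "qa"].edge_attr = qq_sim                  # QA-QA similarity, [E_qq, 1]
    return data
```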
Supervision:
- Each QA node has a capacity label (EU, HG, or EP).
- Each QA node also carries a soft "keep" score in $[0, 1]$, derived from the alignment and entailment signals described above, which serves as the filtering target.
Graph Neural Network (a minimal sketch follows this list):
- Feature projection into a common hidden space.
- Two heterogeneous convolution layers:
  - On (I→C) and (I→Q) edges, GraphSAGE updates: $h_v^{(\ell+1)} = \sigma\left(W_1 h_v^{(\ell)} + W_2 \cdot \mathrm{mean}_{u \in \mathcal{N}(v)} h_u^{(\ell)}\right)$.
  - On (C→Q) and (Q→Q) edges, GATv2: attention weights computed with a softmax over neighbors; outputs combined with LayerNorm and residual connections.
- Two MLP heads per QA node outputting logits for "keep" and "capacity" (cross-entropy loss).
- Filtering is performed by retaining the top 75% of QA nodes ranked by “keep” probability.
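A minimal sketch of such a two-layer heterogeneous GNN and the top-75% filtering step is shown below, assuming PyTorch Geometric; hidden sizes, LayerNorm placement, head widths, and the omission of edge-weight handling are simplifying assumptions rather than the reported configuration.

```python
# Hedged sketch of the two-layer heterogeneous GNN and top-75% QA filtering.
# Layer/edge assignments follow the description above (GraphSAGE on I->C / I->Q,
# GATv2 on C->Q / Q->Q); everything else is an assumption.
import torch
import torch.nn as nn
from torch_geometric.nn import HeteroConv, SAGEConv, GATv2Conv

class HiCQAGNN(nn.Module):
    def __init__(self, hidden: int = 256, num_capacities: int = 3):
        super().__init__()
        self.proj = nn.ModuleDict({        # project each node type into a shared hidden space
            "image": nn.LazyLinear(hidden),
            "caption": nn.LazyLinear(hidden),
            "qa": nn.LazyLinear(hidden),
        })
        def layer():
            return HeteroConv({
                ("image", "describe_by", "caption"): SAGEConv((-1, -1), hidden),
                ("image", "asked_about", "qa"): SAGEConv((-1, -1), hidden),
                ("caption", "supports", "qa"): GATv2Conv((-1, -1), hidden, add_self_loops=False),
                ("qa", "similar", "qa"): GATv2Conv((-1, -1), hidden, add_self_loops=False),
            }, aggr="sum")
        self.conv1, self.conv2 = layer(), layer()
        self.norm = nn.ModuleDict({t: nn.LayerNorm(hidden) for t in ("caption", "qa")})
        self.keep_head = nn.Linear(hidden, 2)                    # "keep" vs "discard" logits
        self.capacity_head = nn.Linear(hidden, num_capacities)   # EU / HG / EP logits

    def forward(self, data):
        x = {t: self.proj[t](data[t].x) for t in ("image", "caption", "qa")}
        for conv in (self.conv1, self.conv2):
            out = conv(x, data.edge_index_dict)
            # Residual + LayerNorm on updated node types; image nodes pass through unchanged.
            x = {**x, **{t: self.norm[t](torch.relu(h) + x[t]) for t, h in out.items()}}
        return self.keep_head(x["qa"]), self.capacity_head(x["qa"])

def filter_top_fraction(keep_logits: torch.Tensor, frac: float = 0.75) -> torch.Tensor:
    """Return indices of QA nodes ranked in the top `frac` by keep probability."""
    keep_prob = keep_logits.softmax(dim=-1)[:, 1]
    k = int(frac * keep_prob.numel())
    return keep_prob.topk(k).indices
```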
3. Corpus Statistics and Quality Characteristics
MicroVQA++ exceeds existing microscopy-centric VQA datasets in both scale and reasoning diversity.
| Split | # MCQs | # Unique Images | Human-checked | MCQ Format | Chain of Thought |
|---|---|---|---|---|---|
| Training | 20,000 | 8,594 | No | Yes | Yes |
| Test | 6,000 | 5,198 | Yes | Yes | Yes |
Bloom’s Taxonomy Distribution:
- Level 1 (Remember): 10%
- Level 2 (Understand/Apply): 45%
- Level 3 (Analyze): 25%
- Level 4 (Evaluate): 12%
- Level 5 (Create): 8%
Compared to MicroVQA (1,042 Qs over 255 images, less than 15% at levels 3–5), MicroVQA++ exhibits increased size and complexity, with a larger proportion of items requiring higher-order scientific reasoning.
4. Experimental Protocols and Performance Evaluation
Models Evaluated:
- Closed-source: GPT-4o, GPT-4o-mini, o1-series, Claude Sonnet 4.5, o4-mini, o3, GPT-5.
- Open-source: LLaVA-Med-Mistral-7B, Qwen-2-VL-7B, InternVL3.5-2B-Instruct, InternVL3.5-4B-Instruct.
Fine-tuning (SFT) Details:
- LoRA rank 16, applied to all attention and MLP layers.
- AdamW optimizer: learning rate 1e-4, weight decay 0.1, cosine schedule, 5% warm-up, 3 epochs, bfloat16 precision, gradient clipping at 1.0, context window 4096, batch size 256 (expressed as a configuration sketch below).
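For concreteness, the reported hyperparameters can be written as Hugging Face `peft`/`transformers` configs as sketched below; the target-module names, `lora_alpha`, and the per-device batch/accumulation split are assumptions, while the quoted numbers come from the list above.

```python
# Hedged sketch of the reported SFT hyperparameters as peft / transformers configs.
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16,                                # LoRA rank 16
    lora_alpha=32,                       # assumption: not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # all attention layers
                    "gate_proj", "up_proj", "down_proj"],     # all MLP layers (naming is model-dependent)
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="microvqa_pp_sft",
    learning_rate=1e-4,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                   # 5% warm-up
    num_train_epochs=3,
    bf16=True,
    max_grad_norm=1.0,                   # gradient clipping at 1.0
    per_device_train_batch_size=8,       # assumption: 8 x 32 accumulation = effective batch 256
    gradient_accumulation_steps=32,
)
# The 4096-token context window is enforced on the tokenizer/model side (max_length=4096).
```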
Alternative Protocol:
Group Relative Policy Optimization (GRPO) on the MCQ data: learning rate 5e-6, KL coefficient 0.05, entropy bonus 0.001, value loss scale 0.5, 1 epoch.
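The core of GRPO is a group-relative advantage obtained by normalizing rewards within each group of sampled answers to the same question; the generic sketch below (not the authors' training code) uses a 0/1 MCQ correctness reward.

```python
# Generic illustration of GRPO's group-relative advantage for MCQ rewards.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_questions, G], e.g. 1.0 if a sampled answer picks the gold option, else 0.0."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # per-group normalization used in the policy update
```

The reported settings (learning rate 5e-6, KL coefficient 0.05, entropy bonus 0.001, value loss scale 0.5, 1 epoch) then parameterize the surrounding RL loop.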
Evaluation Metric:
Average MCQ accuracy, per capacity and aggregated.
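A small sketch of this metric, computing per-capacity and aggregated accuracy over predicted option indices (field names are illustrative):

```python
# Per-capacity and overall MCQ accuracy.
from collections import defaultdict

def mcq_accuracy(predictions, labels, capacities):
    """predictions/labels: option indices; capacities: 'EU'/'HG'/'EP' per item."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, cap in zip(predictions, labels, capacities):
        correct[cap] += int(pred == gold)
        total[cap] += 1
    per_capacity = {cap: correct[cap] / total[cap] for cap in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_capacity, overall
```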
Key Results:
| Model | Eval set | Accuracy (%) |
|---|---|---|
| Random | MicroVQA | 22.0 |
| Human | MicroVQA | 50.3 |
| GPT-5 | MicroVQA | 59.4 |
| o3 | MicroVQA | 59.3 |
| InternVL3.5-4B (zero-shot) | MicroVQA | 46.6 |
| InternVL3.5-4B (SFT) | MicroVQA | 59.4 |
| InternVL3.5-2B (SFT) | MicroVQA | 54.5 |
| GPT-4o-mini | MicroVQA++ | 37.3 |
| LLaVA-Med-7B (SFT) | MicroVQA++ | 45.3 |
| InternVL3.5-4B (SFT) | MicroVQA++ | 41.3 |
Filtering Ablation (InternVL3.5-4B, MicroVQA):
- No filter: 58.2%
- NLI only: ~57.6%
- CLIP only: ~57.6%
- NCLIP: 58.2%
- HiCQA-Graph: 59.4% (+1.2% absolute)
This suggests that HiCQA-Graph filtering confers measurable gains in downstream model performance compared to single-signal or naively combined filters.
5. Principal Findings and Methodological Insights
- Literature-grounded QA generation with graph-based filtering yields a microscopy VQA corpus (26,000 MCQs) that substantially surpasses prior datasets in both scale and cognitive challenge.
- HiCQA-Graph fuses NLI entailment, CLIP-based alignment, and agent signals, effectively removing ∼25% of noisy or hallucinated data and delivering 1–4% absolute performance improvement for state-of-the-art open-source MLLMs.
- After supervised fine-tuning on MicroVQA++, a 4B-parameter MLLM (InternVL3.5-4B) matches GPT-5 performance on MicroVQA, establishing a new open-model state of the art.
- MCQ-format SFT is demonstrated to benefit model calibration and performance more reliably than free-form QA fine-tuning; further gains are possible using GRPO.
- The dataset is explicitly designed to address higher-order scientific reasoning, evidenced by an increased share of questions in Bloom’s “Analyze/Create” categories.
6. Limitations and Directions for Future Work
MicroVQA++ remains dependent on the ability of an initial MLLM agent to extract QA pairs, risking the incorporation of agent-induced biases or shortcuts from its training data. Summarization is used to process overlong captions, which may truncate essential experimental details. The current design focuses on a static MCQ format; future iterations could extend to interactive, multimodal chain-of-thought reasoning, richer graph-based cross-sample filtering, and broader application to other biomedical imaging domains (Li et al., 14 Nov 2025).