
MicroVQA++: Advanced VQA for Microscopy Imaging

Updated 21 November 2025
  • The paper introduces MicroVQA++, a novel corpus built via expert-validated figure-caption pairs, graph-based quality filtering, and MLLM-driven MCQ generation to enhance scientific reasoning.
  • MicroVQA++ employs a heterogeneous HiCQA-Graph that fuses CLIP alignment, NLI entailment, and agent-derived signals, retaining the top 75% of QA pairs to reduce noise.
  • Experimental results demonstrate that supervised fine-tuning on MicroVQA++ enables open-source models to achieve state-of-the-art performance comparable to closed-source systems.

MicroVQA++ is a large-scale, high-quality visual question answering (VQA) corpus targeting advanced scientific reasoning over microscopy images. Developed to address the paucity of challenging benchmarks for multimodal LLMs (MLLMs) in biomedical imaging, MicroVQA++ introduces new methodologies for dataset construction, graph-based quality filtering, and rigorous evaluation, yielding a corpus that surpasses prior resources such as MicroVQA in both scale and difficulty (Li et al., 14 Nov 2025).

1. Dataset Construction Pipeline

MicroVQA++ is built through a three-stage process designed to maximize scale, relevance, and scientific rigor.

Stage 1: Bootstrapping from Expert-Validated Figure–Caption Pairs.

The corpus originates from BIOMEDICA (Lozano et al. 2025), a PubMed-derived archive containing approximately 24 million image–caption pairs, with ∼2.5M classified as microscopy images accompanied by expert-written captions. An MLLM agent extracts factual answer spans from each caption and generates questions that reference the paired image and fall into one of three reasoning "capacities": Expert Visual Understanding (EU), Hypothesis Generation (HG), or Experiment Proposal (EP). This stage produces approximately 30,000 QA pairs associated with ∼13,000 unique microscopy images.
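
A minimal sketch of this bootstrapping step, assuming a generic chat-style MLLM client; the call_mllm helper, prompt wording, and JSON output schema are illustrative, not the authors' implementation:

```python
import json

CAPACITIES = ["EU", "HG", "EP"]  # Expert Visual Understanding, Hypothesis Generation, Experiment Proposal

PROMPT_TEMPLATE = """You are given a microscopy figure and its expert-written caption.
Caption: {caption}
1. Extract a short factual answer span from the caption.
2. Write a question about the image whose answer is that span.
3. Label the question with one capacity: EU, HG, or EP.
Return JSON: {{"question": ..., "answer": ..., "capacity": ...}}"""

def bootstrap_qa(image_path: str, caption: str, call_mllm) -> dict:
    """Generate one (question, answer, capacity) triple from a figure-caption pair.

    `call_mllm(image_path, prompt) -> str` is a hypothetical wrapper around
    whichever multimodal LLM serves as the extraction agent.
    """
    raw = call_mllm(image_path, PROMPT_TEMPLATE.format(caption=caption))
    qa = json.loads(raw)
    assert qa["capacity"] in CAPACITIES
    return qa
```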

Stage 2: HiCQA-Graph Filtering.

A heterogeneous graph is constructed containing nodes for images (I), captions (C), and QA pairs (Q), with edges encoding their relationships (e.g., description, support, similarity). The HiCQA-Graph architecture ranks QA pairs for cross-modal consistency through learned scoring, using signals from CLIP-based vision-language alignment, natural language inference (NLI) entailment, and agent-derived annotations. The top 75% of QA samples are retained (∼23,000 pairs over ∼10,000 images), eliminating a significant fraction of hallucinated or misaligned data.

Stage 3: MLLM-Driven MCQ Generation and Human Screening.

Each filtered (question, answer) pair is transformed into a four-option multiple-choice question (MCQ) by a second MLLM agent, which generates three distractors and a chain-of-thought (CoT) rationale. The dataset is split into a 20,000-example training set and a 6,000-example human-checked test set. All test items are screened by human reviewers, and there is no overlap with MicroVQA.
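
The resulting records amount to four-option MCQs with a rationale and a capacity label; a minimal, assumed schema (field names are illustrative, not taken from the released data):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MicroMCQ:
    image_id: str       # microscopy image the question refers to
    question: str
    options: List[str]  # four options: the original answer plus three MLLM-generated distractors
    answer_index: int   # index of the correct option
    rationale: str      # chain-of-thought explanation produced by the MCQ-generation agent
    capacity: str       # "EU", "HG", or "EP"
```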

2. HiCQA-Graph: Architecture and Filtering Mechanism

HiCQA-Graph is a heterogeneous graph $G = (V, E)$ with disjoint node types $V = I \cup C \cup Q$, representing images, captions, and QA items, respectively.

Node Features:

  • Image nodes: CLIP visual embedding $v_i \in \mathbb{R}^f$; similarity with the caption $c^{\text{img-cap}}_i = \cos(v_i, t_i)$, normalized as $\tilde{c}^{\text{img-cap}}_i = (c^{\text{img-cap}}_i + 1)/2 \in [0, 1]$; feature vector $x^{\text{img}}_i = [v_i; \tilde{c}^{\text{img-cap}}_i]$.
  • Caption nodes: CLIP text embedding $t_j \in \mathbb{R}^f$ (long captions are optionally summarized).
  • QA nodes: embedding $q_{i,k}$ of the concatenated question and answer; similarity with the image $c^{\text{img-qa}}_{i,k} = \cos(v_i, q_{i,k})$, normalized as $\tilde{c}^{\text{img-qa}}_{i,k} = (c^{\text{img-qa}}_{i,k} + 1)/2$; feature vector $x^{\text{qa}}_{i,k} = [q_{i,k}; \tilde{c}^{\text{img-qa}}_{i,k}]$ (see the feature-construction sketch after this list).
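
A feature-construction sketch, assuming CLIP image/text and question-answer embeddings have already been computed; tensor names mirror the notation above:

```python
import torch
import torch.nn.functional as F

def build_node_features(v_img: torch.Tensor,       # (N_img, f)  CLIP image embeddings v_i
                        t_cap: torch.Tensor,       # (N_img, f)  CLIP caption embeddings t_i (one caption per image)
                        q_emb: torch.Tensor,       # (N_qa, f)   embeddings of concatenated question+answer q_{i,k}
                        qa_to_img: torch.Tensor):  # (N_qa,)     index of the image each QA pair belongs to
    # Image-caption similarity, mapped from [-1, 1] to [0, 1].
    c_img_cap = (F.cosine_similarity(v_img, t_cap, dim=-1) + 1) / 2
    x_img = torch.cat([v_img, c_img_cap.unsqueeze(-1)], dim=-1)    # x^img_i = [v_i; c~_i]

    # Image-QA similarity for each QA node, normalized the same way.
    c_img_qa = (F.cosine_similarity(v_img[qa_to_img], q_emb, dim=-1) + 1) / 2
    x_qa = torch.cat([q_emb, c_img_qa.unsqueeze(-1)], dim=-1)      # x^qa_{i,k} = [q_{i,k}; c~_{i,k}]

    x_cap = t_cap                                                   # caption nodes use the text embedding directly
    return x_img, x_cap, x_qa, c_img_qa
```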

Edge Types:

  • (I → C) “describe_by” (unweighted)
  • (I → Q) “asked_about” (unweighted)
  • (C → Q) “supports”, weighted by the NLI entailment probability $p^{\text{ent}}_{i,k} = P(\text{caption}_i \text{ entails } \text{answer}_{i,k})$
  • (Q → Q) “similar”, with attribute $a^{\text{qa-qa}}_{i,(k \to \ell)} = \max(0, \cos(q_{i,k}, q_{i,\ell}))$ (see the edge-attribute sketch after this list)
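
A sketch of the two weighted edge attributes; the choice of roberta-large-mnli as the NLI backbone is an assumption for illustration (the summary does not name the model used):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "roberta-large-mnli"  # assumed checkpoint; its label order is (contradiction, neutral, entailment)
nli_tok = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name).eval()

@torch.no_grad()
def entailment_prob(caption: str, answer: str) -> float:
    """p^ent = P(caption entails answer), used as the (C -> Q) 'supports' edge weight."""
    enc = nli_tok(caption, answer, return_tensors="pt", truncation=True)
    probs = nli_model(**enc).logits.softmax(dim=-1)
    return probs[0, 2].item()  # index 2 = entailment for this checkpoint

def qa_qa_similarity(q_k: torch.Tensor, q_l: torch.Tensor) -> float:
    """a^{qa-qa} = max(0, cos(q_k, q_l)), the (Q -> Q) 'similar' edge attribute."""
    return max(0.0, F.cosine_similarity(q_k, q_l, dim=0).item())
```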

Supervision:

  • Each QA node carries a capacity label $y^{\text{cap}}_{i,k} \in \{\text{EU}, \text{HG}, \text{EP}\}$.
  • The soft "keep" score used for filtering is $y^{\text{keep}}_{i,k} = \alpha\, \tilde{c}^{\text{img-qa}}_{i,k} + (1 - \alpha)\, p^{\text{ent}}_{i,k}$ with $\alpha \in [0, 1]$ (a short computation sketch follows this list).

Graph Neural Network:

  • Feature projection into a common hidden space.
  • Two heterogeneous convolution layers:
    • On (I → C) and (I → Q) edges, GraphSAGE updates: $h_u^{(\ell+1)} = \sigma\big(W_{\text{self}} h_u^{(\ell)} + W_{\text{neigh}} \cdot \mathrm{mean}_{v \in N(u)} h_v^{(\ell)}\big)$.
    • On (C → Q) and (Q → Q) edges, GATv2: attention weights $\alpha_{uv}$ computed via softmax; outputs combined with LayerNorm and residual connections.
  • Two MLP heads per QA node outputting logits for "keep" and "capacity" (cross-entropy loss).
  • Filtering is performed by retaining the top 75% of QA nodes ranked by “keep” probability (see the architecture sketch below).
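
A minimal architecture sketch in PyTorch Geometric, assuming a heterogeneous graph with node types "image", "caption", "qa" and the edge types listed above; the framework choice, hidden size, and single-layer heads are illustrative, and edge weights are omitted for brevity:

```python
import torch
from torch import nn
from torch_geometric.nn import HeteroConv, SAGEConv, GATv2Conv, Linear

class HiCQAGraphNet(nn.Module):
    def __init__(self, hidden: int = 256, num_capacities: int = 3):
        super().__init__()
        # Project heterogeneous input features into a shared hidden space (lazy input dims).
        self.proj = nn.ModuleDict({t: Linear(-1, hidden) for t in ["image", "caption", "qa"]})
        conv = lambda: HeteroConv({
            ("image", "describe_by", "caption"): SAGEConv((-1, -1), hidden),
            ("image", "asked_about", "qa"):      SAGEConv((-1, -1), hidden),
            ("caption", "supports", "qa"):       GATv2Conv((-1, -1), hidden, add_self_loops=False),
            ("qa", "similar", "qa"):             GATv2Conv((-1, -1), hidden, add_self_loops=False),
        }, aggr="sum")
        self.convs = nn.ModuleList([conv(), conv()])        # two heterogeneous layers
        self.norm = nn.LayerNorm(hidden)
        self.keep_head = nn.Linear(hidden, 2)                # keep / drop logits
        self.cap_head = nn.Linear(hidden, num_capacities)    # EU / HG / EP logits

    def forward(self, x_dict, edge_index_dict):
        h = {t: self.proj[t](x).relu() for t, x in x_dict.items()}
        for conv in self.convs:
            out = conv(h, edge_index_dict)
            # Residual + LayerNorm for node types updated in this layer; image nodes pass through.
            h = {t: self.norm(out[t] + h[t]) if t in out else h[t] for t in h}
        return self.keep_head(h["qa"]), self.cap_head(h["qa"])

# Filtering step: rank QA nodes by keep probability and retain the top 75%, e.g.
# keep_prob = keep_logits.softmax(-1)[:, 1]
# keep_idx = keep_prob.topk(int(0.75 * keep_prob.numel())).indices
```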

3. Corpus Statistics and Quality Characteristics

MicroVQA++ exceeds existing microscopy-centric VQA datasets in both scale and reasoning diversity.

Split    | # MCQs | # Unique Images | Human-checked | MCQ Format | Chain of Thought
---------|--------|-----------------|---------------|------------|-----------------
Training | 20,000 | 8,594           | No            | Yes        | Yes
Test     | 6,000  | 5,198           | Yes           | Yes        | Yes

Bloom’s Taxonomy Distribution:

  • Level 1 (Remember): 10%
  • Level 2 (Understand/Apply): 45%
  • Level 3 (Analyze): 25%
  • Level 4 (Evaluate): 12%
  • Level 5 (Create): 8%

Compared to MicroVQA (1,042 Qs over 255 images, less than 15% at levels 3–5), MicroVQA++ exhibits increased size and complexity, with a larger proportion of items requiring higher-order scientific reasoning.

4. Experimental Protocols and Performance Evaluation

Models Evaluated:

  • Closed-source: GPT-4o, GPT-4o-mini, o1-series, Claude Sonnet 4.5, o4-mini, o3, GPT-5.
  • Open-source: LLaVA-Med-Mistral-7B, Qwen-2-VL-7B, InternVL3.5-2B-Instruct, InternVL3.5-4B-Instruct.

Fine-tuning (SFT) Details:

  • LoRA rank 16, applied to all attention and MLP layers.
  • AdamW optimizer: learning rate 1e-4, weight decay 0.1, cosine schedule, 5% warm-up, 3 epochs, bfloat16 precision, gradient clipping at 1.0, context window 4096, batch size 256 (a configuration sketch follows this list).
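
A configuration sketch of these hyperparameters, assuming Hugging Face PEFT and Transformers (the summary does not specify the training framework; target module names assume a LLaMA/Qwen-style layer naming):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16,                                                        # LoRA rank 16
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",      # attention projections
                    "gate_proj", "up_proj", "down_proj"],        # MLP layers
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="microvqa_pp_sft",            # illustrative path
    learning_rate=1e-4,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                       # 5% warm-up
    num_train_epochs=3,
    bf16=True,
    max_grad_norm=1.0,                       # gradient clipping at 1.0
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,          # 8 x 32 = batch size 256 (device split is illustrative)
)
# The 4096-token context window would be enforced when tokenizing/packing the MCQ + CoT sequences.
```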

Alternative Protocol:

Group Relative Policy Optimization (GRPO) with MCQs: learning rate 5e-6, KL coefficient 0.05, entropy bonus 0.001, value loss scale 0.5, 1 epoch.

Evaluation Metric:

Average MCQ accuracy, per capacity and aggregated.
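
A small sketch of this metric, assuming each test item carries its capacity label (EU/HG/EP):

```python
from collections import defaultdict

def mcq_accuracy(records):
    """records: iterable of (capacity, predicted_index, answer_index) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for capacity, pred, gold in records:
        totals[capacity] += 1
        hits[capacity] += int(pred == gold)
    per_capacity = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_capacity, overall
```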

Key Results:

Model                      | Eval set   | Accuracy (%)
---------------------------|------------|-------------
Random                     | MicroVQA   | 22.0
Human                      | MicroVQA   | 50.3
GPT-5                      | MicroVQA   | 59.4
o3                         | MicroVQA   | 59.3
InternVL3.5-4B (zero-shot) | MicroVQA   | 46.6
InternVL3.5-4B (SFT)       | MicroVQA   | 59.4
InternVL3.5-2B (SFT)       | MicroVQA   | 54.5
GPT-4o-mini                | MicroVQA++ | 37.3
LLaVA-Med-7B (SFT)         | MicroVQA++ | 45.3
InternVL3.5-4B (SFT)       | MicroVQA++ | 41.3

Filtering Ablation (InternVL3.5-4B, MicroVQA):

  • No filter: 58.2%
  • NLI only: ~57.6%
  • CLIP only: ~57.6%
  • NLI + CLIP (naive combination): 58.2%
  • HiCQA-Graph: 59.4% (+1.2% absolute)

This suggests that HiCQA-Graph filtering confers measurable gains in downstream model performance compared to single-signal or naively combined filters.

5. Principal Findings and Methodological Insights

  • Literature-grounded QA generation with graph-based filtering yields a microscopy VQA corpus (26,000 MCQs) that substantially surpasses prior datasets in both scale and cognitive challenge.
  • HiCQA-Graph fuses NLI entailment, CLIP-based alignment, and agent signals, effectively removing ∼25% of noisy or hallucinated data and delivering 1–4% absolute performance improvement for state-of-the-art open-source MLLMs.
  • After supervised fine-tuning on MicroVQA++, a 4B-parameter MLLM (InternVL3.5-4B) matches GPT-5 performance on MicroVQA, establishing a new open-model state of the art.
  • MCQ-format SFT is demonstrated to benefit model calibration and performance more reliably than free-form QA fine-tuning; further gains are possible using GRPO.
  • The dataset is explicitly designed to address higher-order scientific reasoning, evidenced by an increased share of questions in Bloom’s “Analyze/Create” categories.

6. Limitations and Directions for Future Work

MicroVQA++ remains dependent on an initial MLLM agent to extract QA pairs, risking the incorporation of agent-induced biases or shortcuts inherited from its training data. Overlong captions are summarized, which may drop essential experimental details. The current design focuses on a static MCQ format; future iterations may extend to interactive, multimodal chain-of-thought reasoning, graph-based cross-sample filtering, and broader application to other biomedical imaging domains (Li et al., 14 Nov 2025).
