HiCQA-Graph: Filtration in Microscopy VQA

Updated 21 November 2025
  • The paper introduces HiCQA-Graph, a heterogeneous graph framework that enforces cross-modal consistency across images, captions, and QA pairs to enhance dataset quality.
  • It employs GraphSAGE and GATv2 architectures alongside CLIP and NLI signals to filter out hallucinations and misalignments in generated QA samples.
  • Empirical results demonstrate that HiCQA-Graph’s filtration process retains the top 75% of QA nodes, leading to significant performance gains in downstream multimodal AI models.

HiCQA-Graph is a heterogeneous graph-based framework designed to filter and curate large-scale microscopy visual question answering (VQA) corpora by enforcing joint cross-modal consistency across images, captions, and question–answer (QA) pairs. Introduced within the MicroVQA++ data-construction pipeline, HiCQA-Graph leverages natural language inference (NLI), vision-language alignment (CLIP), and reasoning agent signals for robust sample filtration, improving dataset quality for downstream multimodal LLM (MLLM) training and evaluation (Li et al., 14 Nov 2025).

1. Role in MicroVQA++ Data Pipeline

HiCQA-Graph constitutes the second phase of a three-stage data-construction pipeline for MicroVQA++:

  1. Bootstrapping Expertise: Image–caption pairs from the BIOMEDICA archive (~2.5M microscopy-type pairs) are processed by an MLLM agent to extract initial, weakly supervised (image, question, answer) triplets, yielding an initial pool of $Q_0 \approx 26\,\mathrm{K}$ samples.
  2. Graph-based Filtration (HiCQA-Graph): The triplets are encoded within a heterogeneous graph structure, where modality-spanning consistency is evaluated and generator errors/hallucinations are removed.
  3. MCQ Generation and Human Annotation: Cleaned QA samples are extended into multiple-choice questions (MCQs) with distractors and rationale via an MLLM agent, with the test partition receiving human screening for error correction.

HiCQA-Graph specifically targets the removal of flawed or inconsistent (image, caption, QA) examples, ensuring downstream data is of high quality and suitable for training and benchmarking MLLMs in microscopy reasoning.
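
The following is a minimal, runnable sketch of the pipeline's overall shape in Python; every helper and its body is an illustrative stub standing in for the MLLM agents and the HiCQA-Graph scorer described above, not the authors' released code.

```python
# Illustrative sketch of the three-stage MicroVQA++ pipeline; every helper
# below is a placeholder stub, not the authors' implementation.
from typing import Dict, List, Tuple

def generate_qa_with_mllm(image_id: str, caption: str) -> List[Dict]:
    """Stage 1 stub: an MLLM agent would propose weakly supervised QA pairs."""
    return [{"question": f"What structure is visible in {image_id}?", "answer": caption}]

def hicqa_keep_score(sample: Dict) -> float:
    """Stage 2 stub: HiCQA-Graph would return the GNN's P(keep | QA node)."""
    return 0.5  # placeholder probability

def expand_to_mcq(sample: Dict) -> Dict:
    """Stage 3 stub: an MLLM agent would add distractors and a rationale."""
    return {**sample, "options": [sample["qa"]["answer"], "distractor A", "distractor B"]}

def build_microvqa_pp(pairs: List[Tuple[str, str]], tau: float = 0.75) -> List[Dict]:
    # Stage 1: bootstrap (image, question, answer) samples from image-caption pairs.
    samples = [{"image": img, "caption": cap, "qa": qa}
               for img, cap in pairs
               for qa in generate_qa_with_mllm(img, cap)]
    # Stage 2: rank by keep score and retain the top tau fraction (75% in the paper).
    samples.sort(key=hicqa_keep_score, reverse=True)
    kept = samples[: max(1, int(tau * len(samples)))]
    # Stage 3: expand cleaned QA pairs into MCQs (the test split also gets human screening).
    return [expand_to_mcq(s) for s in kept]
```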

2. HiCQA-Graph Formalism and Components

HiCQA-Graph represents each microscopy sample using a heterogeneous graph $G = (V, E, I, T)$, where:

  • Nodes ($V$):
    • $V_\mathrm{img} = \{v_\mathrm{img_k}\}$: Microscopy image nodes
    • $V_\mathrm{cap} = \{v_\mathrm{cap_k}\}$: Caption nodes
    • $V_\mathrm{qa} = \{v_\mathrm{qa_{k,\ell}}\}$: QA nodes (one per question–answer pair)
  • Edge Types ($E$):
    1. $v_\mathrm{img} \rightarrow v_\mathrm{cap}$: "described_by"
    2. $v_\mathrm{img} \rightarrow v_\mathrm{qa}$: "asked_about"
    3. $v_\mathrm{cap} \rightarrow v_\mathrm{qa}$: "supports" (NLI-based textual entailment)
    4. $v_\mathrm{qa} \rightarrow v_\mathrm{qa'}$: "similar" (inter-QA similarity)
  • Node Features:
    • Image embedding: $\mathbf{v}_k \in \mathbb{R}^f$ from CLIP ViT-L/14, augmented with the image–caption cosine similarity $c_\mathrm{img-cap_k}$ and its normalized value $\tilde{t}_\mathrm{img-cap_k} = (c_\mathrm{img-cap_k} + 1)/2$.
    • Caption embedding: $\mathbf{t}_k \in \mathbb{R}^f$ via the CLIP text encoder.
    • QA embedding: $\mathbf{q}_{k,\ell} \in \mathbb{R}^f$ with the appended image–QA cosine statistic $c_\mathrm{img-qa_{k,\ell}}$, normalized as $\tilde{q}_\mathrm{img-qa_{k,\ell}}$.
  • Edge Attributes:
    • Entailment (caption→QA): $p^\mathrm{ent}_{k,\ell} = f_\mathrm{NLI}(c_k, a_{k,\ell}) \in [0, 1]$
    • QA similarity (QA→QA): $a^{qq}_{k,\ell\to m} = \max\{0, \cos(\mathbf{q}_{k,\ell}, \mathbf{q}_{k,m})\}$
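
Assuming a PyTorch Geometric implementation (the summary does not name a framework), one batch of microscopy samples could be packed into this node/edge schema roughly as follows; all embeddings, cosine scores, NLI probabilities, and edge index tensors are taken as precomputed inputs, and the function name is illustrative.

```python
# Sketch: packing one batch of microscopy samples into a PyTorch Geometric
# HeteroData object, assuming CLIP embeddings, cosine scores, NLI probabilities,
# and edge index tensors (shape [2, num_edges]) are precomputed.
import torch
from torch_geometric.data import HeteroData

def build_hicqa_graph(img_emb, cap_emb, qa_emb,          # [N_img, f], [N_cap, f], [N_qa, f]
                      img_cap_cos, img_qa_cos,           # [N_img], [N_qa]
                      ent_prob, qa_qa_sim,               # [E_cap_qa], [E_qa_qa]
                      img_cap_edges, img_qa_edges,
                      cap_qa_edges, qa_qa_edges):
    data = HeteroData()

    # Node features: CLIP embeddings augmented with raw and normalized cosines.
    t_img_cap = (img_cap_cos + 1.0) / 2.0                # \tilde{t}_{img-cap}
    q_img_qa = (img_qa_cos + 1.0) / 2.0                  # \tilde{q}_{img-qa}
    data["img"].x = torch.cat([img_emb, img_cap_cos[:, None], t_img_cap[:, None]], dim=1)
    data["cap"].x = cap_emb
    data["qa"].x = torch.cat([qa_emb, img_qa_cos[:, None], q_img_qa[:, None]], dim=1)

    # Typed edges and their scalar attributes.
    data["img", "described_by", "cap"].edge_index = img_cap_edges
    data["img", "asked_about", "qa"].edge_index = img_qa_edges
    data["cap", "supports", "qa"].edge_index = cap_qa_edges
    data["cap", "supports", "qa"].edge_attr = ent_prob[:, None]              # p^ent
    data["qa", "similar", "qa"].edge_index = qa_qa_edges
    data["qa", "similar", "qa"].edge_attr = qa_qa_sim.clamp(min=0)[:, None]  # max{0, cos}
    return data
```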

3. Graph Neural Network Architecture and Filtering

HiCQA-Graph uses a two-layer HeteroConv GNN comprising:

  • GraphSAGE on Image→{Caption, QA} relations:

h_u^{(\ell+1)} = \operatorname{ReLU}\left( W_\mathrm{self}\, h_u^{(\ell)} + W_\mathrm{neigh} \cdot \operatorname{mean}_{v \in N(u)} h_v^{(\ell)} \right)

  • GATv2 on Caption→QA and QA↔QA edges with edge attributes:

\alpha_{uv} \propto \exp\left\{ a^\top \phi\left( W\,[\,h_u \parallel h_v \parallel w_e \cdot e_{uv}\,] \right) \right\}

h_u^{(\ell+1)} = \operatorname{ReLU}\left( \sum_{v \in N(u)} \alpha_{uv}\, W h_v^{(\ell)} \right)
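
In PyTorch Geometric terms (again an assumption about tooling), a two-layer HeteroConv encoder matching these update rules might look like the sketch below, with SAGEConv on the image-source relations and GATv2Conv (edge_dim=1 for the scalar edge attributes) on the caption→QA and QA↔QA relations; the hidden width is illustrative.

```python
# Sketch of the two-layer HeteroConv encoder: GraphSAGE on image-source
# relations, GATv2 with scalar edge attributes on caption->QA and QA<->QA.
import torch
from torch_geometric.nn import HeteroConv, SAGEConv, GATv2Conv

class HiCQAEncoder(torch.nn.Module):
    def __init__(self, hidden: int = 256, num_layers: int = 2):
        super().__init__()
        self.layers = torch.nn.ModuleList()
        for _ in range(num_layers):
            self.layers.append(HeteroConv({
                ("img", "described_by", "cap"): SAGEConv((-1, -1), hidden),
                ("img", "asked_about", "qa"): SAGEConv((-1, -1), hidden),
                ("cap", "supports", "qa"): GATv2Conv((-1, -1), hidden, edge_dim=1,
                                                     add_self_loops=False),
                ("qa", "similar", "qa"): GATv2Conv((-1, -1), hidden, edge_dim=1,
                                                   add_self_loops=False),
            }, aggr="sum"))

    def forward(self, x_dict, edge_index_dict, edge_attr_dict):
        # edge_attr_dict should only contain the caption->QA and QA<->QA relations
        # (entailment probability and QA-QA similarity, respectively).
        for conv in self.layers:
            out = conv(x_dict, edge_index_dict, edge_attr_dict=edge_attr_dict)
            # Image nodes are never message targets here, so carry them through.
            x_dict = {**x_dict, **{k: torch.relu(v) for k, v in out.items()}}
        return x_dict  # x_dict["qa"] holds the final QA-node representations
```

The ReLU matches the update rules above, and the lazy (-1, -1) input channels let PyTorch Geometric infer per-type feature sizes on the first forward pass.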

The final representation of each QA node, $h_\mathrm{qa}^L$, is fed into two multi-layer perceptron (MLP) heads:

  • $z_\mathrm{keep} = \mathrm{MLP}_\mathrm{keep}(h_\mathrm{qa}^L) \in \mathbb{R}^2$ (softmax: probability to keep vs. filter)
  • $z_\mathrm{cap} = \mathrm{MLP}_\mathrm{cap}(h_\mathrm{qa}^L) \in \mathbb{R}^3$ (softmax: question capacity types)
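
The summary does not give the depth or width of these heads; a minimal version over the QA-node outputs of the encoder sketched above could be:

```python
# Sketch: two task heads over the final QA-node embeddings (hidden size assumed).
import torch

class QANodeHeads(torch.nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        # Binary keep/filter head and 3-way capacity head (EU / HG / EP).
        self.mlp_keep = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, 2))
        self.mlp_cap = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, 3))

    def forward(self, h_qa: torch.Tensor):
        z_keep = self.mlp_keep(h_qa)   # logits; softmax gives P(keep) vs. P(filter)
        z_cap = self.mlp_cap(h_qa)     # logits over question capacity types
        return z_keep, z_cap
```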

The fused "keep" score for each QA node is:

y^\mathrm{keep}_{k,\ell} = \alpha\, \tilde{q}_\mathrm{img-qa_{k,\ell}} + (1-\alpha)\, p^\mathrm{ent}_{k,\ell}

where $\alpha \in [0,1]$ balances cross-modal alignment and textual entailment.
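
As a numerical illustration (values chosen arbitrarily, not taken from the paper): with $\alpha = 0.5$, a normalized image–QA cosine of $0.9$, and an entailment probability of $0.4$, the fused score is

y^\mathrm{keep}_{k,\ell} = 0.5 \cdot 0.9 + 0.5 \cdot 0.4 = 0.65

so strong visual alignment can partially offset weak caption entailment, and vice versa.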

Training uses multi-task cross-entropy objectives over "keep" (binary label) and "capacity" (EU, HG, EP; 3-way). Filtration is performed by ranking all QA nodes by the predicted $P(\mathrm{keep} \mid v_\mathrm{qa})$ and retaining the top $\tau = 75\%$, producing the final cleaned dataset ($Q_1 \approx 20\,\mathrm{K}$ train and $6\,\mathrm{K}$ test samples).
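
A hedged sketch of the multi-task objective and the top-$\tau$ ranking step, assuming the encoder and heads above, standard PyTorch losses, and an unweighted sum of the two cross-entropy terms (the relative weighting is not reported):

```python
# Sketch: multi-task training loss and top-tau filtration over QA nodes.
import torch
import torch.nn.functional as F

def multitask_loss(z_keep, z_cap, keep_labels, cap_labels, lam: float = 1.0):
    # Cross-entropy over keep/filter (binary) and capacity (EU/HG/EP, 3-way);
    # the relative weighting `lam` is an assumption, not given in the paper.
    return F.cross_entropy(z_keep, keep_labels) + lam * F.cross_entropy(z_cap, cap_labels)

def filter_top_tau(z_keep: torch.Tensor, tau: float = 0.75) -> torch.Tensor:
    # Rank QA nodes by predicted P(keep | v_qa) and retain the top tau fraction.
    p_keep = z_keep.softmax(dim=-1)[:, 1]          # column 1 assumed to be "keep"
    k = int(tau * p_keep.numel())
    return torch.topk(p_keep, k).indices           # indices of retained QA nodes
```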

4. Cross-Modal Consistency and Filtering Mechanisms

HiCQA-Graph uniquely operationalizes cross-modal consistency through integration of:

  • CLIP-based vision–language alignment: direct image–caption and image–QA cosines flag visually or contextually misaligned QAs.
  • NLI-based textual entailment: assesses logical support of each answer by the corresponding caption, via $p^\mathrm{ent}_{k,\ell}$.
  • QA–QA similarity: promotes QA diversity and redundancy mitigation via edgewise cosine similarity.

The simultaneous consideration of these signals enables explicit removal of generator hallucinations, question–answer noise, and caption misalignments that plague weakly supervised VQA data. This fused, graph-centric approach outperforms single-signal filtering strategies, with a reported performance gain of up to +2.5 percentage points on downstream benchmarks.
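
The per-sample signals can be computed with off-the-shelf models; the sketch below uses Hugging Face Transformers with CLIP ViT-L/14 (the encoder named in the paper) and an MNLI-finetuned RoBERTa as a stand-in for the unspecified NLI component, treating the caption as premise and the answer as hypothesis.

```python
# Sketch: computing the three per-sample consistency signals with
# off-the-shelf models. The NLI checkpoint is an assumption; the paper
# only names CLIP ViT-L/14 for the vision-language side.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def consistency_signals(image: Image.Image, caption: str, question: str, answer: str):
    # CLIP cosines between the image and the caption / QA text.
    inputs = proc(text=[caption, f"{question} {answer}"], images=image,
                  return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_f = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_f = clip.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img_f = torch.nn.functional.normalize(img_f, dim=-1)
    txt_f = torch.nn.functional.normalize(txt_f, dim=-1)
    c_img_cap = float(img_f @ txt_f[0])   # image-caption cosine
    c_img_qa = float(img_f @ txt_f[1])    # image-QA cosine

    # NLI entailment probability of the answer given the caption (p^ent).
    scores = nli({"text": caption, "text_pair": answer})
    if scores and isinstance(scores[0], list):   # unwrap possible batch nesting
        scores = scores[0]
    p_ent = next(s["score"] for s in scores if s["label"].upper() == "ENTAILMENT")
    return c_img_cap, c_img_qa, p_ent
```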

5. Empirical Results and Benchmark Impact

Application of HiCQA-Graph filtering to the MicroVQA++ development process yields a substantially expanded and cleaner microscopy reasoning corpus. Key results include:

  • Dataset Scale: MicroVQA++ contains approximately 19× more training questions and 5.8× more test questions than its predecessor MicroVQA, along with 33.7× more training images and 20.4× more test images.
  • Bloom's Taxonomy Levels: MicroVQA++ exhibits a higher proportion of upper-level (4–6) questions (>50%) than MicroVQA (<50%), as measured by LLM-proxy Bloom’s scoring.
  • Capacity Balance: The question type (capacity) distribution is approximately uniform across Expert Visual Understanding (EU), Hypothesis Generation (HG), and Experiment Proposal (EP), each at ≈33% of the corpus.
  • Downstream Model Performance: After supervised fine-tuning on MicroVQA++ train, 4B-parameter open-source models (e.g., InternVL3.5-4B-Instruct) close the performance gap with GPT-5 (closed-source), matching its state-of-the-art accuracy on MicroVQA (59.4% average). On the more difficult MicroVQA++ test set (6K Qs, higher Bloom’s levels), such models see strong improvements (e.g., InternVL3.5-4B-Instruct: pre-SFT 36.4% → post-SFT 46.3% average; +10 pp).
| Model | MicroVQA++ Test (Avg) | MicroVQA Benchmark (Avg) |
| --- | --- | --- |
| GPT-5 (closed-source) | not reported | 59.4% |
| InternVL3.5-4B-Instruct (pre-SFT) | 36.4% | 46.6% |
| InternVL3.5-4B-Instruct (post-SFT) | 46.3% | 59.4% |

This table summarizes performance improvements enabled by HiCQA-Graph curation as reported in (Li et al., 14 Nov 2025).

6. Limitations and Future Prospects

Identified constraints of HiCQA-Graph and the overall MicroVQA++ pipeline include:

  • MLLM Agent Reliance: Systematic biases or hallucinations inherent in the question generation agent propagate into the dataset.
  • Computational Overhead: NLI and CLIP computations require ~110 ms per image, imposing scalability limitations.
  • Bloom’s Taxonomy Assignment: Question-level cognitive depth is measured indirectly via LLM proxy rather than direct human judgment.

Outlined future directions are:

  • Expanding human validation protocols beyond the test set.
  • Introducing evaluation axes for free-form reasoning and explanation quality.
  • Exploring unsupervised graph-based denoising strategies to avoid expensive NLI inference.
  • Extending to further modalities (e.g., time-lapse or volumetric microscopy).

7. Significance within Multimodal Biomedical AI

HiCQA-Graph constitutes the first graph-based model that jointly encodes (image, caption, QA) for cross-modal consistency filtering in microscopy QA datasets. Its integration of NLI, CLIP, and agent-driven signals establishes a new benchmark for large-scale, quality-controlled medical VQA resource construction. The high-quality MicroVQA++ corpus curated via HiCQA-Graph enables state-of-the-art microscopy reasoning by both proprietary and open-source MLLMs at the 4B-parameter scale (Li et al., 14 Nov 2025). This streamlines rigorous evaluation and future development of interpretable, high-capacity multimodal AI for biomedical imaging.
