HiCQA-Graph: Filtration in Microscopy VQA
- The paper introduces HiCQA-Graph, a heterogeneous graph framework that enforces cross-modal consistency across images, captions, and QA pairs to enhance dataset quality.
- It employs GraphSAGE and GATv2 architectures alongside CLIP and NLI signals to filter out hallucinations and misalignments in generated QA samples.
- Empirical results demonstrate that HiCQA-Graph’s filtration process retains the top 75% of QA nodes, leading to significant performance gains in downstream multimodal AI models.
HiCQA-Graph is a heterogeneous graph-based framework designed to filter and curate large-scale microscopy visual question answering (VQA) corpora by enforcing joint cross-modal consistency across images, captions, and question–answer (QA) pairs. Introduced within the MicroVQA++ data-construction pipeline, HiCQA-Graph leverages natural language inference (NLI), vision-language alignment (CLIP), and reasoning agent signals for robust sample filtration, improving dataset quality for downstream multimodal LLM (MLLM) training and evaluation (Li et al., 14 Nov 2025).
1. Role in MicroVQA++ Data Pipeline
HiCQA-Graph constitutes the second phase of a three-stage data-construction pipeline for MicroVQA++:
- Bootstrapping Expertise: Image–caption pairs from the BIOMEDICA archive (~2.5M microscopy-type pairs) are processed by an MLLM agent to extract initial, weakly supervised (image, QA) triplets, yielding an initial pool of candidate samples.
- Graph-based Filtration (HiCQA-Graph): The triplets are encoded within a heterogeneous graph structure, where modality-spanning consistency is evaluated and generator errors/hallucinations are removed.
- MCQ Generation and Human Annotation: Cleaned QA samples are extended into multiple-choice questions (MCQs) with distractors and rationale via an MLLM agent, with the test partition receiving human screening for error correction.
HiCQA-Graph specifically targets the removal of flawed or inconsistent (image, caption, QA) examples, ensuring downstream data is of high quality and suitable for training and benchmarking MLLMs in microscopy reasoning.
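A minimal orchestration sketch of these three stages is shown below; the object and method names (mllm_agent.generate_qa, graph_filter.filter, mcq_agent.to_mcq) are hypothetical placeholders rather than the authors' actual interfaces.

```python
# Hypothetical orchestration of the three MicroVQA++ stages described above.
# All object and method names are illustrative placeholders.

def build_microvqa_pp(image_caption_pairs, mllm_agent, graph_filter, mcq_agent):
    # Stage 1: bootstrap weakly supervised (image, QA) triplets from captions.
    triplets = [
        (image, caption, qa)
        for image, caption in image_caption_pairs
        for qa in mllm_agent.generate_qa(image, caption)
    ]

    # Stage 2: HiCQA-Graph filtration removes inconsistent or hallucinated triplets.
    kept = graph_filter.filter(triplets)

    # Stage 3: expand cleaned QA pairs into MCQs with distractors and rationales;
    # the test partition additionally receives human screening.
    return [mcq_agent.to_mcq(image, caption, qa) for image, caption, qa in kept]
```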
2. HiCQA-Graph Formalism and Components
HiCQA-Graph represents each microscopy sample using a heterogeneous graph (a construction sketch follows this list) with:
- Node types:
  - Image nodes: one per microscopy image
  - Caption nodes: one per caption
  - QA nodes: one per question–answer pair
- Edge types:
  - image → caption: "described_by"
  - image → QA: "asked_about"
  - caption → QA: "supports" (NLI-based textual entailment)
  - QA ↔ QA: "similar" (inter-QA similarity)
- Node features:
  - Image nodes: CLIP ViT-L/14 image embeddings, augmented with the image–caption cosine similarity and its normalized value.
  - Caption nodes: CLIP text-encoder embeddings of the caption.
  - QA nodes: CLIP text-encoder embeddings of the QA pair, with the image–QA cosine statistic and its normalized value appended.
- Edge attributes:
  - caption → QA ("supports"): NLI entailment score of the QA pair given the caption.
  - QA ↔ QA ("similar"): cosine similarity between QA embeddings.
3. Graph Neural Network Architecture and Filtering
HiCQA-Graph uses a 2-layer HeteroConv GNN comprising:
- GraphSAGE convolutions on the image → caption and image → QA relations.
- GATv2 convolutions on the caption → QA and QA ↔ QA relations, conditioned on the corresponding edge attributes.
The final embedding of each QA node is fed to two multi-layer perceptron (MLP) heads (a PyTorch Geometric sketch of the architecture follows this list):
- A keep head (softmax over keep vs. filter).
- A capacity head (softmax over question capacity types).
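A minimal PyTorch Geometric sketch of this encoder, assuming the HeteroData layout from the construction sketch above; hidden width, head depth, and aggregation choices are placeholders rather than the paper's hyperparameters.

```python
import torch
from torch_geometric.nn import HeteroConv, SAGEConv, GATv2Conv

class HiCQAGNN(torch.nn.Module):
    """Illustrative 2-layer HeteroConv encoder with keep and capacity heads."""

    def __init__(self, hidden=256, num_capacities=3):
        super().__init__()

        def hetero_layer():
            return HeteroConv({
                # GraphSAGE on image -> {caption, qa} relations.
                ('image', 'described_by', 'caption'): SAGEConv((-1, -1), hidden),
                ('image', 'asked_about', 'qa'): SAGEConv((-1, -1), hidden),
                # GATv2 with edge attributes on caption -> qa and qa <-> qa.
                ('caption', 'supports', 'qa'): GATv2Conv(
                    (-1, -1), hidden, edge_dim=1, add_self_loops=False),
                ('qa', 'similar', 'qa'): GATv2Conv(
                    (-1, -1), hidden, edge_dim=1, add_self_loops=False),
            }, aggr='sum')

        self.conv1, self.conv2 = hetero_layer(), hetero_layer()

        def mlp(out_dim):
            return torch.nn.Sequential(
                torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
                torch.nn.Linear(hidden, out_dim))

        self.keep_head = mlp(2)                   # keep vs. filter
        self.capacity_head = mlp(num_capacities)  # EU / HG / EP

    def forward(self, x_dict, edge_index_dict, edge_attr_dict):
        h = self.conv1(x_dict, edge_index_dict, edge_attr_dict=edge_attr_dict)
        h = {k: v.relu() for k, v in h.items()}
        # Image nodes only appear as sources, so carry their raw features forward.
        h['image'] = x_dict['image']
        h = self.conv2(h, edge_index_dict, edge_attr_dict=edge_attr_dict)
        qa = h['qa']
        # Softmax of these logits gives the keep probability and capacity distribution.
        return self.keep_head(qa), self.capacity_head(qa)
```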
A fused "keep" score is then computed for each QA node, in which a weighting coefficient balances the cross-modal (CLIP) alignment signal against the textual (NLI) entailment signal.
Training uses multi-task cross-entropy objectives over "keep" (binary label) and "capacity" (EU, HG, EP; 3-way). Filtration is performed by ranking all QA nodes by their fused keep score and retaining the top 75%, producing the final cleaned train and test splits.
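A sketch of the training objective and ranking-based filtration follows, reusing the HiCQAGNN outputs from the previous snippet. The fusion of the keep probability with the CLIP and NLI signals is written as a simple alpha-weighted combination purely for illustration; it is an assumed form, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def multitask_loss(keep_logits, capacity_logits, keep_labels, capacity_labels):
    # Multi-task cross-entropy over the binary keep label and the 3-way
    # capacity label (EU / HG / EP).
    return (F.cross_entropy(keep_logits, keep_labels) +
            F.cross_entropy(capacity_logits, capacity_labels))

def fused_keep_score(keep_logits, clip_align, nli_entail, alpha=0.5):
    # Assumed fusion (illustration only): the GNN keep probability is combined
    # with an alpha-weighted mix of CLIP alignment and NLI entailment, where
    # alpha balances cross-modal alignment against textual entailment.
    p_keep = keep_logits.softmax(-1)[:, 1]  # class 1 = keep (assumed ordering)
    return p_keep * (alpha * clip_align + (1.0 - alpha) * nli_entail)

def filter_top_fraction(scores, fraction=0.75):
    # Rank QA nodes by fused keep score and retain the top 75% (paper's setting).
    k = int(fraction * scores.numel())
    return torch.topk(scores, k).indices  # indices of retained QA nodes
```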
4. Cross-Modal Consistency and Filtering Mechanisms
HiCQA-Graph uniquely operationalizes cross-modal consistency through integration of:
- CLIP-based vision–language alignment: direct image–caption and image–QA cosines flag visually or contextually misaligned QAs.
- NLI-based textual entailment: assesses the logical support of each answer by the corresponding caption.
- QA–QA similarity: promotes QA diversity and redundancy mitigation via edgewise cosine similarity.
The simultaneous consideration of these signals enables explicit removal of generator hallucinations, question–answer noise, and caption misalignments that plague weakly supervised VQA data. This fused, graph-centric approach outperforms single-signal filtering strategies, with a reported performance gain of up to +2.5 percentage points on downstream benchmarks.
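The sketch below shows how these three signals can be computed with off-the-shelf components: CLIP ViT-L/14 is the encoder named above, while the DeBERTa MNLI checkpoint and the function names are illustrative substitutes rather than the paper's exact choices.

```python
import torch
from transformers import CLIPModel, CLIPProcessor, pipeline

# CLIP ViT-L/14 matches the encoder named above; the NLI checkpoint below is an
# illustrative substitute, not necessarily the model used in the paper.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

@torch.no_grad()
def clip_cosine(image, text):
    """Image-text cosine similarity, used to flag misaligned captions or QAs."""
    inputs = proc(text=[text], images=[image], return_tensors="pt",
                  padding=True, truncation=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def entailment_score(caption, qa_text):
    """Probability that the caption entails the (question, answer) statement."""
    scores = nli({"text": caption, "text_pair": qa_text}, top_k=None)
    return next(s["score"] for s in scores if s["label"].upper() == "ENTAILMENT")

@torch.no_grad()
def qa_similarity(qa_text_a, qa_text_b):
    """Cosine similarity between CLIP text embeddings of two QA pairs."""
    inputs = proc(text=[qa_text_a, qa_text_b], return_tensors="pt",
                  padding=True, truncation=True)
    emb = clip.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])
```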
5. Empirical Results and Benchmark Impact
Application of HiCQA-Graph filtering to the MicroVQA++ development process yields a substantially expanded and cleaner microscopy reasoning corpus. Key results include:
- Dataset Scale: MicroVQA++ contains approximately 19× more training questions and 5.8× more test questions than the predecessor MicroVQA, along with 33.7× (train) and 20.4× (test) more images.
- Bloom's Taxonomy Levels: MicroVQA++ exhibits a higher proportion of upper-level (4–6) questions (>50%) than MicroVQA (<50%), as measured by LLM-proxy Bloom’s scoring.
- Capacity Balance: The question type (capacity) distribution is approximately uniform across Expert Visual Understanding (EU), Hypothesis Generation (HG), and Experiment Proposal (EP), each at ≈33% of the corpus.
- Downstream Model Performance: After supervised fine-tuning on MicroVQA++ train, 4B-parameter open-source models (e.g., InternVL3.5-4B-Instruct) close the performance gap with GPT-5 (closed-source), matching its state-of-the-art accuracy on MicroVQA (59.4% average). On the more difficult MicroVQA++ test set (6K Qs, higher Bloom’s levels), such models see strong improvements (e.g., InternVL3.5-4B-Instruct: pre-SFT 36.4% → post-SFT 46.3% average; +10 pp).
| Model | MicroVQA++ Test (Avg) | MicroVQA Benchmark (Avg) |
|---|---|---|
| GPT-5 (closed) | – | 59.4% |
| InternVL3.5-4B-Instruct (pre-SFT) | 36.4% | 46.6% |
| InternVL3.5-4B-Instruct (post-SFT) | 46.3% | 59.4% |
This table summarizes performance improvements enabled by HiCQA-Graph curation as reported in (Li et al., 14 Nov 2025).
6. Limitations and Future Prospects
Identified constraints of HiCQA-Graph and the overall MicroVQA++ pipeline include:
- MLLM Agent Reliance: Systematic biases or hallucinations inherent in the question generation agent propagate into the dataset.
- Computational Overhead: NLI and CLIP computations require 110 ms per image, imposing scalability limitations.
- Bloom’s Taxonomy Assignment: Question-level cognitive depth is measured indirectly via LLM proxy rather than direct human judgment.
Outlined future directions are:
- Expanding human validation protocols beyond the test set.
- Introducing evaluation axes for free-form reasoning and explanation quality.
- Exploring unsupervised graph-based denoising strategies to avoid expensive NLI inference.
- Extension to further modalities (e.g., time-lapse, volumetric microscopy).
7. Significance within Multimodal Biomedical AI
HiCQA-Graph constitutes the first graph-based model that jointly encodes (image, caption, QA) for cross-modal consistency filtering in microscopy QA datasets. Its integration of NLI, CLIP, and agent-driven signals establishes a new benchmark for large-scale, quality-controlled medical VQA resource construction. The high-quality MicroVQA++ corpus curated via HiCQA-Graph enables state-of-the-art microscopy reasoning by both proprietary and open-source MLLMs at the 4B-parameter scale (Li et al., 14 Nov 2025). This streamlines rigorous evaluation and future development of interpretable, high-capacity multimodal AI for biomedical imaging.