MArxivAgent Pipeline for VQA in Materials Science

Updated 11 August 2025
  • The MArxivAgent pipeline is an automated research data generation system that converts arXiv materials science papers into multimodal VQA tasks using structured, ontology-based reasoning.
  • It employs granular extraction, structure-property-performance (SPP) chain extraction, and a two-stage shortcut elimination process to ensure questions require genuine cross-modal analysis.
  • Evaluation on the MatVQA dataset reveals significant performance gaps in current multimodal models, emphasizing the need for enhanced visual and scientific reasoning capabilities.

The MArxivAgent pipeline is an automated research data generation system designed for constructing rigorous multimodal scientific reasoning benchmarks in materials science. It extracts literature-sourced data (figures, captions, and contextual text) and transforms it into multiple-choice visual question-answering (VQA) tasks that demand cross-modal reasoning and high-fidelity visual perception while strictly avoiding linguistic shortcuts. The resulting MatVQA dataset supports evaluation and development of multimodal large language models (MLLMs) on research-level analytic challenges not addressed by previous text-centric datasets.

1. Automated Pipeline Architecture

The MArxivAgent pipeline is structured to programmatically extract, process, and validate information from recent materials science manuscripts (e.g., 500 arXiv papers from 2024). Major sequential components are:

  • Data Extraction: Marker tools parse PDFs to identify figures, captions, and local narrative context. This captures visual data and the associated scientific explanation required for scientific reasoning.
  • Verifiable Reasoning Path Extraction: For each figure and caption, the pipeline extracts causal chains according to domain-specific ontologies (MatOnto). Scientific reasoning chains are structured as transitions along Structure (S), Property (P), Performance (Pe), Processing (Pr), and Environment (E) dimensions. For example:

$$\text{10 T magnetic field (E)} \rightarrow \text{collapse of Bragg satellites (S)} \rightarrow \text{suppression of spin-cycloid (P)} \rightarrow \text{redistribution of scattering intensity (Pe)}$$

Each chain is directly referenced and validated against source text using algorithmic matching.

  • MCQ Construction: Using the reasoning path and extracted visual/contextual data, the pipeline generates varying MCQ types—causal, comparative, hypothetical, quantitative—by combining figure analysis requirements with domain-specific distractors.
  • Shortcut Removal: An iterative two-stage process eliminates "language shortcuts" and "caption shortcuts." In stage 1, evaluators attempt to answer the question without viewing the figure; if they succeed, the question is automatically rewritten and rechecked. Stage 2 repeats this process with captions only, ensuring that the vision component is indispensable for a correct response.
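
The reasoning-path step above can be made concrete with a small sketch. It represents a chain as a list of tagged steps and flags any step whose phrase does not appear in the source text. The names `Step`, `normalize`, and `unsupported_steps` are hypothetical, and plain substring matching is an assumption standing in for whatever "algorithmic matching" the pipeline actually uses.

```python
from dataclasses import dataclass

# Dimension tags as listed in the text: Structure, Property, Performance,
# Processing, Environment. All names in this sketch are hypothetical.
DIMENSIONS = {"S", "P", "Pe", "Pr", "E"}


@dataclass(frozen=True)
class Step:
    text: str       # phrase describing this link of the chain
    dimension: str  # expected to be one of DIMENSIONS


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout."""
    return " ".join(s.lower().split())


def unsupported_steps(chain, source_text):
    """Return steps with an unknown tag or a phrase absent from the source.

    Assumption: naive substring matching; the real matcher is not specified.
    """
    haystack = normalize(source_text)
    return [
        step for step in chain
        if step.dimension not in DIMENSIONS
        or normalize(step.text) not in haystack
    ]


# The example chain from the text above.
chain = [
    Step("10 T magnetic field", "E"),
    Step("collapse of Bragg satellites", "S"),
    Step("suppression of spin-cycloid", "P"),
    Step("redistribution of scattering intensity", "Pe"),
]
```

An empty result from `unsupported_steps(chain, passage)` would mean every step is literally present in the passage; a production matcher would likely need fuzzier matching to tolerate paraphrase.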

2. Dataset Generation and Item Characteristics

The MatVQA dataset comprises 1,325 MCQs generated entirely from genuine materials science literature, with each item validated for scientific integrity and reasoning depth.

  • Task Types: The dataset covers four reasoning tasks—Quantitative SPP, Comparative SPP, Causal SPP (950 items), and Hypothetical Variation.
  • Reasoning Depth: Every item is built around a multi-hop reasoning chain spanning scientific concepts. Correct solutions demand integration of low-level visual evidence (e.g., microscopy images, diffraction patterns) and nuanced domain knowledge.
  • Visual Integration: Figures are central; items explicitly require analysis of image morphology, contrast, or quantification (e.g., spot counting, lattice fringe analysis).
  • Quality Assurance: Items are randomly audited (20%) by domain experts and algorithmically verified against extracted scientific reasoning paths, ensuring robust mapping to the original literature.
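
As an illustration of the random audit described above, the following sketch draws a reproducible 20% sample of item indices for expert review. The function name `sample_for_audit`, the fixed seed, and the use of integer item IDs are assumptions made for the example; the source does not say how the audited items are selected beyond their being random.

```python
import random


def sample_for_audit(item_ids, fraction=0.20, seed=0):
    """Pick a reproducible random subset of items for expert audit.

    Hypothetical helper: the seed and rounding rule are assumptions.
    """
    ids = list(item_ids)
    rng = random.Random(seed)          # local RNG, so global state is untouched
    k = round(len(ids) * fraction)     # number of items to audit
    return sorted(rng.sample(ids, k))  # sampling without replacement


# 1,325 items, matching the dataset size stated above.
audit_ids = sample_for_audit(range(1325))
```

With 1,325 items and a 20% fraction, this selects 265 items for review.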

3. Iterative Refinement and Shortcut Elimination

To ensure multimodal rigor and prevent models from exploiting textual artifacts, MArxivAgent implements a two-stage shortcut elimination protocol:

  • Language Shortcut Removal (Stage 1): An automated evaluator is tasked with solving the MCQ using only text outside of figures. If it succeeds, a rewriting agent and semantic checker modify the question, removing cues until figure analysis becomes required.
  • Caption Shortcut Removal (Stage 2): A similar process uses only figure captions. The MCQ is reformulated so that neither caption nor text suffices, and a correct answer demands direct visual analysis.

This maximizes the necessity for grounded cross-modal reasoning and disables simple text-based inference strategies.
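
The two stages can be sketched as a check-and-rewrite loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `solve` and `rewrite` are hypothetical callables standing in for the automated evaluator and the rewriting agent, the `max_rewrites` cap and the choice to discard an item that stays solvable are assumptions, and the semantic checker is omitted.

```python
from typing import Callable, Optional

Solver = Callable[[str, str], str]  # (question, visible_context) -> predicted answer
Rewriter = Callable[[str], str]     # question -> question with cues removed


def harden(question: str, answer: str, context: str,
           solve: Solver, rewrite: Rewriter, max_rewrites: int) -> Optional[str]:
    """Rewrite until the solver can no longer answer from `context` alone."""
    rewrites = 0
    while solve(question, context) == answer:
        if rewrites == max_rewrites:
            return None                  # still solvable without the figure
        question = rewrite(question)
        rewrites += 1
    return question


def remove_shortcuts(question: str, answer: str, body_text: str, caption: str,
                     solve: Solver, rewrite: Rewriter,
                     max_rewrites: int = 3) -> Optional[str]:
    """Stage 1 uses body text only; stage 2 uses the caption only."""
    for context in (body_text, caption):
        hardened = harden(question, answer, context, solve, rewrite, max_rewrites)
        if hardened is None:
            return None                  # assumption: such items are discarded
        question = hardened
    return question


# Toy stand-ins: the solver "leaks" the answer while a cue word remains.
def toy_solve(question: str, context: str) -> str:
    return "B" if "cycloid" in question else "?"


def toy_rewrite(question: str) -> str:
    return question.replace("cycloid", "feature")


hardened = remove_shortcuts("Which panel shows the cycloid collapse?", "B",
                            "body text", "caption", toy_solve, toy_rewrite)
```

One simplification to note: a question rewritten in stage 2 is not rechecked against the body text, so a stricter variant would loop over both contexts until neither yields the answer.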

4. Benchmarking MLLMs on MatVQA

Evaluation of the dataset is performed with 17 open- and closed-source MLLMs. Salient details include:

  • Prompting Protocol: Chain-of-thought prompting is used to encourage multi-step scientific reasoning; evaluation is split by reasoning task.
  • Performance Findings: Even state-of-the-art models (e.g., Claude 3.7 Sonnet, 51.9% accuracy) show large gaps versus human performance, with accuracy drops after shortcut removal highlighting the increased challenge.
  • Difficulty Escalation: The two-stage refinement process demonstrates sharp declines in model accuracy as shortcut cues are removed, confirming that tasks demand genuine multimodal scientific inference.
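
The per-task evaluation and the accuracy drop after shortcut removal amount to simple bookkeeping, sketched below. The function names and the `(task_type, predicted, gold)` record layout are hypothetical, and the records are made-up toy predictions, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """records: iterable of (task_type, predicted, gold) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def accuracy_drop(before, after):
    """Per-task accuracy lost between two evaluation rounds."""
    return {task: before[task] - after[task] for task in before if task in after}


# Toy data only: the same three items answered before and after refinement.
toy_before = [("causal", "A", "A"), ("causal", "B", "B"), ("quantitative", "C", "C")]
toy_after = [("causal", "A", "A"), ("causal", "D", "B"), ("quantitative", "C", "C")]

drop = accuracy_drop(accuracy_by_task(toy_before), accuracy_by_task(toy_after))
```

A positive value in `drop` for a task type indicates that items of that type became harder once shortcut cues were removed.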

5. Reasoning Chain Structure and Scientific Grounding

Each MatVQA item is scientifically grounded using the extracted SPP chain:

  • SPP Chains: Reasoning from environmental stimulus (E) through structural change (S) to material property (P) and performance outcome (Pe) is programmatically extracted and validated against ontology terms.
  • Integration of Modalities: Final MCQs require not only correct domain understanding but also direct analysis of visual granularities—models need to reconcile pixel-level features with underlying physical and chemical principles.

6. Implications for Model Evaluation and Future Research

The MArxivAgent pipeline and MatVQA benchmark have several implications for advanced multimodal AI:

  • Benchmark Advancement: By targeting rigorous, visually-grounded scientific reasoning with shortcut disabling, MatVQA challenges existing MLLMs far beyond text benchmarks, providing diagnostic insight into current deficiencies.
  • Scalability: The pipeline is fully automated and designed for scalability (with expansion to ∼12,000 questions planned), easily extensible to other scientific domains where multimodal reasoning is essential.
  • Methodological Upgrades: A plausible implication is that models must improve at cross-modal feature extraction, semantic grounding of visual concepts, and mechanistic reasoning, not just language modeling.
  • Scientific Discovery Tools: The real-world nature of tasks (e.g., novel materials characterization, image-based discovery) positions MatVQA and MArxivAgent as blueprints for future AI systems intended to augment scientific research.

In summary, the MArxivAgent pipeline automates the generation of complex, visually demanding, scientifically validated VQA items from arXiv literature. Its two-stage shortcut elimination, SPP reasoning path extraction, and domain-expert validation together support the construction of challenging benchmarks such as MatVQA, which reveals substantial gaps in present-day multimodal model capability and provides a foundation for future cross-domain scientific AI evaluation and development (Wu et al., 23 May 2025).

References (1)