SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification (2506.15569v1)

Published 18 Jun 2025 in cs.CL

Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.

Summary

An Overview of SciVer: Evaluating Multimodal Foundation Models in Scientific Claim Verification

The paper "SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification" presents the design and evaluation of the SciVer benchmark, a pioneering framework for assessing the capacity of foundation models to verify scientific claims within a multimodal context. With the proliferation of scientific literature, effectively evaluating the ability of models to synthesize information across text, tables, and charts has become increasingly crucial. SciVer addresses a gap in existing benchmarks by providing a comprehensive tool that challenges state-of-the-art models in realistic scientific settings.

SciVer Benchmark Design and Characteristics

SciVer consists of 3,000 expert-annotated examples derived from 1,113 scientific papers in the domain of computer science. It is designed around four distinct reasoning types: direct, parallel, sequential, and analytical reasoning. Each example includes expert-annotated supporting evidence, which allows for fine-grained evaluation of model performance. This design supports both claim verification and a closer analysis of how models integrate text, tables, and charts.
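
To make this concrete, a single SciVer example can be thought of as a record bundling the claim, its multimodal context, and the annotations. The sketch below is only illustrative: the field names and label values are assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SciVerExample:
    """Hypothetical layout of one SciVer example; field names and label
    values are illustrative, not the dataset's actual schema."""
    paper_id: str                 # source paper (one of the 1,113 CS papers)
    claim: str                    # the scientific claim to verify
    context_text: str             # relevant passages from the paper
    tables: List[str] = field(default_factory=list)    # serialized tables or table images
    charts: List[str] = field(default_factory=list)    # chart/figure images
    reasoning_type: str = "direct"                      # direct | parallel | sequential | analytical
    evidence: List[str] = field(default_factory=list)  # expert-annotated supporting evidence
    label: str = "entailed"                             # gold label (e.g., entailed / refuted; label set assumed)
```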

Alongside the benchmark, the paper reports an analysis of 21 multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. A central finding is the substantial performance gap between models and human experts, especially on complex reasoning tasks. For instance, the best-performing proprietary model, o4-mini, achieved 77.7% accuracy, whereas human experts attained 93.8%.

Contributions and Experimental Analysis

The authors of the paper have made several significant contributions:

  • They introduced a high-quality benchmark for multimodal scientific claim verification that thoroughly tests models' reasoning across diverse scenarios.
  • They conducted an extensive evaluation of the capabilities and limitations of leading open-source and proprietary models. Their results illustrate the complexity of the SciVer tasks, with models approaching human-expert performance only on simpler reasoning tasks.
  • The paper provides insights into the shortcomings of existing models. Through analyses such as retrieval-augmented generation (RAG) and Chain-of-Thought (CoT) reasoning, it identifies specific weaknesses, including multi-step reasoning errors, heavy reliance on text, and domain-specific misconceptions (a minimal RAG sketch follows this list).
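
To make the RAG analysis concrete, the following is a minimal sketch of how retrieval-augmented claim verification might be wired up. It assumes a generic generate(prompt) hook standing in for whichever foundation model is evaluated and uses plain TF-IDF retrieval over text chunks; the paper's actual retriever, chunking, and prompt format may differ.

```python
from typing import Callable, List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(claim: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the claim under TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer().fit(chunks + [claim])
    chunk_vecs = vectorizer.transform(chunks)
    claim_vec = vectorizer.transform([claim])
    scores = cosine_similarity(claim_vec, chunk_vecs)[0]
    top_k = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top_k]

def verify_claim(claim: str, chunks: List[str],
                 generate: Callable[[str], str]) -> str:
    """Ask the model to judge the claim given only the retrieved evidence."""
    evidence = "\n\n".join(retrieve_context(claim, chunks))
    prompt = (
        "Evidence from the paper:\n"
        f"{evidence}\n\n"
        f"Claim: {claim}\n"
        "Answer with 'entailed' or 'refuted'."
    )
    return generate(prompt).strip().lower()
```

Benchmark accuracy would then follow from comparing the returned labels against the expert annotations across the 3,000 examples.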

Implications and Future Directions

The implications of this work are both practical and theoretical. Practically, SciVer offers a stringent benchmark, driving the advancement of foundation models in understanding scientific documents. The significant performance disparity between human experts and machines underscores the need for improvements in models' ability to handle complex, multimodal reasoning tasks.

Theoretically, the insights gained from the analysis could guide future research in enhancing model architectures for more sophisticated integration of diverse data types. The paper suggests exploring advanced retrieval mechanisms and incorporating domain-specific knowledge to mitigate the prevalent errors seen in current systems. Furthermore, models' reliance on the textual modality at the expense of visual and tabular data points to a need to refine multimodal integration capabilities.

In conclusion, SciVer stands as a vital tool for advancing the assessment and development of foundation models in scientific claim verification. Its explicit focus on multimodal reasoning challenges fosters the development of more robust and sophisticated AI systems, underscoring the complexity inherent in scientific literature. The paper not only presents a robust framework but also provides a blueprint for future innovations in the field.
