- The paper introduces LiveXiv, a fully automated live benchmark that generates high-quality VQA and TQA pairs to avoid the test-data contamination that affects static benchmarks.
- It uses the DeepSearch toolkit for document parsing and GPT-4o for question generation, yielding diverse and precise multi-modal evaluation data.
- Evaluations of 17 large multi-modal models, both open-sourced and proprietary, reveal clear performance gaps and underscore LiveXiv's potential as a contamination-free standard for model assessment.
Overview of "LiveXiv - A Multi-Modal Live Benchmark Based on Arxiv Papers Content"
The paper introduces LiveXiv, a dynamic and automated multi-modal benchmark constructed from the content of scientific papers on the ArXiv platform. The proposed benchmark is a response to the challenges posed by traditional static benchmarks, which can suffer from test data contamination and fail to reflect the true capabilities of Large Multi-modal Models (LMMs). LiveXiv addresses these issues by continuously updating and expanding its dataset with new, uncontaminated data sourced directly from ArXiv, ensuring robust evaluation of LMMs without the interference of training data overlap.
Methodology
The creation of LiveXiv is fully automated, eliminating the need for human intervention. This process begins with the collection of scientific manuscripts from ArXiv, specifically those that have non-exclusive licenses for distribution, spanning multiple domains such as computer science, electrical engineering, and quantitative biology. The key steps involved are:
- Data Acquisition and Pre-processing: The DeepSearch toolkit parses each paper's content to extract figures, tables, and their associated metadata, which form the foundation for generating question-answer pairs.
- VQA and TQA Generation: Visual Question Answering (VQA) and Table Question Answering (TQA) pairs are generated with a multi-modal model, GPT-4o. The model first produces a detailed description of each extracted figure or table, and questions are then crafted from that description so that answering them requires genuine multi-modal reasoning rather than a simple text lookup (a sketch of this two-stage prompting appears after this list).
- Filtering: To keep the generated questions accurate and challenging, an extensive filtering protocol is applied. Questions are blind-tested on LMMs without the visual context to discard those answerable from text alone, and answers are verified through consensus among multiple models, reducing bias and error (see the filtering sketch below).
- Efficient Evaluation: Because the benchmark evolves continuously, repeatedly re-evaluating every model in full would be costly. Inspired by Item Response Theory (IRT), the paper proposes a method to approximate model performance on new dataset versions without exhaustive re-evaluation, sharply reducing evaluation time and cost (a minimal IRT sketch appears below).
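To make the VQA/TQA generation step concrete, here is a minimal sketch of the two-stage prompting it describes, assuming the OpenAI Python client. The prompts and helper names are illustrative, not the paper's exact pipeline, which also handles batching, answer options, and post-processing not shown here.

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_figure(image_path: str) -> str:
    """Stage 1: ask GPT-4o for a detailed description of an extracted figure."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this scientific figure in detail, "
                                         "including axes, legends, and notable trends."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def generate_questions(description: str, caption: str) -> str:
    """Stage 2: turn the description (plus the paper's caption) into multiple-choice
    questions that should require inspecting the figure, not just reading the text."""
    prompt = (
        "Based on the figure description and caption below, write multiple-choice "
        "questions with four options each that can only be answered by inspecting "
        "the figure.\n\n"
        f"Description: {description}\nCaption: {caption}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Generating questions from the intermediate description rather than the raw image keeps them anchored to content the model actually perceived, which is the point of the two-stage design.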
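The filtering step can be sketched as two predicates: a blind test that rejects questions answerable without the visual, and a consensus check on the reference answer. The `ask_fn` and `ask_vqa_fn` helpers are hypothetical stand-ins for whatever inference wrappers are used, and the thresholds are illustrative rather than the paper's exact settings.

```python
from collections import Counter

def blind_filter(question, options, answer, text_only_models, ask_fn):
    """Keep a question only if models CANNOT answer it without the figure/table.

    `ask_fn(model, prompt)` is a hypothetical helper returning the model's
    chosen option letter from a text-only prompt.
    """
    prompt = f"{question}\nOptions: {options}\nAnswer with a single letter."
    blind_correct = sum(ask_fn(m, prompt) == answer for m in text_only_models)
    return blind_correct < len(text_only_models) / 2

def consensus_filter(question, image, options, answer, verifier_models, ask_vqa_fn):
    """Keep a question only if independent models agree on the reference answer."""
    votes = Counter(ask_vqa_fn(m, question, image, options) for m in verifier_models)
    majority_answer, majority_count = votes.most_common(1)[0]
    return majority_answer == answer and majority_count >= 2
```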
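For the IRT-inspired efficient evaluation, the sketch below uses a two-parameter logistic (2PL) model, assuming item parameters `a` (discrimination) and `b` (difficulty) are estimated elsewhere: a model's ability is fit from its responses on a small anchor set, and its accuracy on a new benchmark version is then approximated without re-running inference. This is a minimal illustration of the idea, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def irt_prob(theta, a, b):
    """2PL model: probability that a model of ability `theta` answers an item
    with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b):
    """Fit a model's ability from 0/1 responses on a small anchor item set
    by maximizing the Bernoulli log-likelihood."""
    def neg_log_lik(theta):
        p = np.clip(irt_prob(theta, a, b), 1e-6, 1 - 1e-6)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

def predict_accuracy(theta, a_new, b_new):
    """Approximate accuracy on a new benchmark version from the fitted ability
    and the new items' parameters, without re-running the model."""
    return irt_prob(theta, a_new, b_new).mean()
```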
Results and Analysis
The paper demonstrates the benchmark's effectiveness by evaluating 17 LMMs, both open-sourced and proprietary, across two task categories: VQA and TQA. The results reveal significant insights into the models' performance, highlighting strengths and weaknesses across various scientific domains. Notably, models such as Claude-Sonnet achieve higher accuracy, suggesting specific architectural and methodological advantages.
The analysis of performance across visual and textual question types points out critical challenges for LMMs, such as difficulty with arithmetic reasoning and certain types of visual content. The small performance gap between the automatically generated data and a manually verified subset underscores the robustness of LiveXiv's generation and filtering pipeline.
Implications and Future Directions
LiveXiv sets a precedent for the future development of AI benchmarks by illustrating the feasibility and advantages of a live, evolving dataset free from contamination. This approach not only ensures more reliable evaluation of LMMs but also paves the way for ongoing improvement by continuously challenging models with fresh, domain-specific knowledge.
Moving forward, LiveXiv can be expanded to include other scientific archives, such as BioRxiv, broadening its domain scope. Furthermore, its efficient evaluation methodology can be applied to other dynamic datasets, promoting a more efficient setup for model assessment as the landscape of AI continues to evolve. This paper thus contributes a significant tool for refining the development and benchmarking of LMMs, offering a scalable and contamination-free alternative to traditional static benchmarks.