Evaluation and Understanding of Large Foundation Models with Eureka
In the rapidly advancing domain of Large Foundation Models (LFMs), the evaluation process has become increasingly intricate due to issues like benchmark saturation, lack of transparency in the methods used for assessment, and the extensive variety of capabilities that need to be evaluated for a holistic comparison. The paper “Eureka: Evaluating and Understanding Large Foundation Models” by Balachandran et al. addresses these challenges by introducing a comprehensive evaluation framework and benchmark suite specifically designed for rigorous and reproducible assessment of LFMs.
Contributions of the Paper
The authors make three primary contributions:
- Eureka Framework: An open-source, reusable evaluation framework that provides standardized evaluations of LFMs, moving beyond simplistic single-score reports and leaderboards. Eureka allows for a flexible composition of evaluation pipelines, incorporating components like data preprocessing, prompt templates, model inference, data postprocessing, metric computation, and reporting.
- Eureka-Bench: A diverse collection of benchmarks targeting areas where LFMs struggle and capabilities that have been overlooked in traditional evaluations, spanning both language and vision modalities. The benchmarks cover geometric reasoning, multimodal question answering, spatial reasoning, object detection, and a host of language tasks like instruction following, information retrieval, long-context reasoning, and toxicity detection.
- Granular Analysis and Insights: Using Eureka and Eureka-Bench, the authors conduct an in-depth analysis of 12 state-of-the-art models. This includes disaggregating measurements across important experimental conditions and subcategories of data, offering detailed insights into specific weaknesses and strengths of the models.
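To make the notion of disaggregated measurement concrete, the following is a minimal sketch of the general idea rather than Eureka's actual API; the function name, record format, and field names are assumptions introduced for illustration.

```python
from collections import defaultdict

def disaggregated_accuracy(records, group_key):
    """Accuracy overall and per subgroup (e.g., instruction type, context
    length, or demographic group). Each record is a dict with a boolean
    'is_correct' field and a grouping field named by `group_key`."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        correct[g] += int(r["is_correct"])
    per_group = {g: correct[g] / totals[g] for g in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_group

# Hypothetical usage: scores broken down by instruction type.
records = [
    {"is_correct": True,  "category": "length_constraint"},
    {"is_correct": False, "category": "length_constraint"},
    {"is_correct": True,  "category": "keyword"},
]
overall, by_category = disaggregated_accuracy(records, "category")
# overall == 2/3; by_category == {"length_constraint": 0.5, "keyword": 1.0}
```

Reporting the per-group numbers alongside the single aggregate is what surfaces the condition-specific weaknesses discussed in the results below.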
Methodological and Practical Implications
Evaluation Framework
In the context of complex, generative capabilities, traditional fixed, closed-form metric definitions are insufficient. Eureka addresses this with a library for creating shareable evaluation pipelines, enabling transparent and reproducible experiments. This framework supports both offline and online evaluations and allows practitioners to maintain detailed logs of each experiment for subsequent analysis.
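As an illustration of what such a composable pipeline can look like, here is a minimal sketch under assumed interfaces; the class name, stage names, and signatures are hypothetical and do not reflect Eureka's actual code, but they mirror the components listed above (data preprocessing, prompt templates, model inference, data postprocessing, metric computation, and reporting).

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class EvalPipeline:
    preprocess: Callable    # raw example -> cleaned example
    make_prompt: Callable   # example -> prompt string (prompt template)
    infer: Callable         # prompt -> raw model output
    postprocess: Callable   # raw output -> parsed answer
    score: Callable         # (parsed answer, example) -> per-example metric
    report: Callable        # list of per-example results -> summary dict

    def run(self, dataset: Iterable[dict]) -> dict:
        results = []
        for raw in dataset:
            example = self.preprocess(raw)
            output = self.infer(self.make_prompt(example))
            answer = self.postprocess(output)
            results.append({"example": example, "score": self.score(answer, example)})
        # Per-example results can be logged before summarization for later analysis.
        return self.report(results)
```

Because each stage is an interchangeable callable, the same dataset and metric can be re-run against a different model or prompt template, and the per-example logs support the kind of post-hoc, disaggregated analysis described above.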
Benchmark Selection
The benchmarks included in Eureka-Bench are selected to avoid the pitfalls of benchmark saturation. The benchmarks either remain challenging for most current models or test fundamental capabilities that are often overlooked. This judicious selection ensures that the benchmarks provide meaningful insights into the models' capabilities and uncover areas where significant improvements are needed.
Results Summary
The paper’s analysis reveals several key findings across both multimodal and language evaluations:
Multimodal Evaluation
- Geometric Reasoning (GeoMeter): Models like Claude 3.5 Sonnet and Gemini 1.5 Pro perform comparatively well on depth and height reasoning tasks, but overall accuracy remains modest (~50%), indicating that geometric reasoning is still a broad challenge requiring further advances.
- Object Recognition, Detection, and Spatial Reasoning: While models show competence in basic object recognition, they struggle with detailed image understanding tasks like object detection and spatial reasoning.
- Multimodal Question Answering (MMMU): Models like GPT-4o 2024-05-13 and Claude 3.5 Sonnet perform well, yet overall accuracy remains around the mid-60% range, highlighting the room for growth in integrating multimodal information.
Language Evaluation
- Instruction Following (IFEval): Most advanced models exceed 75% accuracy in instruction following. However, performance varies significantly across instruction types, with length constraints and keyword instructions being particularly challenging (see the checker sketch after this list).
- Long Context QA (FlenQA): Models exhibit performance degradation as context length grows, with Llama 3.1 405B and GPT-4o 2024-05-13 showing the least drop, indicating better long-context handling.
- Information Retrieval (Kitab): Current models exhibit limited ability to generate factually accurate long-form outputs, especially when the output must satisfy user-specified constraints. The best models achieve less than a 55% constraint satisfaction rate without context.
- Toxicity Detection and Safe Language Generation (Toxigen): Models like GPT-4o 2024-05-13 balance high toxicity detection accuracy with low generative toxicity. However, significant demographic discrepancies in toxicity detection accuracy persist across several models.
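Instruction-following benchmarks of this kind rely on programmatically verifiable constraints. The sketch below shows two illustrative checkers for the instruction types called out as difficult above; the function names and signatures are assumptions, not IFEval's implementation.

```python
import re

def check_max_words(response: str, max_words: int) -> bool:
    """Length-constraint check: response must not exceed `max_words` words."""
    return len(response.split()) <= max_words

def check_keywords(response: str, keywords: list[str]) -> bool:
    """Keyword check: every required keyword must appear in the response."""
    lowered = response.lower()
    return all(re.search(r"\b" + re.escape(k.lower()) + r"\b", lowered) for k in keywords)

# Hypothetical example: grading a response against two verifiable instructions.
response = "Paris is the capital of France."
print(check_max_words(response, 10))                   # True
print(check_keywords(response, ["Paris", "capital"]))  # True
```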
Non-Determinism and Backward Compatibility
The analysis also includes an investigation into the non-determinism of model outputs, finding that models like Gemini 1.5 Pro and GPT-4 Vision Preview exhibit high non-determinism, whereas others like Llama 3.1 70B and Mistral Large 2407 are relatively deterministic. Additionally, a backward-compatibility analysis within model families reveals substantial regressions across updates, posing challenges for consistent model deployment in practical applications.
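Both phenomena can be quantified with simple counts over repeated runs and paired model versions. The sketch below is one possible formulation; the function names and exact definitions are assumptions rather than the paper's metrics.

```python
def non_determinism_rate(generate, prompt: str, n_runs: int = 5) -> float:
    """Fraction of repeated generations that differ from the first run,
    holding the prompt (and, where possible, decoding settings) fixed."""
    outputs = [generate(prompt) for _ in range(n_runs)]
    return sum(o != outputs[0] for o in outputs[1:]) / (n_runs - 1)

def backward_incompatibility(old_correct: list[bool], new_correct: list[bool]) -> float:
    """Fraction of examples the older model answered correctly that the
    newer model now gets wrong, i.e., regressions introduced by an update."""
    regressions = sum(o and not n for o, n in zip(old_correct, new_correct))
    solved_before = sum(old_correct)
    return regressions / solved_before if solved_before else 0.0
```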
Conclusion
Eureka and its benchmark suite, Eureka-Bench, present a structured and comprehensive approach to evaluating LFMs. The detailed disaggregated insights provided by the framework are crucial for understanding the specific strengths and weaknesses of various models, guiding future developments in AI. The transparent, reproducible nature of Eureka aims to foster collaboration in the AI community, pushing the boundaries of what LFMs can achieve while ensuring consistent, reliable assessments. Future work will likely expand Eureka-Bench to include more diverse and challenging benchmarks, particularly in multilingual, safety, and real-world interactive settings, addressing the evolving landscape of AI capabilities.