DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2409.07703v1)

Published 12 Sep 2024 in cs.AI and cs.CL

Abstract: Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.

Citations (3)

Summary

  • The paper introduces DSBench, a benchmark with 466 analysis tasks and 74 modeling tasks that mirror complex real-world data science challenges.
  • The paper presents the Relative Performance Gap (RPG) metric to standardize evaluation across diverse tasks, offering a unified framework.
  • The analysis of state-of-the-art models, including GPT-4, reveals these agents achieve only around 34% task success, underscoring areas for improvement.

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

The paper "DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?" by Liqiang Jing et al., introduces DSBench, a meticulously designed benchmark to evaluate the proficiency of data science agents in performing realistic data science tasks. The authors identify deficiencies in existing benchmarks and attempt to address these limitations by introducing tasks that more accurately reflect the complexities encountered in real-world data science endeavors.

Key Contributions

  1. Comprehensive Benchmark Design:
    • DSBench comprises 466 data analysis tasks and 74 data modeling tasks sourced from Eloquence and Kaggle competitions. The realistic scenarios incorporated in these tasks include handling long contexts, multimodal data, reasoning with large files and multi-table structures, and performing end-to-end data modeling tasks.
  2. Introduction of Relative Performance Gap (RPG):
    • The authors propose RPG to normalize performance across diverse data modeling tasks, providing a unified evaluation framework (a sketch of such a normalization appears after this list).
  3. Evaluation and Analysis:
    • A detailed evaluation of state-of-the-art LLMs, LVLMs, and advanced agents shows that current models struggle significantly with the benchmark tasks. For instance, the best-performing agent only solves 34.12% of data analysis tasks and achieves a 34.74% RPG.
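
To make the RPG idea concrete, here is a minimal sketch of one plausible per-task normalization, assuming each modeling task comes with a baseline score and the competition-best (winning) score; the function name, arguments, and the exact normalization are illustrative assumptions, not the paper's definition.

```python
# A minimal, hedged sketch: one plausible way to normalize an agent's raw
# modeling score into a relative-performance value in [0, 1], assuming each
# task provides a baseline score and the competition-best (winning) score.
# The paper's exact RPG definition may differ; treat this as illustrative.

def relative_performance(agent_score: float,
                         baseline_score: float,
                         best_score: float,
                         higher_is_better: bool = True) -> float:
    """Map a raw task metric onto a 0-1 scale between baseline and best."""
    if not higher_is_better:
        # Flip signs so that larger always means better before normalizing.
        agent_score, baseline_score, best_score = (
            -agent_score, -baseline_score, -best_score)
    denom = best_score - baseline_score
    if denom == 0:
        # Degenerate task: the baseline already matches the best score.
        return 0.0
    return max(0.0, min(1.0, (agent_score - baseline_score) / denom))


# Example: an agent reaching 0.82 AUC on a task whose public baseline is
# 0.70 and whose winning submission scored 0.91.
print(relative_performance(0.82, 0.70, 0.91))  # ~0.571
```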

Task Categories and Formulation

Data Analysis Tasks:

  • These tasks involve answering questions that require a deep understanding of the provided data and the context of the questions. The dataset includes modalities such as text, images, and spreadsheets, making it more challenging.
  • Evaluation: Metrics include task-level accuracy and competition-level accuracy, using a semantic comparison function to match predicted answers against ground-truth answers (see the scoring sketch below).
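
The following is a hedged sketch of how these two accuracy views could be computed, with a hypothetical is_match(pred, gold) helper standing in for the paper's semantic comparison function; the data layout is an assumption for illustration.

```python
# Hedged sketch of task-level vs. competition-level accuracy. is_match is a
# hypothetical comparison callable standing in for the semantic comparison
# function mentioned above; any matcher with this signature works.
from collections import defaultdict

def score(results, is_match):
    """results: list of (competition_id, predicted_answer, gold_answer)."""
    per_comp = defaultdict(list)
    for comp_id, pred, gold in results:
        per_comp[comp_id].append(is_match(pred, gold))
    total = sum(len(v) for v in per_comp.values())
    # Task-level accuracy: every question counts equally.
    task_acc = sum(sum(v) for v in per_comp.values()) / total
    # Competition-level accuracy: average the per-competition accuracies,
    # so small and large competitions carry equal weight.
    comp_acc = sum(sum(v) / len(v) for v in per_comp.values()) / len(per_comp)
    return task_acc, comp_acc


# Demo with naive exact matching standing in for semantic comparison.
demo = [("c1", "42", "42"), ("c1", "yes", "no"), ("c2", "3.5", "3.5")]
print(score(demo, lambda p, g: p.strip() == g.strip()))  # (0.666..., 0.75)
```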

Data Modeling Tasks:

  • These tasks assess the agent's ability to create machine learning models that can generalize and predict outcomes on unseen data.
  • Evaluation: Metrics include Task Success Rate and RPG, reflecting, respectively, whether the agent finishes a task without errors and how well its generated models perform (see the aggregation sketch below).
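
Below is a minimal sketch of aggregating these two modeling metrics across tasks, assuming per-task records of a success flag plus a per-task normalized score such as the one sketched under Key Contributions; the record format and the zero-fill for failed runs are assumptions, not the paper's code.

```python
# Hedged sketch: aggregate Task Success Rate and a mean normalized score
# over per-task records of the form (succeeded, rpg_score).

def aggregate_modeling_metrics(records):
    """records: list of (succeeded: bool, rpg_score: float or None)."""
    n = len(records)
    task_success_rate = sum(1 for ok, _ in records if ok) / n
    # Failed runs contribute a score of 0 so that crashing is never rewarded
    # relative to producing a weak but valid submission.
    mean_rpg = sum((s if ok and s is not None else 0.0)
                   for ok, s in records) / n
    return task_success_rate, mean_rpg


demo = [(True, 0.62), (True, 0.40), (False, None), (True, 0.05)]
print(aggregate_modeling_metrics(demo))  # (0.75, 0.2675)
```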

Numerical Results and Implications

The numerical results highlight a significant gap between the current capabilities of data science agents and expert human performance. Agents powered by leading models such as GPT-4 perform better than others but still fall short on most tasks. For instance, GPT-4 within the AutoGen framework achieved a task-level accuracy of 34.12% on data analysis tasks and a 34.74% RPG on data modeling tasks.

Error Types:

  1. Misinterpretation of Data
  2. Inadequate Data Identification
  3. Lack of Problem-Solving Strategy

These errors point towards areas where future improvements in data science agents can be targeted, particularly in enhancing the understanding and context analysis of the data.

Future Directions

Interactivity and Tool Use:

  • The integration of more sophisticated tools and interactive environments is crucial. Enhanced interactivity in frameworks like AutoGen shows promise but also underscores the need for improved integration and context-awareness.

Advanced Reasoning and Context Handling:

  • Future models must improve on handling long-context multimodal data more effectively. This will enable agents to perform complex data manipulations and model creations more consistently.

Practical Applications:

  • Improving data science agents based on insights from DSBench could significantly benefit various sectors, including finance, healthcare, and retail, where data-driven decision-making is paramount.

Theoretical Implications:

  • The benchmark sets a new standard for evaluating the holistic performance of data science agents, promoting a shift from simplistic benchmarks to more complex, real-world scenarios that push the boundaries of current AI research.

Conclusion

DSBench presents a critical step towards realistic evaluation of data science agents, covering a range of scenarios that true experts face. The insights drawn from this benchmark are invaluable for guiding future developments in creating more capable, autonomous data science agents. The proposed RPG metric and the comprehensive evaluation methodology provide a robust framework for assessing the progress of AI in tackling complex, real-world data science problems.
