
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (2401.05507v3)

Published 10 Jan 2024 in cs.CL and cs.AI

Abstract: In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex tasks end-to-end by interacting with an execution environment. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files, and an agent framework which incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique to convert each question into a closed-form format so that they can be automatically evaluated. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent .

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

The paper "InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks" presents a novel benchmark specifically designed to assess the capabilities of LLM-based agents (LLM-based agents) on tasks that involve comprehensive data analysis. This benchmark, InfiAgent-DABench, is distinctive for its focus on real-world data analysis challenges that require agents to interact with execution environments in an end-to-end manner.

Key Components of InfiAgent-DABench

InfiAgent-DABench consists of two main components:

  1. DAEval Dataset: This dataset comprises 257 data analysis questions derived from 52 CSV files. The questions are transformed into a closed-form format to allow automatic evaluation. The dataset was generated by crawling CSV files from GitHub and using GPT-4 to create open-ended questions, which were then converted to closed-form using a format-prompting technique.
  2. Agent Framework: An adaptable agent framework is provided to support LLMs in performing data analysis tasks. The framework enables an agent to plan, write code, execute Python scripts, and derive conclusions, following the ReAct (synergizing reasoning and acting) mechanism; a minimal sketch of such a loop is shown after this list.
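
To make the framework concrete, the following is a minimal, illustrative sketch of a ReAct-style plan/code/execute/observe loop of the kind described above. It is not the authors' DAAgent implementation: query_llm is a hypothetical stand-in for whichever LLM backend is plugged in, and the prompt and reply conventions are assumptions made for this sketch.

```python
import io
import contextlib

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns either Python code to run or 'FINAL: <answer>'."""
    raise NotImplementedError("plug in an actual LLM client here")

def run_python(code: str) -> str:
    """Execute generated code and capture stdout as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__agent__"})
    except Exception as exc:  # surface errors so the agent can self-correct
        return f"Error: {exc!r}"
    return buffer.getvalue()

def react_loop(question: str, csv_path: str, max_steps: int = 5) -> str:
    """Plan / act / observe until the model emits a final answer or the budget runs out."""
    transcript = f"Question: {question}\nData file: {csv_path}\n"
    for _ in range(max_steps):
        reply = query_llm(transcript + "\nRespond with Python code to run, or 'FINAL: <answer>'.")
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        observation = run_python(reply)
        transcript += f"\nCode:\n{reply}\nObservation:\n{observation}\n"
    return "No answer within the step budget."
```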

Methodology

The creators of InfiAgent-DABench adopted a meticulous methodology for constructing the DAEval dataset. They performed comprehensive human assessments to ensure the quality and accuracy of the dataset. Real-world CSV files served as the foundation for question generation, with key concepts identified through expert interviews guiding the nature of the questions. These questions were then converted to a closed-form format with precise constraints and answer formats, enabling straightforward evaluation without the need for subjective interpretation.
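
As an illustration of this closed-form conversion, a question produced this way bundles the question text with explicit constraints and a machine-checkable answer format. The template and the @name[value] answer convention below are a sketch of the idea, not the paper's exact prompt or dataset content.

```python
# Illustrative only: the field wording and the "@name[value]" convention are
# assumptions, not the exact template used to build DAEval.
CLOSED_FORM_TEMPLATE = """\
Question: {question}
Constraints: {constraints}
Answer format: write each result on its own line as @{{name}}[{{value}}]
"""

prompt = CLOSED_FORM_TEMPLATE.format(
    question="Is there a correlation between passenger age and ticket fare?",
    constraints="Use Pearson correlation on rows with no missing values; round to two decimals.",
)
print(prompt)
# A compliant agent reply such as "@correlation[0.10]" can then be checked by
# string comparison, which is what makes the evaluation automatic.
```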

Evaluation Process

The benchmarking involved testing 34 state-of-the-art LLMs, revealing the limitations and challenges these models face in data analysis tasks. Particularly noteworthy is the development of DAAgent, a specialized data analysis agent built on top of the presented framework, which outperformed GPT-3.5 by 3.9% on the DAEval dataset. This improvement is attributed to a purpose-built instruction-tuning dataset, DAInstruct, which aligns model training with practical data analysis tasks.
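
Because answers are constrained to a fixed format, grading reduces to parsing and comparing strings. The snippet below is a minimal sketch of such a checker, again assuming the hypothetical @name[value] convention from the earlier example rather than the benchmark's actual grading script.

```python
import re

# Assumed answer convention: "@name[value]" pairs, one per required answer.
ANSWER_RE = re.compile(r"@(\w+)\[([^\]]*)\]")

def parse_answers(text: str) -> dict:
    """Extract {name: value} pairs from a model response or a reference answer."""
    return {name: value.strip() for name, value in ANSWER_RE.findall(text)}

def is_correct(prediction: str, reference: str) -> bool:
    """Exact match on every required answer field."""
    pred, ref = parse_answers(prediction), parse_answers(reference)
    return all(pred.get(name) == value for name, value in ref.items())

print(is_correct("@correlation[0.10]", "@correlation[0.10]"))  # True
print(is_correct("@correlation[0.12]", "@correlation[0.10]"))  # False
```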

Numerical Results and Findings

A critical finding from the benchmarking is that current models, even the highly capable GPT-4, still leave significant room for improvement. GPT-4 achieved the leading accuracy of 78.99%, which nevertheless shows that real-world data analysis tasks pose non-trivial challenges for LLMs. This gap points to the need for further advancements in LLMs tailored for data analysis. Furthermore, DAAgent's advantage over GPT-3.5 underlines the efficacy of targeted instruction tuning.

Implications and Future Directions

The InfiAgent-DABench benchmark is poised to play a pivotal role in assessing the advancements of LLM-based agents in real-world data analysis applications. Its closed-form question format ensures objectivity and precision in evaluation, providing a reliable measure of an agent's capabilities. The introduction of a bespoke agent, DAAgent, additionally sets a precedent for the development of similarly specialized agents across other domains.

The research has significant implications for both theoretical and practical applications of AI in data analysis. As agents powered by LLMs become increasingly integral to decision-making processes across industries, benchmarks like InfiAgent-DABench offer a crucial tool for progressing toward more reliable and capable AI systems. Future research might extend the scope of such benchmarks to more complex data interactions and multimodal data sources, and explore novel architectures or training paradigms that could enhance the data analysis capabilities of LLMs.

Overall, the paper presents a thorough and methodologically robust benchmark that addresses a critical gap in the evaluation of LLM-based agents for practical data analysis, offering a structured and reproducible framework for future explorations in this burgeoning area of AI research.

Authors (17)
  1. Xueyu Hu
  2. Ziyu Zhao
  3. Shuang Wei
  4. Ziwei Chai
  5. Guoyin Wang
  6. Xuwu Wang
  7. Jing Su
  8. Jingjing Xu
  9. Ming Zhu
  10. Yao Cheng
  11. Jianbo Yuan
  12. Kun Kuang
  13. Yang Yang
  14. Hongxia Yang
  15. Fei Wu
  16. Qianli Ma
  17. Jiwei Li