
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (2401.05507v3)

Published 10 Jan 2024 in cs.CL and cs.AI

Abstract: In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to solve complex tasks end-to-end by interacting with an execution environment. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files, and an agent framework which incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique to convert each question into a closed-form format so that they can be automatically evaluated. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent .

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

The paper "InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks" presents a novel benchmark specifically designed to assess the capabilities of LLM-based agents (LLM-based agents) on tasks that involve comprehensive data analysis. This benchmark, InfiAgent-DABench, is distinctive for its focus on real-world data analysis challenges that require agents to interact with execution environments in an end-to-end manner.

Key Components of InfiAgent-DABench

InfiAgent-DABench consists of two main components:

  1. DAEval Dataset: This dataset comprises 257 data analysis questions derived from 52 CSV files. The questions are transformed into a closed-form format to allow automatic evaluation. The dataset was generated by crawling CSV files from GitHub and using GPT-4 to create open-ended questions, which were then converted to closed-form using a format-prompting technique.
  2. Agent Framework: An adaptable agent framework is provided to support LLMs in performing data analysis tasks. The framework enables an agent to plan, write code, execute Python scripts, and derive conclusions, following the ReAct (synergizing reasoning and acting) mechanism; a minimal sketch of such a loop is shown after this list.
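
To make the framework concrete, the following is a minimal, illustrative sketch of a ReAct-style plan/code/execute/observe loop of the kind described above. It is not the authors' DAAgent implementation: query_llm is a hypothetical stand-in for whichever LLM backend is plugged in, and the prompt and reply conventions are assumptions made for this sketch.

```python
import io
import contextlib

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns either Python code to run or 'FINAL: <answer>'."""
    raise NotImplementedError("plug in an actual LLM client here")

def run_python(code: str) -> str:
    """Execute generated code and capture stdout as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__agent__"})
    except Exception as exc:  # surface errors so the agent can self-correct
        return f"Error: {exc!r}"
    return buffer.getvalue()

def react_loop(question: str, csv_path: str, max_steps: int = 5) -> str:
    """Plan / act / observe until the model emits a final answer or the budget runs out."""
    transcript = f"Question: {question}\nData file: {csv_path}\n"
    for _ in range(max_steps):
        reply = query_llm(transcript + "\nRespond with Python code to run, or 'FINAL: <answer>'.")
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        observation = run_python(reply)
        transcript += f"\nCode:\n{reply}\nObservation:\n{observation}\n"
    return "No answer within the step budget."
```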

Methodology

The creators of InfiAgent-DABench adopted a meticulous methodology for constructing the DAEval dataset. They performed comprehensive human assessments to ensure the quality and accuracy of the dataset. Real-world CSV files served as the foundation for question generation, with key concepts identified through expert interviews guiding the nature of the questions. These questions were then converted to a closed-form format with precise constraints and answer formats, enabling straightforward evaluation without the need for subjective interpretation.
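
As an illustration of this closed-form conversion, a question produced this way bundles the question text with explicit constraints and a machine-checkable answer format. The template and the @name[value] answer convention below are a sketch of the idea, not the paper's exact prompt or dataset content.

```python
# Illustrative only: the field wording and the "@name[value]" convention are
# assumptions, not the exact template used to build DAEval.
CLOSED_FORM_TEMPLATE = """\
Question: {question}
Constraints: {constraints}
Answer format: write each result on its own line as @{{name}}[{{value}}]
"""

prompt = CLOSED_FORM_TEMPLATE.format(
    question="Is there a correlation between passenger age and ticket fare?",
    constraints="Use Pearson correlation on rows with no missing values; round to two decimals.",
)
print(prompt)
# A compliant agent reply such as "@correlation[0.10]" can then be checked by
# string comparison, which is what makes the evaluation automatic.
```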

Evaluation Process

The benchmarking involved testing 34 state-of-the-art LLMs, revealing the limitations and challenges these models face in data analysis tasks. Particularly noteworthy is the development of DAAgent, a specialized data analysis agent built on top of the presented framework, which outperformed GPT-3.5 by 3.9% on the DAEval dataset. This improvement is attributed to a purpose-built instruction-tuning dataset, DAInstruct, which aligns model training with practical data analysis tasks.
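
Because answers are constrained to a fixed format, grading reduces to parsing and comparing strings. The snippet below is a minimal sketch of such a checker, again assuming the hypothetical @name[value] convention from the earlier example rather than the benchmark's actual grading script.

```python
import re

# Assumed answer convention: "@name[value]" pairs, one per required answer.
ANSWER_RE = re.compile(r"@(\w+)\[([^\]]*)\]")

def parse_answers(text: str) -> dict:
    """Extract {name: value} pairs from a model response or a reference answer."""
    return {name: value.strip() for name, value in ANSWER_RE.findall(text)}

def is_correct(prediction: str, reference: str) -> bool:
    """Exact match on every required answer field."""
    pred, ref = parse_answers(prediction), parse_answers(reference)
    return all(pred.get(name) == value for name, value in ref.items())

print(is_correct("@correlation[0.10]", "@correlation[0.10]"))  # True
print(is_correct("@correlation[0.12]", "@correlation[0.10]"))  # False
```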

Numerical Results and Findings

A critical finding from the benchmarking is that current models, even the highly capable GPT-4, still leave significant room for improvement. GPT-4 achieved the leading accuracy of 78.99%, which nevertheless shows that real-world data analysis tasks pose non-trivial challenges for LLMs. This gap points to the need for further advancements in LLMs tailored for data analysis. Furthermore, DAAgent's advantage over GPT-3.5 underlines the efficacy of targeted instruction tuning.

Implications and Future Directions

The InfiAgent-DABench benchmark is poised to play a pivotal role in assessing the advancements of LLM-based agents in real-world data analysis applications. Its closed-form question format ensures objectivity and precision in evaluation, providing a reliable measure of an agent's capabilities. The introduction of a bespoke agent, DAAgent, additionally sets a precedent for the development of similarly specialized agents across other domains.

The research has significant implications for both theoretical and practical applications of AI in data analysis. As agents powered by LLMs become increasingly integral to decision-making processes across industries, benchmarks like InfiAgent-DABench offer a crucial tool for progressing toward more reliable and capable AI systems. Future research might extend the scope of such benchmarks to more complex data interactions and multimodal data sources, and explore novel architectures or training paradigms that could enhance the data analysis capabilities of LLMs.

Overall, the paper presents a thorough and methodologically robust benchmark that addresses a critical gap in the evaluation of LLM-based agents for practical data analysis, offering a structured and reproducible framework for future explorations in this burgeoning area of AI research.

Authors (17)
  1. Xueyu Hu
  2. Ziyu Zhao
  3. Shuang Wei
  4. Ziwei Chai
  5. Guoyin Wang
  6. Xuwu Wang
  7. Jing Su
  8. Jingjing Xu
  9. Ming Zhu
  10. Yao Cheng
  11. Jianbo Yuan
  12. Kun Kuang
  13. Yang Yang
  14. Hongxia Yang
  15. Fei Wu
  16. Qianli Ma
  17. Jiwei Li