Benchmarking Data Science Agents (2402.17168v1)

Published 27 Feb 2024 in cs.AI and cs.CL

Abstract: In the era of data-driven decision-making, the complexity of data analysis demands advanced data science expertise and tools, presenting significant challenges even for specialists. LLMs have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical processes. In this paper, we introduce DSEval -- a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.
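The abstract describes evaluating agents by checking their work across data science tasks. A minimal sketch of what such a state-based check might look like, assuming a harness that executes agent-generated code and compares named variables against references (`run_and_validate` and its signature are illustrative, not the paper's actual API):

```python
# Hypothetical sketch of a DSEval-style validation step: execute
# agent-generated code in a shared namespace, then compare named
# result variables against reference values.
import pandas as pd

def run_and_validate(agent_code: str, namespace: dict, expected: dict) -> bool:
    """Execute agent code in-place and compare named variables to references."""
    exec(agent_code, namespace)
    for name, ref in expected.items():
        got = namespace.get(name)
        if isinstance(ref, pd.DataFrame):
            if not isinstance(got, pd.DataFrame) or not got.equals(ref):
                return False
        elif got != ref:
            return False
    return True

# Toy lifecycle step: the "agent" computes a per-group mean.
df = pd.DataFrame({"city": ["A", "A", "B"], "temp": [20, 22, 30]})
agent_code = "means = df.groupby('city')['temp'].mean().to_dict()"
ok = run_and_validate(agent_code, {"df": df}, {"means": {"A": 21.0, "B": 30.0}})
print(ok)  # True
```

Checking resulting variable state rather than code text lets an evaluator accept any correct solution, which fits the paper's emphasis on covering the whole analytical process rather than a single canonical answer.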

Authors (6)
  1. Yuge Zhang
  2. Qiyang Jiang
  3. Xingyu Han
  4. Nan Chen
  5. Yuqing Yang
  6. Kan Ren
