Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents (2403.05307v1)

Published 8 Mar 2024 in cs.AI

Abstract: Interactive Data Analysis, the collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of LLM agents in this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark to evaluate LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed with an economical multi-agent environment, Decision Company, with little human effort. We evaluate popular and advanced LLM agents in Tapilot-Crossing, which underscores the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.5%.

Authors (10)
  1. Jinyang Li
  2. Nan Huo
  3. Yan Gao
  4. Jiayi Shi
  5. Yingxiu Zhao
  6. Ge Qu
  7. Yurong Wu
  8. Chenhao Ma
  9. Jian-Guang Lou
  10. Reynold Cheng