Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents (2403.05307v1)
Abstract: Interactive Data Analysis, the collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of LLM agents on this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark to evaluate LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed by an economical multi-agent environment, Decision Company, with little human effort. We evaluate popular and advanced LLM agents on Tapilot-Crossing, and the results underscore the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.5%.
Authors: Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Yurong Wu, Chenhao Ma, Jian-Guang Lou, Reynold Cheng