Capture the Flag: Uncovering Data Insights with Large Language Models (2312.13876v1)
Abstract: Extracting a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skill, domain expertise, and human labor. This study explores the potential of using large language models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code-generation techniques. We propose a new evaluation methodology based on a "capture the flag" principle, which measures the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to warrant future exploration by the community.
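The "capture the flag" evaluation described above can be sketched in a few lines: given a set of ground-truth insights (flags) and an agent's free-text findings, count the fraction of flags the agent captured. The names below (`Flag`, `score_flags`) and the keyword-matching criterion are illustrative assumptions, not the paper's actual scoring protocol, which could equally rely on human or LLM judging.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    """A ground-truth insight planted in (or known about) the dataset."""
    description: str
    keywords: list  # terms a finding must mention to count as a capture


def score_flags(findings, flags):
    """Return the fraction of flags captured by the agent's findings.

    A flag counts as captured if any single finding mentions all of its
    keywords (a crude textual proxy for recognizing the insight).
    """
    captured = 0
    for flag in flags:
        for finding in findings:
            text = finding.lower()
            if all(kw.lower() in text for kw in flag.keywords):
                captured += 1
                break
    return captured / len(flags) if flags else 0.0


# Hypothetical flags for a sales dataset, and one agent finding.
flags = [
    Flag("Sales dropped sharply in Q3 in the West region", ["west", "q3"]),
    Flag("One retailer dominates online sales", ["retailer", "online"]),
]
findings = ["We observe a sharp Q3 decline in the West region."]
print(score_flags(findings, flags))  # 0.5 — one of two flags captured
```

In practice one would want a more robust capture criterion than keyword matching, but the metric itself stays the same: captured flags over total flags.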
Authors: Issam Laradji, Perouz Taslakian, Sai Rajeswar, Valentina Zantedeschi, Alexandre Lacoste, Nicolas Chapados, David Vazquez, Christopher Pal, Alexandre Drouin