How Do Analysts Understand and Verify AI-Assisted Data Analyses? (2309.10947v2)
Abstract: Data analysis is challenging as it requires synthesizing domain knowledge, statistical expertise, and programming skills. Assistants powered by LLMs, such as ChatGPT, can assist analysts by translating natural language instructions into code. However, AI-assistant responses and analysis code can be misaligned with the analyst's intent or be seemingly correct but lead to incorrect conclusions. Therefore, validating AI assistance is crucial and challenging. Here, we explore how analysts understand and verify the correctness of AI-generated analyses. To observe analysts in diverse verification approaches, we develop a design probe equipped with natural language explanations, code, visualizations, and interactive data tables with common data operations. Through a qualitative user study (n=22) using this probe, we uncover common behaviors within verification workflows and how analysts' programming, analysis, and tool backgrounds reflect these behaviors. Additionally, we provide recommendations for analysts and highlight opportunities for designers to improve future AI-assistant experiences.
Authors: Ken Gu, Ruoxi Shang, Tim Althoff, Chenglong Wang, Steven M. Drucker