How Do Analysts Understand and Verify AI-Assisted Data Analyses? (2309.10947v2)

Published 19 Sep 2023 in cs.HC

Abstract: Data analysis is challenging because it requires synthesizing domain knowledge, statistical expertise, and programming skills. Assistants powered by large language models (LLMs), such as ChatGPT, can help analysts by translating natural language instructions into code. However, AI-assistant responses and analysis code can be misaligned with the analyst's intent, or can seem correct yet lead to incorrect conclusions. Validating AI assistance is therefore both crucial and challenging. Here, we explore how analysts understand and verify the correctness of AI-generated analyses. To observe analysts using diverse verification approaches, we develop a design probe equipped with natural language explanations, code, visualizations, and interactive data tables with common data operations. Through a qualitative user study (n=22) using this probe, we uncover common behaviors within verification workflows and show how analysts' programming, analysis, and tool backgrounds are reflected in these behaviors. Additionally, we provide recommendations for analysts and highlight opportunities for designers to improve future AI-assistant experiences.
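To make concrete the abstract's point that analysis code can "seem correct yet lead to incorrect conclusions," here is a minimal, hypothetical sketch in Python. It is not taken from the paper: the order data, the function names (`ai_generated_average`, `direct_average`, `verify`), and the scenario are illustrative assumptions. The sketch shows assistant-style code that runs without errors but answers a different question than the analyst asked, followed by a simple check that recomputes the result independently from the raw rows.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order records: (region, order_value). Illustrative data only.
orders = [
    ("north", 10.0), ("north", 12.0), ("north", 14.0), ("north", 16.0),
    ("south", 100.0),
]

def ai_generated_average(rows):
    """Plausible-looking code an assistant might produce for
    'average order value': it averages the per-region means.
    It runs cleanly but weights every region equally,
    no matter how many orders each region has."""
    by_region = defaultdict(list)
    for region, value in rows:
        by_region[region].append(value)
    return mean(mean(values) for values in by_region.values())

def direct_average(rows):
    """Independent recomputation over the raw rows: the overall
    mean order value, which is what the analyst intended."""
    return mean(value for _, value in rows)

def verify(rows, tolerance=1e-9):
    """Cross-check the generated result against the direct one.
    Returns (generated, expected, agrees)."""
    generated = ai_generated_average(rows)
    expected = direct_average(rows)
    return generated, expected, abs(generated - expected) <= tolerance

print(verify(orders))  # (56.5, 30.4, False)
```

The per-region shortcut gives 56.5 because the single large "south" order counts as much as all four "north" orders combined, while the intended overall mean is 30.4. Recomputing a result through a second, independent data operation is only one possible verification strategy; it is shown here as an illustration, not as the study's recommended method.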

Authors (5)
  1. Ken Gu (8 papers)
  2. Ruoxi Shang (3 papers)
  3. Tim Althoff (64 papers)
  4. Chenglong Wang (80 papers)
  5. Steven M. Drucker (4 papers)
Citations (17)