Take It, Leave It, or Fix It: Measuring Productivity and Trust in Human-AI Collaboration (2402.18498v2)

Published 28 Feb 2024 in cs.HC

Abstract: Although recent developments in generative AI have greatly enhanced the capabilities of conversational agents such as Google's Gemini (formerly Bard) or OpenAI's ChatGPT, it's unclear whether the usage of these agents aids users across various contexts. To better understand how access to conversational AI affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers (N=76) as they completed a programming exam with and without access to Bard. Effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). Our findings include evidence of automation complacency, increased reliance on the AI over the course of the task, and increased performance for novices on "solve"-type questions when using the AI. We discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational AI.
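
The paper's actual analysis pipeline is not reproduced on this page, but a minimal sketch can illustrate the kind of within-subjects comparison the abstract describes: scoring the same participants with and without AI access, split by expertise and question type. Everything below is hypothetical; the column names, the paired-design assumption, and the choice of a Wilcoxon signed-rank test are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: paired comparison of exam scores with vs. without AI access.
# Column names, the within-subjects (paired) design, and the Wilcoxon signed-rank
# test are illustrative assumptions, not the paper's reported analysis.
import pandas as pd
from scipy.stats import wilcoxon

# Assumed columns: participant, expertise ("novice"/"expert"),
# question_type ("solve"/"search"), score_with_ai, score_without_ai
df = pd.read_csv("exam_scores.csv")

for (expertise, qtype), group in df.groupby(["expertise", "question_type"]):
    # Non-parametric paired test on the same participants' scores
    # for questions completed with vs. without the AI assistant.
    stat, p = wilcoxon(group["score_with_ai"], group["score_without_ai"])
    delta = (group["score_with_ai"] - group["score_without_ai"]).mean()
    print(f"{expertise:>7} / {qtype:>6}: mean diff = {delta:+.2f}, W = {stat:.1f}, p = {p:.3f}")
```

A non-parametric paired test is used here only because exam scores from a modest sample (N=76) are unlikely to be normally distributed; the paper itself may use a different statistical approach.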

Authors (2)
  1. Crystal Qian (7 papers)
  2. James Wexler (15 papers)
Citations (3)
