Collecting Qualitative Data at Scale with Large Language Models: A Case Study (2309.10187v3)
Abstract: Chatbots have shown promise as tools to scale qualitative data collection. Recent advances in LLMs could accelerate this process by allowing researchers to easily deploy sophisticated interviewing chatbots. We test this assumption by conducting a large-scale user study (n=399) evaluating three chatbots: two LLM-based and a baseline that employs hard-coded questions. We evaluate the results with respect to participant engagement and experience, established metrics of chatbot quality grounded in theories of effective communication, and a novel scale evaluating "richness," or the extent to which responses capture the complexity and specificity of the social context under study. We find that, while the chatbots were able to elicit high-quality responses based on established evaluation metrics, the responses rarely capture participants' specific motives or personalized examples, and thus perform poorly with respect to richness. We further find low inter-rater reliability between LLMs and humans in the assessment of both quality and richness metrics. Our study offers a cautionary tale for scaling and evaluating qualitative research with LLMs.
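The abstract reports low inter-rater reliability between LLM and human raters. As an illustration only (the abstract does not say which reliability statistic the paper uses), agreement between two raters assigning categorical labels is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement under independence, from each rater's label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings (1 = "rich" response, 0 = not) for eight transcripts.
human = [1, 1, 0, 1, 0, 0, 1, 0]
llm   = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohens_kappa(human, llm))  # 0.25: slight agreement despite 62.5% raw overlap
```

Values near 0 indicate agreement no better than chance; the example shows how a seemingly reasonable raw overlap can still yield low kappa, which is the kind of gap the paper's reliability finding points to.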
- [n. d.]. GPT-3.5 and GPT-4 response times. https://www.taivo.ai/__gpt-3-5-and-gpt-4-response-times/. Accessed: September 14, 2023.
- [n. d.]. Orchestrate your AI with Semantic Kernel — Microsoft Learn. https://learn.microsoft.com/en-us/semantic-kernel/overview/. Accessed: September 14, 2023.
- Resilient chatbots: Repair strategy preferences for conversational breakdowns. In Proceedings of the 2019 CHI conference on human factors in computing systems. 1–12.
- Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
- Kelly Caine. 2016. Local standards for sample size at CHI. In Proceedings of the 2016 CHI conference on human factors in computing systems. 981–992.
- Typefaces and the Perception of Humanness in Natural Language Chatbots. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 3476–3487. https://doi.org/10.1145/3025453.3025919
- Justine Cassell. 2001. Embodied Conversational Agents: Representation and Intelligence in User Interfaces. AI Magazine 22, 4 (2001), 67–83.
- Frederick G Conrad and Michael F Schober. 2000. Clarifying question meaning in a household telephone survey. Public opinion quarterly 64, 1 (2000), 1–28.
- How to…do research interviews in different ways. The Clinical Teacher 15, 6 (2018), 451–456. https://doi.org/10.1111/tct.12953 arXiv:https://asmepublications.onlinelibrary.wiley.com/doi/pdf/10.1111/tct.12953
- Edith D De Leeuw et al. 2005. To mix or not to mix data collection modes in surveys. Journal of official statistics 21, 5 (2005), 233–255.
- Future directions for chatbot research: an interdisciplinary research agenda. Computing 103, 12 (2021), 2915–2942.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022).
- Eun Go and S. Shyam Sundar. 2019. Humanizing chatbots: The effects of visual, identity and conversational cues on humanness perceptions. Computers in Human Behavior 97 (2019), 304–316. https://doi.org/10.1016/j.chb.2019.01.020
- Saul Greenberg and Bill Buxton. 2008. Usability evaluation considered harmful (some of the time). In Proceedings of the SIGCHI conference on Human factors in computing systems. 111–120.
- Jonathan Grudin and Richard Jacques. 2019. Chatbots, Humbots, and the Quest for Artificial General Intelligence. Conference on Human Factors in Computing Systems (2019), 1–11.
- Approaches for dialog management in conversational agents. IEEE Internet Computing 23, 2 (2018), 13–22.
- Applied survey data analysis. CRC Press.
- Marek Hlavac. 2018. stargazer: Well-Formatted Regression and Summary Statistics Tables. https://CRAN.R-project.org/package=stargazer R package version 5.2.2.
- Embarking on large-scale qualitative research: Reaping the benefits of mixed methods in studying youth, clubs and drugs. Nordic Studies on Alcohol and Drugs 28, 5-6 (2011), 433–452.
- How different groups prioritize ethical values for responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 310–323.
- A review of key Likert scale development advances: 1995–2019. Frontiers in psychology 12 (2021), 637547.
- Comparing Data from Chatbot and Web Surveys: Effects of Platform and Conversational Style on Survey Response Quality. Conference on Human Factors in Computing Systems (2019), 1–12.
- Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
- Michal Kosinski. 2023. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083 (2023).
- Simon Kühne and Martin Kroh. 2018. Personalized feedback in web surveys: Does it affect respondents’ motivation and data quality? Social Science Computer Review 36, 6 (2018), 744–755.
- Research methods in human-computer interaction. Morgan Kaufmann.
- Paul D Leedy and Jeanne Ellis Ormrod. 2015. Practical research. Pearson.
- Package ‘emmeans’. R package version 1.8.8 (2023).
- Shing-On Leung. 2011. A comparison of psychometric properties and normality in 4-, 5-, 6-, and 11-point Likert scales. Journal of social service research 37, 4 (2011), 412–421.
- Confiding in and listening to virtual agents: The effect of personality. In Proceedings of the 22nd International Conference on Intelligent User Interfaces. 275–286.
- Holistic Evaluation of Language Models. arXiv:2211.09110 [cs.CL]
- How weird is CHI?. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14.
- Generated knowledge prompting for commonsense reasoning. arXiv preprint arXiv:2110.08387 (2021).
- Sharon L Lohr. 2021. Sampling: design and analysis. CRC Press.
- Ewa Luger and Abigail Sellen. 2016. Like Having a Really Bad PA: The Gulf between User Expectation and Experience of Conversational Agents. Conference on Human Factors in Computing Systems (2016), 1–13.
- Philip M. McCarthy. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Ph.D. Dissertation. https://www.proquest.com/dissertations-theses/assessment-range-usefulness-lexical-diversity/docview/305349212/se-2
- Philip M McCarthy and Scott Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior research methods 42, 2 (2010), 381–392.
- Verification strategies for establishing reliability and validity in qualitative research. International journal of qualitative methods 1, 2 (2002), 13–22.
- Audio, video, chat, email, or survey: How much does online interview mode matter? PloS one 17, 2 (2022), e0263876.
- OpenAI. 2023. GPT-4 Technical Report. Technical Report. OpenAI. https://arxiv.org/abs/2303.08774
- Michael Quinn Patton. 2014. Qualitative research & evaluation methods: Integrating theory and practice. Sage publications.
- Quid Pro Quo? Reciprocal Self-disclosure and Communicative Accommodation towards a Virtual Interviewer. In Intelligent Virtual Agents (Lecture Notes in Computer Science, Vol. 6895), Hannes Högni Vilhjálmsson, Stefan Kopp, Stacy Marsella, and Kristinn R. Thórisson (Eds.). Springer Berlin Heidelberg, 183–194. https://doi.org/10.1007/978-3-642-23974-8_20
- Qualtrics. 2020. Qualtrics. https://www.qualtrics.com/ Accessed July 2023.
- Application of humanization to survey chatbots: Change in chatbot perception, interaction experience, and survey data quality. Computers in Human Behavior 126 (2022), 107034.
- Herbert J Rubin and Irene S Rubin. 2011. Qualitative interviewing: The art of hearing data. Sage.
- Jeffrey Rubin and Dana Chisnell. 2008. Handbook of usability testing: How to plan, design, and conduct effective tests. John Wiley & Sons.
- Stuart J Russell. 2010. Artificial Intelligence: A Modern Approach. Pearson Education, Inc.
- Jeff Sauro and James R Lewis. 2016. Quantifying the user experience: Practical statistics for user research. Morgan Kaufmann.
- Not Only WEIRD but “Uncanny”? A Systematic Review of Diversity in Human–Robot Interaction Research. International Journal of Social Robotics (2023), 1–30.
- Mario Luis Small and Jessica McCrory Calarco. 2022. Qualitative literacy: A guide to evaluating ethnographic and interview research. Univ of California Press.
- Humanizing self-administered surveys: experiments on social presence in web and IVR surveys. Computers in human behavior 19, 1 (2003), 1–24.
- Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399 (2023).
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382 [cs.SE]
- If I Hear You Correctly: Building and Evaluating Interview Chatbots with Active Listening Skills. Conference on Human Factors in Computing Systems (2020), 1–13.
- Tell Me About Yourself: Using an AI-Powered Chatbot to Conduct Conversational Surveys with Open-ended Questions. ACM Transactions on Computer-Human Interaction (TOCHI) 27, 3 (2020), 1–37.
- Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2950–2968.
- Comparing Chatbots and Online Surveys for (Longitudinal) Data Collection: An Investigation of Response Characteristics, Data Quality, and User Evaluation. Communication Methods and Measures (2023), 1–20.
- An illusion of predictability in scientific results: Even experts confuse inferential uncertainty and outcome variability. Proceedings of the National Academy of Sciences 120, 33 (2023), e2302491120.
- Trusting Virtual Agents: The Effect of Personality. ACM Transactions on Interactive Intelligent Systems (TiiS) 9, 2-3 (2019), 10:1–10:36.
Authors: Eva M. Brown, Jennifer V. Scurrell, Jason Entenmann, Madeleine I. G. Daepp, Alejandro Cuevas