Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks (2405.10632v5)
Abstract: Model evaluations are central to understanding the safety, risks, and societal impacts of AI systems. While most real-world AI applications involve human-AI interaction, most current evaluations (e.g., common benchmarks) of AI models do not. Instead, they incorporate human factors in limited ways, assessing the safety of models in isolation, thereby falling short of capturing the complexity of human-model interactions. In this paper, we discuss and operationalize a definition of an emerging category of evaluations -- "human interaction evaluations" (HIEs) -- which focus on the assessment of human-model interactions or the process and the outcomes of humans using models. First, we argue that HIEs can be used to increase the validity of safety evaluations, assess direct human impact and interaction-specific harms, and guide future assessments of models' societal impact. Second, we propose a safety-focused HIE design framework -- containing a human-LLM interaction taxonomy -- with three stages: (1) identifying the risk or harm area, (2) characterizing the use context, and (3) choosing the evaluation parameters. Third, we apply our framework to two potential evaluations for overreliance and persuasion risks. Finally, we conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.
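The three-stage design framework described in the abstract can be pictured as a small evaluation specification. The sketch below is an illustrative assumption only: the `HIEDesign` class and its field names are hypothetical and not an API from the paper; the example values (overreliance as a risk area, weight-of-advice and decision-regret as outcome measures) are drawn from the risks and measures the paper discusses.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the paper's three-stage HIE design framework.
# Class and field names are illustrative assumptions, not the authors' terminology or code.

@dataclass
class HIEDesign:
    # Stage 1: identify the risk or harm area (e.g., overreliance, persuasion)
    risk_area: str
    # Stage 2: characterize the use context (task, user population, interaction mode)
    task: str
    user_population: str
    interaction_mode: str
    # Stage 3: choose the evaluation parameters (study design, outcome metrics)
    study_design: str = "randomized controlled trial"
    metrics: list[str] = field(default_factory=list)

# Example: a possible overreliance evaluation, specified with this sketch.
overreliance_eval = HIEDesign(
    risk_area="overreliance",
    task="advice-seeking with an LLM assistant",
    user_population="general-population crowdworkers",
    interaction_mode="open-ended dialogue",
    metrics=["weight of advice", "decision regret"],
)
print(overreliance_eval)
```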
Authors: Lujain Ibrahim, Saffron Huang, Lama Ahmad, Markus Anderljung