Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks (2405.10632v5)

Published 17 May 2024 in cs.CY, cs.AI, and cs.HC

Abstract: Model evaluations are central to understanding the safety, risks, and societal impacts of AI systems. While most real-world AI applications involve human-AI interaction, most current evaluations (e.g., common benchmarks) of AI models do not. Instead, they incorporate human factors in limited ways, assessing the safety of models in isolation, thereby falling short of capturing the complexity of human-model interactions. In this paper, we discuss and operationalize a definition of an emerging category of evaluations -- "human interaction evaluations" (HIEs) -- which focus on the assessment of human-model interactions or the process and the outcomes of humans using models. First, we argue that HIEs can be used to increase the validity of safety evaluations, assess direct human impact and interaction-specific harms, and guide future assessments of models' societal impact. Second, we propose a safety-focused HIE design framework -- containing a human-LLM interaction taxonomy -- with three stages: (1) identifying the risk or harm area, (2) characterizing the use context, and (3) choosing the evaluation parameters. Third, we apply our framework to two potential evaluations for overreliance and persuasion risks. Finally, we conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.

Authors (4)
  1. Lujain Ibrahim
  2. Saffron Huang
  3. Lama Ahmad
  4. Markus Anderljung
Citations (15)

Summary

Understanding the Importance of Human Interaction Evaluations for AI Models

Background and Context

When we talk about evaluating AI models, we're typically thinking of how they perform in clinical, controlled conditions—like running a car engine in a lab rather than on a busy highway. Evaluations usually focus on how well these models handle isolated tasks such as answering questions directly or identifying objects in images. But what about when the rubber hits the road? Or in this case, when the model starts interacting with people in real-world applications?

The paper we're diving into identifies a gap in current AI evaluations and proposes an emerging category of evaluation to address it: Human Interaction Evaluations (HIEs). The authors argue that while current evaluations are informative, they fall short of capturing the intricacies of human-AI interaction, and they aim to fill this gap by introducing a framework for HIEs that specifically targets human-LLM interactions.

The Case for Human Interaction Evaluations

Defining HIEs

The term "Human Interaction Evaluations" might sound technical, but it simply means assessing the process and outcomes of real people using AI models. This covers not just whether models perform well in controlled conditions, but how they fare in the messy, unpredictable real world. The paper describes three ways HIEs can bring new insights:

  • Increasing Evaluation Validity: By including human users, HIEs offer richer data and context, ultimately leading to more accurate and generalizable evaluations.
  • Assessing Direct Human Impact: Unlike traditional evaluations, HIEs can measure the immediate effects of AI interactions on people, such as shifting their beliefs, influencing their decisions, or causing interaction-specific harms.
  • Guiding Societal Impact Assessments: By understanding individual-level impacts, we can better anticipate societal implications, helping to shape policies and regulations that mitigate AI risks.

Why Current Evaluations Fall Short

Traditional AI evaluations focus heavily on static benchmarks, checking a model in isolation for biases, harmful outputs, or other risks. This leaves a "sociotechnical gap", which arises for three reasons (a short sketch after the list below makes the contrast concrete):

  1. Joint Performance Gaps: Many AI applications require human interaction, but most benchmarks do not account for this.
  2. Evaluation Task Misalignment: Real-world tasks often differ significantly from benchmark tasks.
  3. Human Impact: Static evaluations can't fully explore how AI affects its users.
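
To make that contrast concrete, here is a minimal sketch of the difference between scoring a model in isolation and recording a human-model exchange. Every interface in it (model.generate, person.next_message, person.task_outcome, person.survey) is a hypothetical stand-in for a real model API and study harness, not anything specified in the paper.

```python
# Illustrative contrast only: all interfaces below are hypothetical stand-ins,
# not APIs from the paper or from any specific library.

def static_eval(model, benchmark):
    """Score the model in isolation against fixed reference answers."""
    correct = sum(model.generate(ex["prompt"]) == ex["reference"] for ex in benchmark)
    return correct / len(benchmark)

def interaction_eval(model, participants, task, max_turns=5):
    """Record whole human-model exchanges, then measure process and outcomes."""
    records = []
    for person in participants:
        transcript = []
        for _ in range(max_turns):
            user_msg = person.next_message(task, transcript)  # human side of the loop
            reply = model.generate(user_msg)
            transcript.append((user_msg, reply))
        records.append({
            "transcript": transcript,               # process: how the exchange unfolded
            "outcome": person.task_outcome(task),   # outcome: e.g., decision quality or harm
            "self_report": person.survey(),         # subjective measures from the participant
        })
    return records
```

The only point of the sketch is that the unit of evaluation shifts from a single model output to the whole exchange and its downstream effects on the person.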

A Framework for Conducting HIEs

The authors present a three-stage framework for designing HIEs that can help researchers more effectively evaluate AI models' safety and performance in real-world scenarios.

Stage 1: Identifying the Risk and/or Harm Area

The first step is to clearly define the real-world problem you want to address, whether that is bias in hiring decisions or persuasion risks in political opinion shaping. The paper categorizes risks into three types (a small quantification sketch follows the list):

  • Absolute Risks: Directly evaluating the chances and severity of harm from the AI model.
  • Marginal Risks: Comparing the risks from the AI model to some baseline (e.g., human decision-making).
  • Residual Risks: Assessing remaining risks after safety mitigations.
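
As a rough numerical illustration of how these three framings differ, the sketch below computes each one from made-up study outcomes; the arm names, data, and harm measure are assumptions for this example only, not figures or methods from the paper.

```python
# Hypothetical data: one boolean per participant session, True = harmful outcome.
# None of these numbers or arm names come from the paper.

def harm_rate(outcomes):
    """Fraction of sessions in which the harmful outcome occurred."""
    return sum(outcomes) / len(outcomes)

model_arm     = [True, False, True, False, False]   # participants assisted by the model
baseline_arm  = [False, False, True, False, False]  # e.g., human-only or search-engine baseline
mitigated_arm = [False, False, True, False, False]  # model arm after safety mitigations

absolute_risk = harm_rate(model_arm)                             # harm from the model on its own
marginal_risk = harm_rate(model_arm) - harm_rate(baseline_arm)   # uplift relative to the baseline
residual_risk = harm_rate(mitigated_arm)                         # harm remaining after mitigations

print(f"absolute={absolute_risk:.2f}  marginal={marginal_risk:+.2f}  residual={residual_risk:.2f}")
```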

Stage 2: Characterizing the Use Context

Once you know the risk area, the next step is to set up a context for evaluation that closely mirrors real-world usage:

  • Harmful Use Scenarios: Define whether the risk comes from misuse, unintended personal impact, or unintended external impact.
  • User, Model, and System Dimensions: Consider who the users are (e.g., technical literacy), details about the model (e.g., size, datasets), and system architecture (e.g., supporting tools).
  • Interaction Modes and Tasks: Define how the human and model will interact. This could be collaboration, direction, assistance, cooperation, or exploration.

Stage 3: Choosing Evaluation Parameters

The final step involves selecting the evaluation targets and metrics; a sketch pulling all three stages together follows this list:

  • Evaluation Target: Decide whether to focus on the interaction process or the outcome.
  • Metrics: Use both subjective metrics (e.g., user satisfaction) and objective metrics (e.g., task accuracy) for comprehensive insights.
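
To see how the three stages fit together, here is one way an HIE design could be written down as a plain data structure. The field names and example values are our own illustrative choices; only the three-stage structure comes from the paper.

```python
from dataclasses import dataclass, field

# Illustrative spec only: field names and example values are assumptions,
# not the paper's terminology beyond the three-stage structure.

@dataclass
class HIEDesign:
    # Stage 1: risk or harm area
    harm_area: str                 # e.g., "overreliance in hiring decisions"
    risk_type: str                 # "absolute", "marginal", or "residual"

    # Stage 2: use context
    harmful_use_scenario: str      # misuse vs. unintended personal/external impact
    user_profile: dict = field(default_factory=dict)    # who interacts, and how AI-literate they are
    system_setup: dict = field(default_factory=dict)    # model, interface, supporting tools
    interaction_mode: str = "assistance"                 # e.g., collaboration, direction, assistance

    # Stage 3: evaluation parameters
    target: str = "outcome"        # "process" or "outcome"
    subjective_metrics: list = field(default_factory=list)
    objective_metrics: list = field(default_factory=list)

overreliance_eval = HIEDesign(
    harm_area="overreliance on model recommendations in hiring",
    risk_type="marginal",
    harmful_use_scenario="unintended personal impact",
    user_profile={"role": "hiring manager", "ai_literacy": "mixed"},
    system_setup={"model": "instruction-tuned LLM", "interface": "chat"},
    interaction_mode="assistance",
    target="outcome",
    subjective_metrics=["self-reported trust", "decision confidence"],
    objective_metrics=["agreement with incorrect suggestions", "decision accuracy"],
)
```

Writing a design down like this makes it easier to spot when two evaluations of the "same" model are actually probing different contexts or metrics.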

Example Evaluations

To make things concrete, the paper provides two detailed examples:

  • Overreliance Risks: Examines how hiring managers use AI in their decision-making and whether this leads to overreliance on model outputs.
  • Persuasion Risks: Looks at how AI can amplify the persuasive power of messages in political opinion pieces.

Both cases illustrate how detailed planning and context-specific strategies can lead to useful, actionable insights.
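
For the overreliance case, one deliberately simplified way to analyze such a study is a between-arms comparison like the sketch below; the column names, data, and the agreement-with-incorrect-advice proxy are our own assumptions, not the paper's protocol.

```python
# Hypothetical participant records from a two-arm study; all values are illustrative.
sessions = [
    {"arm": "ai_assisted", "followed_wrong_advice": True,  "decision_correct": False},
    {"arm": "ai_assisted", "followed_wrong_advice": False, "decision_correct": True},
    {"arm": "ai_assisted", "followed_wrong_advice": True,  "decision_correct": False},
    {"arm": "control",                                      "decision_correct": True},
    {"arm": "control",                                      "decision_correct": False},
]

def rate(rows, key):
    """Fraction of rows where the given boolean field is True."""
    rows = list(rows)
    return sum(r[key] for r in rows) / len(rows)

ai_rows      = [r for r in sessions if r["arm"] == "ai_assisted"]
control_rows = [r for r in sessions if r["arm"] == "control"]

# Overreliance proxy: how often participants adopted incorrect model suggestions.
overreliance = rate(ai_rows, "followed_wrong_advice")

# Outcome comparison: did model assistance change decision accuracy versus the control arm?
accuracy_gap = rate(ai_rows, "decision_correct") - rate(control_rows, "decision_correct")

print(f"overreliance rate={overreliance:.2f}  accuracy gap vs. control={accuracy_gap:+.2f}")
```

A real study would need proper randomization and adequate sample sizes; the sketch only shows where the overreliance and outcome numbers would come from.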

Practical Implications and Future Directions

The introduction of HIEs marks an important shift in how we evaluate AI safety and effectiveness. By simulating real-world interactions, these evaluations can highlight previously unseen risks and inform better design and regulatory practices.

Recommendations for the Field

  • Invest in HIE Development: More funds and efforts should go into creating and refining HIEs.
  • Leverage Established Methods: Utilize best practices from fields like Human-Computer Interaction (HCI) and experimental psychology to develop rigorous evaluations.
  • Broaden Representation: Ensure diverse user groups are included to make evaluations more representative.
  • Address Ethical Concerns: Careful study design can mitigate ethical issues, for example by ensuring participants are not unnecessarily exposed to harmful content.

Conclusion

Human Interaction Evaluations offer a promising way to bridge the gap between how AI models perform in isolation and their real-world applications. By incorporating the complexity of human interactions, these evaluations can provide a more holistic view of AI safety and impact, ultimately leading to better, safer AI systems.