ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models (2405.18638v2)

Published 28 May 2024 in cs.CL and cs.AI

Abstract: In this position paper, we argue that human evaluation of generative LLMs should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful LLMs -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
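
To illustrate the kind of reliability concern the abstract raises about Likert-style rating scores, below is a minimal, self-contained sketch that is not taken from the paper: it assumes a 5-point scale and a simple "slip by one point with some probability" model of cognitive uncertainty, then measures how that noise depresses exact agreement and Cohen's kappa between two simulated raters. The noise levels and scale are illustrative assumptions only.

```python
# Illustrative sketch (not from the paper): how per-item rating uncertainty
# degrades inter-rater reliability on a 5-point Likert scale.
import random
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters judging the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if the raters' marginal label distributions were independent.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

def noisy_likert(true_score, noise_prob, low=1, high=5):
    """Return the true score, or a +/-1 slip with probability `noise_prob` (assumed noise model)."""
    if random.random() < noise_prob:
        slipped = true_score + random.choice([-1, 1])
        return min(max(slipped, low), high)
    return true_score

random.seed(0)
true_scores = [random.randint(1, 5) for _ in range(1000)]
for noise in (0.0, 0.2, 0.4):
    rater_a = [noisy_likert(s, noise) for s in true_scores]
    rater_b = [noisy_likert(s, noise) for s in true_scores]
    agree = sum(a == b for a, b in zip(rater_a, rater_b)) / len(true_scores)
    print(f"noise={noise:.1f}  exact agreement={agree:.2f}  kappa={cohens_kappa(rater_a, rater_b):.2f}")
```

With zero noise the simulated raters agree perfectly; even modest per-item uncertainty pulls both exact agreement and kappa down, which is the kind of effect the authors argue evaluation designs and scoring criteria must account for.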

Authors (5)
  1. Aparna Elangovan (8 papers)
  2. Ling Liu (132 papers)
  3. Lei Xu (172 papers)
  4. Sravan Bodapati (31 papers)
  5. Dan Roth (222 papers)