Rethinking Model Evaluation as Narrowing the Socio-Technical Gap (2306.03100v3)
Abstract: The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single model, often referred to as ``general-purpose''. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether, and to what extent, human needs in downstream use cases can be satisfied by the given model (the socio-technical gap). Drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and to embrace diverse evaluation methods, acknowledging the trade-off between realism with respect to those socio-requirements and the pragmatic costs of conducting the evaluation. By mapping HCI and current NLG evaluation methods, we identify opportunities for LLM evaluation methods to narrow the socio-technical gap and pose open questions.
Authors: Q. Vera Liao, Ziang Xiao