Rethinking Model Evaluation as Narrowing the Socio-Technical Gap (2306.03100v3)

Published 1 Jun 2023 in cs.HC and cs.AI

Abstract: The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single, often referred to as "general-purpose", model. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments for whether and how much human needs in downstream use cases can be satisfied by the given model (socio-technical gap). By drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and embrace diverse evaluation methods with an acknowledgment of trade-offs between realism to socio-requirements and pragmatic costs to conduct the evaluation. By mapping HCI and current NLG evaluation methods, we identify opportunities for evaluation methods for LLMs to narrow the socio-technical gap and pose open questions.

Authors (2)
  1. Q. Vera Liao (49 papers)
  2. Ziang Xiao (25 papers)
Citations (18)

Summary

Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

The paper, titled "Rethinking Model Evaluation as Narrowing the Socio-Technical Gap" by Q. Vera Liao and Ziang Xiao, examines how LLMs should be evaluated as they come to power an increasingly diverse range of applications. It argues that traditional model evaluation metrics are inadequate for the complex socio-technical considerations these models raise, and it proposes a framework for evolving evaluation practices to more effectively bridge the socio-technical gap.

The current generation of LLMs marks a pivotal shift in NLP, characterized by homogenization: a vast range of applications powered by a few general-purpose models. While this development offers potential advantages in efficiency and accessibility, it also introduces significant challenges for evaluation. The authors argue that model evaluation must move beyond traditional performance metrics that rely predominantly on lexical matching, such as ROUGE, and instead acknowledge that the criteria for judging model outputs are context-dependent and multi-dimensional.
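
To ground what "lexical matching" means in this critique, the sketch below computes a bare-bones ROUGE-1 recall (clipped unigram overlap between a candidate and a reference). It is a simplified illustration only, not the official toolkit: the actual ROUGE package adds stemming, higher-order n-grams, longest-common-subsequence variants, and precision/F-measure scores. The function name and example strings are ours.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate (clipped counts)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# Two outputs can score identically here while differing greatly in usefulness to a reader,
# which is the kind of gap the paper argues surface-level metrics cannot capture.
print(rouge1_recall("the cat sat on the mat", "a cat was on the mat"))  # ~0.67
```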

The concept of the socio-technical gap, rooted in human-computer interaction (HCI), refers to the discrepancy between technological capabilities and human requirements in deployment contexts. Liao and Xiao draw upon lessons from HCI and the field of explainable AI (XAI) to underline the necessity for developing evaluation methods that are grounded in real-world requirements, advocating for an interdisciplinary approach that incorporates socio-technical considerations into the evaluation process.

Two primary goals guide the proposed approach to evaluation: first, to systematically study human needs and socio-requirements in downstream use cases, establishing principles and representations that shape the development and assessment of machine learning technologies; second, to create evaluation methods that serve as valid proxies for those socio-requirements, balancing realism with pragmatic considerations such as cost and resource allocation.

To narrow the socio-technical gap effectively, the paper situates a range of evaluation methods along two dimensions: context realism and human requirement realism. Notably, the authors discuss opportunities for improvement in current LLM evaluation practices, highlighting efforts such as HELM (Holistic Evaluation of Language Models) that seek to map existing benchmarks to specific use-case scenarios, thereby enhancing context realism.
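
As an illustration of this two-dimensional framing, here is a small, hypothetical sketch of how evaluation methods could be tagged by context realism, human requirement realism, and pragmatic cost. The method names, numeric placements, and cost labels are assumptions for illustration, not values taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class EvaluationMethod:
    """Places an evaluation method along the two realism dimensions, plus its rough cost."""
    name: str
    context_realism: float      # 0 = abstract benchmark setting, 1 = the deployed use case itself
    requirement_realism: float  # 0 = automatic proxy metric, 1 = directly measured human needs
    cost: str                   # pragmatic cost of running the evaluation

# Hypothetical placements for a few common evaluation setups.
methods = [
    EvaluationMethod("automatic benchmark scoring (e.g., ROUGE on a test set)", 0.1, 0.1, "low"),
    EvaluationMethod("crowdworker quality ratings", 0.3, 0.5, "medium"),
    EvaluationMethod("task-based user study in a target application", 0.8, 0.8, "high"),
    EvaluationMethod("field deployment with real stakeholders", 1.0, 1.0, "very high"),
]

# Sorting by combined realism makes the realism-versus-cost trade-off explicit.
for m in sorted(methods, key=lambda m: m.context_realism + m.requirement_realism):
    print(f"{m.name}: context={m.context_realism}, requirement={m.requirement_realism}, cost={m.cost}")
```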

The implications of this research are manifold. Practically, adopting evaluation methods that consider socio-technical requirements will enhance the deployment and utility of LLMs across various domains, aligning model capabilities with human values and needs. Theoretically, this perspective challenges the AI research community to integrate interdisciplinary insights and methodologies in shaping future model evaluation frameworks.

Looking forward, the paper encourages exploration into several open questions, such as refining evaluation metrics to encapsulate constructs aligned with human values, determining the most effective representation of downstream use cases, and defining justified trade-offs between evaluation costs and methodological fidelity. Overall, this research proposes a comprehensive and nuanced paradigm for evaluating LLMs, urging a shift from performance-centric to context-aware assessment practices. The paper contributes substantially to ongoing discourse regarding the responsible and effective integration of AI systems into social and technical infrastructures.
