Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
The paper under review, "Rethinking Model Evaluation as Narrowing the Socio-Technical Gap" by Q. Vera Liao and Ziang Xiao, examines how to evaluate LLMs as they become embedded in an ever wider range of applications. The authors argue that traditional model evaluation metrics fail to address the complex socio-technical considerations these models raise, and they propose a framework for evolving evaluation practices to more effectively bridge the socio-technical gap.
The current landscape of LLMs marks a pivotal shift in NLP, characterized by the homogenization of a vast range of applications around general-purpose models. While this development offers advantages in efficiency and accessibility, it significantly complicates evaluation. The authors argue that model evaluation must move beyond traditional performance metrics that rely on lexical matching, such as ROUGE scores, because model outputs are context-dependent and must be judged against multi-dimensional criteria.
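To make this limitation concrete, consider a minimal sketch of ROUGE-1 recall (a pure-Python approximation for illustration, not the paper's or any library's implementation): a faithful paraphrase of a reference can score poorly simply because it shares few surface tokens, while a verbatim copy scores perfectly.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    # Naive whitespace tokenization; punctuation stays attached to words,
    # underscoring how sensitive lexical metrics are to surface form.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "The senator announced she will not seek reelection next year."
paraphrase = "Next year's race will go on without her: the lawmaker is stepping down."
verbatim = "The senator announced she will not seek reelection next year."

print(rouge1_recall(reference, paraphrase))  # low score despite equivalent meaning
print(rouge1_recall(reference, verbatim))    # 1.0 for an exact copy
```

A metric like this says nothing about whether an output meets the needs of its deployment context, which is precisely the gap the authors want evaluation to address.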
The concept of the socio-technical gap, rooted in human-computer interaction (HCI), refers to the discrepancy between technological capabilities and human requirements in deployment contexts. Liao and Xiao draw upon lessons from HCI and the field of explainable AI (XAI) to underline the necessity for developing evaluation methods that are grounded in real-world requirements, advocating for an interdisciplinary approach that incorporates socio-technical considerations into the evaluation process.
Two primary goals guide the proposed approach to evaluation: first, to systematically study human needs and socio-requirements in downstream use cases, establishing principles and representations that shape the development and assessment of machine learning technologies; second, to create evaluation methods that serve as valid proxies for these socio-requirements, balancing realism against pragmatic considerations such as cost and resource allocation.
To narrow the socio-technical gap effectively, the paper situates a range of evaluation methods along two dimensions: context realism and human requirement realism. Notably, the authors discuss opportunities for improvement in current LLM evaluation practices, highlighting efforts such as HELM (Holistic Evaluation of Language Models) that map existing benchmarks to specific use-case scenarios, thereby enhancing context realism.
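One way to read the two dimensions is as axes on which any evaluation method can be placed. The sketch below is a hypothetical catalogue, with method names and ratings that are illustrative assumptions rather than placements taken from the paper, showing how such a mapping could be represented and compared:

```python
from dataclasses import dataclass

@dataclass
class EvalMethod:
    name: str
    context_realism: int      # 1 = canned benchmark inputs, 5 = real deployment context
    requirement_realism: int  # 1 = lexical-match proxy, 5 = directly measured user needs

# Hypothetical placements along the paper's two dimensions (illustrative only).
methods = [
    EvalMethod("static benchmark with ROUGE", 1, 1),
    EvalMethod("benchmark mapped to use-case scenarios (HELM-style)", 3, 2),
    EvalMethod("crowdworker quality ratings", 2, 3),
    EvalMethod("user study on a simulated task", 4, 4),
    EvalMethod("field deployment with real stakeholders", 5, 5),
]

# Rank methods by distance from the fully realistic corner (5, 5),
# a crude proxy for how much of the socio-technical gap each leaves open.
for m in sorted(methods, key=lambda m: (5 - m.context_realism) + (5 - m.requirement_realism)):
    print(f"{m.name}: context={m.context_realism}, requirements={m.requirement_realism}")
```

Framing the choice this way also makes the authors' pragmatic point visible: methods nearer the realistic corner tend to cost more, so practitioners must trade realism against resources rather than maximize one axis.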
The implications of this research are both practical and theoretical. Practically, evaluation methods that account for socio-technical requirements should improve the deployment and utility of LLMs across domains by aligning model capabilities with human values and needs. Theoretically, this perspective challenges the AI research community to integrate interdisciplinary insights and methodologies into future model evaluation frameworks.
Looking forward, the paper raises several open questions: how to refine evaluation metrics so they capture constructs aligned with human values, how best to represent downstream use cases, and what trade-offs between evaluation cost and methodological fidelity are justified. Overall, the paper proposes a comprehensive and nuanced paradigm for evaluating LLMs, urging a shift from performance-centric to context-aware assessment practices, and contributes substantially to the ongoing discourse on responsibly and effectively integrating AI systems into social and technical infrastructures.