
Psychological Metrics for Dialog System Evaluation (2305.14757v2)

Published 24 May 2023 in cs.CL

Abstract: We present metrics for evaluating dialog systems through a psychologically-grounded "human" lens in which conversational agents express a diversity of both states (e.g., emotion) and traits (e.g., personality), just as people do. We present five interpretable metrics from established psychology that are fundamental to human communication and relationships: emotional entropy, linguistic style and emotion matching, agreeableness, and empathy. These metrics can be applied (1) across dialogs and (2) on turns within dialogs. The psychological metrics are compared against seven state-of-the-art traditional metrics (e.g., BARTScore and BLEURT) on seven standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate that our proposed metrics offer novel information; they are uncorrelated with traditional metrics, can be used to meaningfully compare dialog systems, and lead to increased accuracy (beyond existing traditional metrics) in predicting crowd-sourced dialog judgements. The interpretability and unique signal of our psychological metrics make them a valuable tool for evaluating and improving dialog systems.
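Among the five metrics, emotional entropy is the most directly computable. The paper's exact operationalization is not reproduced here, but assuming each dialog turn is assigned a discrete emotion label, a minimal sketch of Shannon entropy over the resulting emotion distribution would look like:

```python
import math
from collections import Counter

def emotional_entropy(emotion_labels):
    """Shannon entropy (in bits) of the emotion distribution across dialog turns.

    Low entropy means the agent expresses one dominant emotion;
    high entropy means its emotions are spread across many categories.
    """
    counts = Counter(emotion_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A dialog expressing a single emotion has zero entropy;
# a uniform mix over four emotions has entropy of 2 bits.
print(emotional_entropy(["joy", "joy", "joy"]))                # 0.0
print(emotional_entropy(["joy", "anger", "sadness", "fear"]))  # 2.0
```

The function names and the turn-level labeling scheme here are illustrative assumptions, not the authors' implementation.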

Authors (9)
  1. Salvatore Giorgi
  2. Shreya Havaldar
  3. Farhan Ahmed
  4. Zuhaib Akhtar
  5. Shalaka Vaidya
  6. Gary Pan
  7. Lyle H. Ungar
  8. H. Andrew Schwartz
  9. Joao Sedoc
Citations (1)