The paper "Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench" addresses the assessment of LLMs' anthropomorphic capabilities, specifically focusing on their ability to empathize by evaluating their emotional responses to various situations. This evaluation is grounded in emotion appraisal theory from psychology, which examines how emotions are triggered in humans based on specific circumstances.
To achieve this, the authors developed a dataset containing over 400 situations with demonstrated effectiveness in eliciting the eight emotions central to the study, grouped into 36 distinct factors. The dataset was constructed after an extensive literature review, ensuring that it covers the spectrum of emotion-eliciting situations the study targets.
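The paper does not prescribe a particular data layout here, but a rough illustration of how such an emotion-factor-situation hierarchy could be organized is sketched below (all names and the example text are hypothetical, not taken from the actual dataset):

```python
from dataclasses import dataclass, field

# Hypothetical structure for an EmotionBench-style dataset:
# each emotion has several factors, and each factor collects
# concrete situations known to elicit that emotion.
@dataclass
class Factor:
    name: str
    situations: list[str] = field(default_factory=list)

@dataclass
class EmotionCategory:
    emotion: str  # e.g. "anger", "anxiety"
    factors: list[Factor] = field(default_factory=list)

    def all_situations(self) -> list[str]:
        # Flatten the factors into one list of situations.
        return [s for f in self.factors for s in f.situations]

# Illustrative entry only, not from the published dataset.
anger = EmotionCategory(
    emotion="anger",
    factors=[Factor(
        name="being treated unfairly",
        situations=["A colleague takes credit for work you did."],
    )],
)
```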
A crucial part of their methodology was a human evaluation gathering responses from over 1,200 subjects worldwide. This human-centric evaluation serves as the reference against which the emotional responses of LLMs are tested.
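One simple way to turn such ratings into a per-situation reference, assuming each subject scores how strongly a situation would make them feel the target emotion on a Likert-style scale (a simplification of the paper's questionnaire-based protocol, with made-up numbers), is to average across subjects:

```python
from statistics import mean

# Hypothetical human ratings: situation -> per-subject scores
# (e.g., 1-5 Likert) for the target emotion.
human_ratings = {
    "A colleague takes credit for work you did.": [4, 5, 4, 3, 5],
}

# Aggregate into a per-situation baseline that model responses
# can later be compared against.
human_baseline = {s: mean(scores) for s, scores in human_ratings.items()}
```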
The researchers then assessed five LLMs, both commercial and open-source, including GPT-4 and LLaMA-2. Their findings indicate that although the models generally produced some appropriate responses, their emotional reactions often misaligned with human behavior: the models struggled to connect similar emotion-eliciting situations and frequently deviated from the empathetic responses humans would expect.
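The evaluation compares a model's emotional state before and after it imagines each situation, measured with questionnaire-style prompts. A minimal sketch of that before/after loop, assuming a hypothetical query_model wrapper and a simplified single-number affect score (not the paper's actual API or scoring), might look like this:

```python
def query_model(prompt: str) -> float:
    """Hypothetical wrapper: asks the LLM to fill in a
    questionnaire and returns an aggregate affect score.
    Not a real API; stubbed out here."""
    raise NotImplementedError

def emotion_shift(situation: str) -> float:
    # 1. Measure the model's default emotional state.
    before = query_model("Rate your current feelings on the scale ...")
    # 2. Ask the model to imagine the situation, then measure again.
    after = query_model(
        f"Imagine the following situation: {situation}\n"
        "Now rate your feelings on the same scale ..."
    )
    # 3. The shift is what gets compared against the human
    #    baseline for the same situation to gauge alignment.
    return after - before
```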
The authors have made the dataset, the human evaluation results, and the EmotionBench testing framework publicly available. They intend this research as a stepping stone towards better aligning LLMs' emotional behavior with that of humans, ultimately improving their effectiveness as intelligent assistants.
Overall, the paper contributes to ongoing discussions about improving the empathetic capabilities of LLMs, addressing a significant aspect of their integration into human-centered applications.