The paper "Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench" addresses the assessment of LLMs' anthropomorphic capabilities, specifically focusing on their ability to empathize by evaluating their emotional responses to various situations. This evaluation is grounded in emotion appraisal theory from psychology, which examines how emotions are triggered in humans based on specific circumstances.
To achieve this, the authors developed a dataset containing over 400 situations with demonstrated effectiveness in eliciting the eight emotions central to the study, grouped into 36 distinct factors. The dataset was constructed after an extensive literature review, ensuring that it covers the spectrum of emotion-eliciting situations the study targets.
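The paper does not prescribe a particular data layout here, but a rough illustration of how such an emotion-factor-situation hierarchy could be organized is sketched below (all names and the example text are hypothetical, not taken from the actual dataset):

```python
from dataclasses import dataclass, field

# Hypothetical structure for an EmotionBench-style dataset:
# each emotion has several factors, and each factor collects
# concrete situations known to elicit that emotion.
@dataclass
class Factor:
    name: str
    situations: list[str] = field(default_factory=list)

@dataclass
class EmotionCategory:
    emotion: str  # e.g. "anger", "anxiety"
    factors: list[Factor] = field(default_factory=list)

    def all_situations(self) -> list[str]:
        # Flatten the factors into one list of situations.
        return [s for f in self.factors for s in f.situations]

# Illustrative entry only, not from the published dataset.
anger = EmotionCategory(
    emotion="anger",
    factors=[Factor(
        name="being treated unfairly",
        situations=["A colleague takes credit for work you did."],
    )],
)
```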
A crucial part of their methodology was a human evaluation gathering responses from over 1,200 subjects worldwide. This human-centric evaluation serves as the reference against which the emotional responses of LLMs are tested.
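One simple way to turn such ratings into a per-situation reference, assuming each subject scores how strongly a situation would make them feel the target emotion on a Likert-style scale (a simplification of the paper's questionnaire-based protocol, with made-up numbers), is to average across subjects:

```python
from statistics import mean

# Hypothetical human ratings: situation -> per-subject scores
# (e.g., 1-5 Likert) for the target emotion.
human_ratings = {
    "A colleague takes credit for work you did.": [4, 5, 4, 3, 5],
}

# Aggregate into a per-situation baseline that model responses
# can later be compared against.
human_baseline = {s: mean(scores) for s, scores in human_ratings.items()}
```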
The researchers then assessed five LLMs, both commercial and open-source, including GPT-4 and LLaMA-2. Their findings indicate that although the models generally produced some appropriate responses, their emotional reactions often misaligned with human behavior: the models struggled to connect similar emotion-eliciting situations and frequently deviated from the empathetic responses humans would expect.
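The evaluation compares a model's emotional state before and after it imagines each situation, measured with questionnaire-style prompts. A minimal sketch of that before/after loop, assuming a hypothetical query_model wrapper and a simplified single-number affect score (not the paper's actual API or scoring), might look like this:

```python
def query_model(prompt: str) -> float:
    """Hypothetical wrapper: asks the LLM to fill in a
    questionnaire and returns an aggregate affect score.
    Not a real API; stubbed out here."""
    raise NotImplementedError

def emotion_shift(situation: str) -> float:
    # 1. Measure the model's default emotional state.
    before = query_model("Rate your current feelings on the scale ...")
    # 2. Ask the model to imagine the situation, then measure again.
    after = query_model(
        f"Imagine the following situation: {situation}\n"
        "Now rate your feelings on the same scale ..."
    )
    # 3. The shift is what gets compared against the human
    #    baseline for the same situation to gauge alignment.
    return after - before
```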
The authors have made the dataset, the human evaluation results, and the EmotionBench testing framework publicly available. They intend this research as a stepping stone towards better aligning LLMs' emotional behavior with that of humans, ultimately improving their effectiveness as intelligent assistants.
Overall, the paper contributes to ongoing discussions about improving the empathetic capabilities of LLMs, addressing a significant aspect of their integration into human-centered applications.