EQ-Bench: LLM Emotional Intelligence Benchmark
- EQ-Bench is an open-source benchmark that rigorously assesses emotional understanding in large language models through dialogue-based evaluation of emotion intensity.
- Its methodology employs conflict-driven dialogue prompts, expert-authored reference keys, and a two-stage critique plus revision protocol for refined scoring.
- Strong correlations with general reasoning metrics and low scoring variance highlight EQ-Bench’s significance in advancing research on emotionally intelligent AI systems.
EQ-Bench is an open-source, automated benchmark designed to rigorously assess emotional understanding (EU)—a core component of emotional intelligence (EQ)—in LLMs. Unlike multi-domain knowledge, code generation, or human-preference benchmarks, EQ-Bench directly measures the ability of LLMs to interpret the intensity of complex emotions in text-based dialogues. By systematically eliciting nuanced responses through conflict-driven conversational snippets, EQ-Bench enables reproducible, discriminative, and objective evaluation of both proprietary and open-source models. A notable outcome is EQ-Bench’s observed strong correlation with comprehensive cognitive benchmarks, suggesting an intrinsic link between emotional and general reasoning capabilities in LLM architectures (Paech, 2023).
1. Motivation and Conceptual Foundations
EQ-Bench addresses deficiencies in prior LLM evaluation metrics, which typically overlook emotional intelligence despite its recognized influence on decision-making and interpersonal communication. In humans, high EQ is integral for trustworthiness, safety, and natural interaction—properties critical to the deployment of conversational AI. Prior art, including the SECEU benchmark, attempted to rate EU but suffered from compressed scoring ranges, rigid summation constraints, and reliance on crowd-averaged reference answers. EQ-Bench advances the state-of-the-art by providing a highly repeatable, dynamic-range-sensitive measurement that correlates tightly with broad knowledge and reasoning metrics such as MMLU (r = 0.97) (Paech, 2023).
Key methodological innovations include:
- Dialogue-centric prompts with inherent social tension to invoke nontrivial EU.
- Four carefully curated emotions per scenario to foster discrimination.
- No summation constraint at prediction time; objective normalization post-hoc.
- Use of expert-authored (not crowdsourced) reference keys.
- Format-enforced, machine-parsable outputs for automated scoring (see the parsing sketch after this list).
- Zero-shot evaluation protocol augmented by a “critique + revision” reflection step.
- Automated scoring pipeline for robust, low-variance benchmarking.
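To illustrate the last few points, the sketch below parses the kind of structured output the protocol elicits (first-pass scores, a critique, then revised scores, as detailed in Section 4). The per-line "Emotion: number" layout and the regular expression are assumptions made for illustration, not the repository's actual parser.

```python
import re

# Hypothetical example of the structured output EQ-Bench asks models to emit:
# a "First pass scores" block, a free-text critique, and a "Revised scores" block.
EXAMPLE_COMPLETION = """
First pass scores:
Offended: 6
Empathetic: 2
Confident: 7
Dismissive: 3
Critique: On reflection, the speaker seems more defensive than dismissive.
Revised scores:
Offended: 7
Empathetic: 2
Confident: 6
Dismissive: 2
"""

def parse_scores(block: str) -> dict[str, float]:
    """Extract 'Emotion: number' pairs from one block of model output."""
    return {m.group(1).strip(): float(m.group(2))
            for m in re.finditer(r"^([A-Za-z ]+):\s*([0-9]+(?:\.[0-9]+)?)\s*$", block, re.M)}

def parse_completion(text: str) -> tuple[dict, dict]:
    """Split a completion into first-pass and revised score dictionaries."""
    first_raw, _, rest = text.partition("Revised scores:")
    first = parse_scores(first_raw.split("First pass scores:")[-1].split("Critique:")[0])
    revised = parse_scores(rest)
    return first, revised

first, revised = parse_completion(EXAMPLE_COMPLETION)
print(first)    # {'Offended': 6.0, 'Empathetic': 2.0, 'Confident': 7.0, 'Dismissive': 3.0}
print(revised)  # {'Offended': 7.0, 'Empathetic': 2.0, 'Confident': 6.0, 'Dismissive': 2.0}
```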
2. Dataset Construction and Task Specification
The test corpus consists of 60 English dialogue snippets, each portraying a context of conflict or nuanced social interaction, synthetically generated with GPT-4 using randomized parameters for setting and tone. Each instance concludes with a prompt: “At the end of this dialogue, <Name> would feel: [Emotion1, Emotion2, Emotion3, Emotion4]. Give each emotion a score 0–10.” Each dialogue is paired with four distinct emotion labels, mixing emotions that genuinely apply with incorrect and deliberately ambiguous distractors. Reference intensities for each emotion, ranging from 0 to 10, are set by the creators to avoid ceiling effects and to prevent norming to crowd intuition. Both predictions and references are normalized to sum to 10, enforcing comparative rather than absolute calibration.
Emotion categories span a wide spectrum: e.g., Surprised, Confused, Angry, Forgiving; Offended, Empathetic, Confident, Dismissive; Worried, Relieved, Anxious, Hopeless. For instance, in a scenario where characters Cecilia and Brandon discuss the power of words in art, possible emotions include Offended, Empathetic, Confident, and Dismissive.
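A minimal sketch of how such an item might be represented and rendered into the prompt template quoted above is given below. The dialogue lines, field names, and reference intensities are illustrative placeholders, not values taken from the benchmark.

```python
# Hypothetical representation of a single EQ-Bench item, mirroring the task
# description above. The dialogue lines and reference intensities are
# illustrative placeholders, not drawn from the actual test set.
example_item = {
    "dialogue": (
        "Cecilia: Words can wound more deeply than any brush stroke.\n"
        "Brandon: Or heal, if the artist chooses them with care."
    ),
    "subject_name": "Brandon",
    "emotions": ["Offended", "Empathetic", "Confident", "Dismissive"],
    "reference": {"Offended": 0, "Empathetic": 6, "Confident": 3, "Dismissive": 1},
}

def render_prompt(item: dict) -> str:
    """Append the standard rating question to the dialogue snippet."""
    emotions = ", ".join(item["emotions"])
    return (
        f"{item['dialogue']}\n\n"
        f"At the end of this dialogue, {item['subject_name']} would feel: "
        f"[{emotions}]. Give each emotion a score 0-10."
    )

print(render_prompt(example_item))
```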
3. Scoring Protocols and Evaluation Criteria
EQ-Bench employs a two-stage, reflection-based evaluation for each prompt:
- The LLM generates “first pass” intensity ratings (0–10) for all four candidate emotions.
- The model is then prompted to critique its outputs and submit “revised scores.”
Normalization ensures that the subject and reference intensity vectors each have an L₁-norm of 10. The per-question alignment score is then computed as

$$\text{score}_q = 10 - \sum_{i=1}^{4} \left| s_i - r_i \right|,$$

where $s_i$ and $r_i$ denote the normalized subject and reference intensities for the $i$-th candidate emotion. A score of 10 indicates perfect correspondence; random guessing yields an expected score near zero. The benchmark score for a given model is the average across all questions (≥50 valid answers required), rescaled to a 0–100 range, and the higher of the first-pass or revised scores is reported as the official result.
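A compact Python sketch of this normalization and scoring logic, written from the description above rather than copied from the repository's scoring code, might look as follows:

```python
def normalize(scores: dict[str, float], target: float = 10.0) -> dict[str, float]:
    """Rescale emotion intensities so their L1-norm equals `target` (here 10)."""
    total = sum(scores.values())
    if total == 0:
        return {k: target / len(scores) for k in scores}  # degenerate all-zero answer
    return {k: v * target / total for k, v in scores.items()}

def question_score(reference: dict[str, float], subject: dict[str, float]) -> float:
    """10 minus the summed absolute deviation between the normalized vectors:
    10 = perfect agreement; random guessing lands near 0."""
    ref, sub = normalize(reference), normalize(subject)
    return 10.0 - sum(abs(ref[e] - sub[e]) for e in ref)

def benchmark_score(per_question_scores: list[float]) -> float:
    """Average over answered questions and rescale to the 0-100 leaderboard range
    (the benchmark requires at least 50 valid answers)."""
    return 10.0 * sum(per_question_scores) / len(per_question_scores)

# Example with illustrative placeholder values (not an actual reference key):
reference = {"Offended": 0, "Empathetic": 6, "Confident": 3, "Dismissive": 1}
subject   = {"Offended": 1, "Empathetic": 7, "Confident": 4, "Dismissive": 2}
print(round(question_score(reference, subject), 2))  # ≈ 7.71
```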
Additional evaluation dimensions include:
- Pearson correlation (r) relative to multi-domain intelligence benchmarks.
- Coefficient of Variation (CV) to quantify repeatability (both computed as in the sketch after this list).
- Mean Squared Error (MSE) of intensity deviations (sometimes reported).
- Pass/fail counts based on output parsability.
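Both headline statistics are standard and straightforward to reproduce; the sketch below uses NumPy on illustrative placeholder numbers, not the published results.

```python
import numpy as np

# Illustrative placeholder data for a handful of hypothetical models,
# not the published benchmark figures.
eq_bench = np.array([62.5, 54.8, 52.4, 51.6, 47.6])
mmlu     = np.array([86.4, 77.1, 74.3, 68.9, 70.0])

# Pearson correlation between the two benchmarks.
r = np.corrcoef(eq_bench, mmlu)[0, 1]

# Coefficient of Variation across repeated runs of one model:
# sample standard deviation expressed as a fraction of the mean.
repeated_runs = np.array([51.2, 50.1, 51.8, 50.7, 51.4])
cv = repeated_runs.std(ddof=1) / repeated_runs.mean()

print(f"Pearson r = {r:.2f}, CV = {cv:.1%}")
```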
4. Model Evaluation, Prompting Protocol, and Reproducibility
EQ-Bench has been used to benchmark a diverse spectrum of LLMs in zero-shot, “critique + revision” settings, encompassing both proprietary and open-source models:
| Model Name | EQ-Bench Score (“Revised”) |
|---|---|
| OpenAI gpt-4-0613 | 62.52 |
| migtissera/SynthIA-70B-v1.5 | 54.83 |
| OpenAI gpt-4-0314 | 53.39 |
| Qwen/Qwen-72B-Chat | 52.44 |
| Anthropic Claude2 | 52.14 |
| meta-llama/Llama-2-70b-chat-hf | 51.56 |
| 01-ai/Yi-34B-Chat | 51.03 |
| OpenAI gpt-3.5-0613 | 49.17 |
| OpenAI gpt-3.5-turbo-0301 | 47.61 |
Prompting follows a zero-shot, fixed-format regimen: models are instructed to print their “First pass scores,” a critique, and “Revised scores.” This structured format enables parsing automation and reliable scoring. The inference temperature is set at 0.01 to minimize stochastic variation; if an output is unparseable, the temperature is raised by 0.15 for each of up to five retries. Open-source models are quantized via bitsandbytes (8-bit for models of ≤34B parameters, 4-bit otherwise).
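A simplified sketch of this temperature-escalation retry policy is shown below; the `demo_backend` function stands in for whatever inference backend is used and is not part of the EQ-Bench codebase.

```python
from typing import Callable, Optional

def is_parseable(completion: str) -> bool:
    """True if both score blocks can be located (cf. the parser sketch in Section 1)."""
    return "First pass scores:" in completion and "Revised scores:" in completion

def run_with_retries(prompt: str,
                     generate: Callable[[str, float], str],
                     base_temperature: float = 0.01,
                     step: float = 0.15,
                     max_retries: int = 5) -> Optional[str]:
    """Start near-deterministic; raise the temperature after each unparseable output."""
    for attempt in range(max_retries + 1):
        completion = generate(prompt, base_temperature + step * attempt)
        if is_parseable(completion):
            return completion
    return None  # recorded as an unparseable (failed) item

def demo_backend(prompt: str, temperature: float) -> str:
    # Stand-in backend that always returns a well-formed completion.
    return "First pass scores:\n...\nRevised scores:\n..."

print(run_with_retries("dialogue + rating question", demo_backend) is not None)  # True
```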
A full, reproducible pipeline—with 60 prompts, reference keys, batch evaluation scripts, normalizer, scoring code, and analytical notebooks—is provided under the MIT license at https://github.com/EQ-bench/EQ-Bench. The public leaderboard is continuously updated at https://eqbench.com.
5. Results, Error Analysis, and Correlation with Other Benchmarks
EQ-Bench demonstrates a broad score dynamic range across models, enabling fine-grained discrimination. Noteworthy findings include:
- The top-performing model, gpt-4-0613, attains a score of 62.52, while strong open-source models approach but do not match this ceiling.
- The mean improvement from the “critique + revision” step is 9.3%, with the greatest gains observed in weaker models.
- High inter-run repeatability is evidenced by a CV of ≈2.9%.
- Comparison to SECEU reveals a much wider interquartile range (IQR = 53.8 vs. 14.7) and tighter correlations with MMLU, HellaSwag, and ARC.
- The Pearson correlation between EQ-Bench and MMLU is 0.97, indicating that emotional understanding, as measured by this benchmark, very strongly tracks with general reasoning and world knowledge.
6. Limitations and Future Extensions
Recognized limitations include:
- Use of author-generated reference answers introduces potential bias and lacks inter-rater reliability verification.
- All dialogues are synthetic, generated by GPT-4, which may impose stylistic artifacts.
- There are no native human cohort scores, precluding normalization to human EQ means.
- Inherent subjectivity in assigning “correct” emotion intensities remains unresolved.
Ongoing and suggested extensions are:
- Involving certified EI experts to author reference keys, and developing expanded multi-party or multi-turn scenarios.
- Appending human participant data to calibrate model performance against normative EQ curves.
- Expanding dialogue coverage to multiple languages and modalities (e.g., inclusion of nonverbal cues).
- Stress-testing via adversarial prompt design to evaluate benchmark gaming resistance.
7. Significance and Research Implications
EQ-Bench establishes the first systematic, open-access standard for evaluating emotional intelligence in LLMs, providing a robust, objective, and hard-to-game tool for academic and industrial model assessment. Its strong alignment with broad intelligence metrics suggests that emotional understanding is not an isolated skill but one governed by the same deeper reasoning and world-knowledge capabilities that underlie general intelligence in LLMs. Fine-tuning for role-play improves performance substantially, with top open-source models approaching proprietary systems, yet the large margin to the theoretical maximum (100) reveals scope for further advances.
EQ-Bench’s methodology, resources, and active leaderboard enable transparent progress tracking in LLM emotional sophistication, catalyzing future research into safe and socially intelligent AI systems (Paech, 2023).