EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models (2312.06281v2)

Published 11 Dec 2023 in cs.CL and cs.AI

Abstract: We introduce EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in LLMs. We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://eqbench.com

Overview of "EQ-Bench: An Emotional Intelligence Benchmark for LLMs"

The paper, authored by Samuel J. Paech, introduces EQ-Bench, a novel benchmark designed to evaluate emotional intelligence (EI) in LLMs. Its primary focus is emotional understanding (EU), a branch of EI defined as the ability to comprehend and interpret complex emotions within social interactions.

Motivation and Methodological Advancements

Existing LLM benchmarks primarily assess general knowledge or specific abilities such as coding; few directly target emotional comprehension, despite its relevance to human-like interaction. EQ-Bench aims to fill this gap by offering a targeted assessment of EU, using dialogues crafted to reflect emotionally charged situations.

Key improvements over prior benchmarks, particularly SECEU, include:

  1. Author-Determined Reference Answers: Unlike SECEU, which uses crowd-sourced consensus answers, EQ-Bench uses reference answers determined by the author, avoiding potential bias in the reference set.
  2. Complex Dialogue Scenarios: Questions are based on GPT-4-generated dialogues depicting scenes of tension or conflict, providing a nuanced basis for assessing EU.
  3. Removal of the Summation Constraint: SECEU requires the four emotion-intensity ratings to sum to ten; EQ-Bench drops this constraint so that each emotion can be rated independently and more intuitively.
  4. Diverse Emotion Selection: Rather than offering four similarly plausible emotions, each question presents a diverse range of emotions, reducing ambiguity about the intended answer.

Methodology

The benchmark poses questions that require LLMs to rate, on a scale of 0-10, the intensity of four emotional states a character might feel at the conclusion of a dialogue. This structure enables objective scoring: a model's ratings are compared against predetermined reference answers, and the divergence between the two determines the score.
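
To make the scoring concrete, below is a minimal sketch of how a divergence-based score could be computed for a single question. The normalization onto a 0-100 scale is an illustrative assumption, not necessarily the paper's exact formula.

```python
# Minimal sketch of divergence-based scoring for one EQ-Bench-style question.
# The 0-100 normalization is an illustrative assumption, not the paper's
# exact formula.

def score_question(predicted: dict, reference: dict) -> float:
    """Score one question from four 0-10 emotion-intensity ratings.

    predicted, reference: dicts mapping emotion name -> rating in [0, 10].
    Returns a score in [0, 100], where 100 is an exact match.
    """
    assert predicted.keys() == reference.keys() and len(reference) == 4
    # Sum of absolute differences from the reference; 0 means a perfect match.
    divergence = sum(abs(predicted[e] - reference[e]) for e in reference)
    # Map divergence onto 0-100 (maximum possible divergence is 4 * 10 = 40).
    return 100.0 * (1.0 - divergence / 40.0)

predicted = {"surprise": 2, "anger": 7, "relief": 0, "hurt": 6}
reference = {"surprise": 3, "anger": 8, "relief": 0, "hurt": 7}
print(score_question(predicted, reference))  # 92.5
```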

The paper also describes a comprehensive, automated benchmarking pipeline, released as open source to encourage community engagement.
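
One step such a pipeline must handle is extracting the four numeric ratings from free-form model output. The sketch below shows one way to do this; the "Emotion: score" reply format is a hypothetical illustration, not necessarily the exact format the EQ-Bench pipeline expects.

```python
# Sketch of one pipeline step: parsing four 0-10 ratings from a model reply.
# The "Emotion: score" format is a hypothetical illustration, not necessarily
# the exact format used by the EQ-Bench pipeline.
import re

def parse_ratings(reply, emotions):
    """Extract 'emotion: score' pairs; return None if any rating is missing."""
    ratings = {}
    for emotion in emotions:
        match = re.search(rf"{re.escape(emotion)}\s*:\s*(\d+(?:\.\d+)?)",
                          reply, re.IGNORECASE)
        if match is None:
            return None  # malformed reply; a real pipeline might retry here
        ratings[emotion] = float(match.group(1))
    return ratings

reply = "Surprise: 2\nAnger: 7\nRelief: 0\nHurt: 6"
print(parse_ratings(reply, ["Surprise", "Anger", "Relief", "Hurt"]))
# {'Surprise': 2.0, 'Anger': 7.0, 'Relief': 0.0, 'Hurt': 6.0}
```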

Results

The results demonstrate EQ-Bench's ability to discriminate between models. For instance, OpenAI's GPT-4 achieves the highest EQ-Bench score, outperforming prominent models such as Anthropic's Claude 2 and Meta's Llama variants. Open-source models like SynthIA-70B also perform strongly, indicating the growing capabilities of community-driven models.

Importantly, EQ-Bench scores correlate strongly with established benchmarks such as MMLU (r = 0.97), suggesting that EU can serve as a proxy for broad intelligence in LLMs. Despite the methodological similarities, the correlation with SECEU is weaker, highlighting EQ-Bench's distinct approach and improved score distribution.
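
For readers who want to reproduce this kind of analysis from per-model results, the Pearson correlation can be computed directly; the scores below are hypothetical placeholders, not figures from the paper.

```python
# Sketch: Pearson correlation between two benchmarks' per-model scores.
# The score values are hypothetical placeholders, not data from the paper.
from statistics import correlation  # available in Python 3.10+

eq_bench_scores = [83.9, 70.2, 62.5, 51.1, 46.8]  # hypothetical EQ-Bench scores
mmlu_scores     = [86.4, 74.0, 68.9, 55.3, 50.2]  # hypothetical MMLU scores

r = correlation(eq_bench_scores, mmlu_scores)  # Pearson's r by default
print(f"Pearson r = {r:.2f}")
```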

Limitations and Future Directions

While EQ-Bench provides insightful assessments, the inherent subjectivity of emotional interpretation poses challenges. Future work may involve refining questions with expert input and potentially establishing human baseline scores. Additionally, bias from GPT-4 generated dialogues could be mitigated with human-authored scenarios.

The benchmark's openness allows it to be adapted transparently as LLM capabilities evolve, with ongoing adjustments anticipated to keep it relevant and robust against models being fine-tuned to overfit the benchmark.

Conclusion

EQ-Bench represents a significant step towards nuanced assessments of LLMs, addressing an essential aspect of interaction—emotional intelligence. By providing a rigorous proxy for general intelligence evaluation, it offers both practical and theoretical insights into the interplay between emotional and broad intelligence within machine learning frameworks.

References (27)
  1. Clark, P., et al. (2018). Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  2. Dettmers, T., et al. (2021). 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861.
  3. Elyoseph, Z., et al. (2023). ChatGPT outperforms humans in emotional awareness evaluations. Frontiers in Psychology, 14, 1199058.
  4. Goleman, D. (1996). Emotional intelligence: Why it can matter more than IQ. Learning, 24(6), 49-50.
  5. Hendrycks, D., et al. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  6. Hendrycks, D., et al. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
  7. (2022). The scoring challenge of emotional intelligence ability tests: A confirmatory factor analysis approach to model substantive and method effects using raw item scores. Frontiers in Psychology, 13, 812525.
  8. Karcioglu, O., et al. (2018). A systematic review of the pain scales in adults: Which to use? The American Journal of Emergency Medicine, 36(4), 707-714.
  9. Kojima, T., et al. (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199-22213.
  10. Li, X., et al. (2023). AlpacaEval: An automatic evaluator of instruction-following models. GitHub repository.
  11. (2023). MistralOrca: Mistral-7B model instruct-tuned on filtered OpenOrcaV1 GPT-4 dataset. HuggingFace. https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
  12. Lin, S., et al. (2021). TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  13. LMSYS (2023). Chatbot Arena Leaderboard. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard (accessed 2023-12-06).
  14. Mayer, J. D., & Salovey, P. (1997). What is emotional intelligence? Implications for education. In Emotional Development, Emotional Literacy, and Emotional Intelligence. New York: Basic Books.
  15. (2017). Genetic and environmental influences on emotion regulation: A twin study of cognitive reappraisal and expressive suppression. Emotion, 17(5), 772.
  16. Ogurlu, U. (2021). A meta-analytic review of emotional intelligence in gifted individuals: A multilevel analysis. Personality and Individual Differences, 171, 110503.
  17. OpenAI (2023). GPT-4 Technical Report.
  18. Salovey, P., & Mayer, J. D. (1990). Emotional intelligence. Imagination, Cognition and Personality, 9(3), 185-211.
  19. Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  20. Turkheimer, E., et al. (2003). Socioeconomic status modifies heritability of IQ in young children. Psychological Science, 14(6), 623-628.
  21. Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  22. Vernon, P. A., et al. (2008). A behavioral genetic study of trait emotional intelligence. Emotion, 8(5), 635.
  23. Wang, X., et al. (2023). Emotional intelligence of large language models. Journal of Pacific Rim Psychology, 17, 18344909231213958.
  24. Yang, Z., et al. (2023). The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421.
  25. Yao, S., et al. (2023). Tree of Thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
  26. Zellers, R., et al. (2019). HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
  27. Zheng, L., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
Authors
  1. Samuel J. Paech