Emergence of Hierarchical Emotion Organization in Large Language Models (2507.10599v1)

Published 12 Jul 2025 in cs.CL, cs.AI, and cs.LG

Abstract: As LLMs increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels -- a psychological framework that argues emotions organize hierarchically -- we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

Summary

  • The paper introduces a novel tree-construction algorithm that reveals hierarchical emotion structures in LLMs, closely aligning with established psychological emotion wheels.
  • It demonstrates that larger models form deeper and more complex emotion hierarchies, enhancing emotion prediction accuracy through rigorous graph-based metrics.
  • The study uncovers systematic biases in emotion recognition across different demographic personas, emphasizing the need for more inclusive training datasets.

Emergence of Hierarchical Emotion Organization in LLMs

The paper "Emergence of Hierarchical Emotion Organization in LLMs" explores the development of hierarchical emotion structures in LLMs and investigates the alignment of these structures with human psychological models. The paper also addresses the impact of demographic personas on emotion recognition biases and aims to understand the evaluation of LLMs' emotional reasoning capabilities.

Hierarchical Emotion Structures in LLMs

The paper introduces a novel tree-construction algorithm, inspired by emotion wheels (a psychological framework), to reveal emotion hierarchies in LLMs. By analyzing the logit activations of models of varying sizes, the research observes the emergence of hierarchical organization of emotional states as model size increases. The hierarchies formed by larger LLMs increasingly align with established psychological models of emotion, mirroring the way humans organize emotions (Figure 1).

Figure 1: LLMs develop more complex hierarchical organization of emotions with scale, aligning with psychological models.

To construct these hierarchies, the probability distributions over a set of 135 emotion words are analyzed across different situational prompts. The relationships between emotions are captured in a directed acyclic graph built from these probabilistic dependencies, revealing how broader emotions encompass more specific ones within the hierarchy.
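The paper specifies its own construction procedure; the minimal sketch below only illustrates the general idea under simplifying assumptions: an emotion is treated as "active" for a prompt when it ranks among the model's most probable emotion words, and each emotion is attached to a broader (more frequently active) emotion that tends to co-occur with it. The function name and the `top_k` threshold are illustrative choices, not the authors' implementation.

```python
import numpy as np

def build_emotion_tree(probs, emotions, top_k=5):
    """Sketch: attach each emotion to a broader 'parent' emotion based on
    asymmetric co-activation across situational prompts.

    probs    : (n_prompts, n_emotions) array of per-prompt emotion probabilities
    emotions : list of emotion-word strings, same order as the columns of probs
    top_k    : an emotion counts as 'active' for a prompt if it is in the top-k
    """
    n_prompts, n_emotions = probs.shape
    # Binary activation matrix: is emotion j among the top-k for prompt i?
    top_ranks = np.argsort(-probs, axis=1)[:, :top_k]
    active = np.zeros_like(probs, dtype=bool)
    active[np.arange(n_prompts)[:, None], top_ranks] = True

    counts = active.sum(axis=0)  # how often each emotion is active overall
    parent_of = {}
    for child in range(n_emotions):
        best_parent, best_score = None, 0.0
        for parent in range(n_emotions):
            if parent == child or counts[parent] <= counts[child]:
                continue  # a parent should be a broader (more frequently active) emotion
            co_active = np.logical_and(active[:, child], active[:, parent]).sum()
            score = co_active / max(counts[child], 1)  # ~ P(parent active | child active)
            if score > best_score:
                best_parent, best_score = parent, score
        if best_parent is not None:
            parent_of[emotions[child]] = emotions[best_parent]
    return parent_of  # child -> parent edges of a forest-shaped hierarchy
```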

Bias and Emotion Recognition Across Personas

Experimentation demonstrates systematic biases in emotion recognition across different demographic personas. LLMs exhibit deviations in recognition performance when simulating personas of underrepresented groups such as women, Black individuals, and people with lower income or education levels, and these biases are particularly pronounced when multiple minority attributes intersect (Figure 2).

Figure 2: Llama 3.1 shows lower emotion recognition accuracy for underrepresented groups compared to majority groups.

Graph-theoretic metrics extracted from the hierarchical emotion trees provide insight into how well LLMs recognize emotions: more complex hierarchies correlate with better accuracy on emotion prediction tasks. In particular, longer paths and greater depth in the emotion trees tend to accompany stronger predictive performance, suggesting that larger networks develop a more nuanced representation of the emotional landscape.
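The exact metrics are defined in the paper; as a rough illustration, simple depth statistics can be read off a child-to-parent mapping such as the one sketched above:

```python
def tree_depths(parent_of):
    """Depth of each emotion in a child -> parent mapping (roots have depth 0)."""
    def depth(node, seen=()):
        parent = parent_of.get(node)
        if parent is None or parent in seen:  # root, or guard against accidental cycles
            return 0
        return 1 + depth(parent, seen + (node,))
    nodes = set(parent_of) | set(parent_of.values())
    return {node: depth(node) for node in nodes}

# Toy example: joy is the parent of optimism, happiness the parent of joy.
depths = tree_depths({"optimism": "joy", "joy": "happiness"})
print(max(depths.values()))                # overall tree depth -> 2
print(sum(depths.values()) / len(depths))  # mean node depth -> 1.0
```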

Alignment with Human Psychological Models

The paper provides a comparative analysis of the emotion hierarchies generated by LLMs and traditional human emotion models. Quantitative correlations between the emotion tree structures derived from LLMs and human-annotated emotion wheels validate the alignment of machine-derived emotion hierarchies with human cognitive frameworks (Figure 3).

Figure 3: Hierarchical emotion structures derived from Llama models align closely with human-annotated emotion relationships.

This alignment suggests that larger LLMs not only improve in the breadth of emotional states they can comprehend but also in the hierarchical and organizational depth of these states. These enhancements reflect a finer granularity in the model's emotional processing capabilities.
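The paper reports quantitative correlations between model-derived and human-annotated hierarchies; as a purely hypothetical illustration of how such agreement could be scored, one simple option is the overlap between the two edge sets (this is not necessarily the metric used in the paper):

```python
def edge_overlap(tree_a, tree_b):
    """Jaccard overlap between the child -> parent edge sets of two hierarchies."""
    edges_a, edges_b = set(tree_a.items()), set(tree_b.items())
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0

# Toy example: two of the three edges agree -> overlap of about 0.67.
llm_tree    = {"optimism": "joy", "grief": "sadness"}
human_wheel = {"optimism": "joy", "grief": "sadness", "rage": "anger"}
print(edge_overlap(llm_tree, human_wheel))
```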

Discussion and Implications

The research underscores the importance of using cognitively-grounded theories as benchmarks for evaluating model performance on complex tasks such as emotional reasoning. Emotion recognition accuracy in LLMs increases with model size, which may also improve human-AI interaction capabilities. However, the recognition biases indicate a need for more inclusive training datasets and bias-reduction strategies.

In practical terms, the implementation of these findings can guide the development of emotionally intelligent AI systems that are more attuned to the intricacies of human emotions, potentially benefiting applications such as AI-assisted counseling and human-computer interaction systems. However, meticulous consideration of ethical implications, especially concerning biases, is imperative.

Conclusion

The paper advances our understanding of how LLMs develop hierarchical representations of emotions and highlights both their capabilities and biases. The ongoing refinement of these models in alignment with human psychological frameworks presents promising opportunities for enhancing AI's role in emotionally sensitive applications. Nonetheless, researchers must continue to address the biases inherent in these systems to ensure fair and effective deployment.

Explain it Like I'm 14

Overview

This paper studies how LLMs—like the ones behind chatbots—understand human emotions. The authors ask: Do LLMs organize emotions in a “family tree” like psychologists think people do? And do LLMs show the same kinds of biases people show when recognizing emotions in different social groups?

What questions did the researchers ask?

The paper focuses on three simple questions:

  • Do LLMs naturally arrange emotions in a hierarchy, with broad emotions (like joy) containing more specific ones (like optimism)?
  • Does this emotion understanding improve as the models get bigger?
  • When LLMs try to recognize emotions for different kinds of people (different genders, races, incomes, etc.), do they show any patterns of bias—and are those patterns similar to ones humans show?

How did they study it? (Explained simply)

Think of emotions like a big wheel or a family tree: broad emotions at the top (love, joy, anger, sadness, fear, surprise) and more specific emotions underneath (like trust under love, or frustration under anger). Psychologists have drawn versions of these “emotion wheels” for years.

The researchers built a way to see whether LLMs have a similar structure in their “heads.” Here’s the basic idea:

  • They generated thousands of short situations that express different emotions (for example, “You studied hard and got a great grade” might suggest joy or pride).
  • They asked an LLM to continue each situation with the phrase “The emotion in this sentence is” and looked at which emotion words the model thought were most likely next. You can picture this like the model producing a “confidence score” for each emotion word.
  • They made a big table that counts how often pairs of emotion words tend to show up together across many situations. For example, if “optimism” and “joy” tend to both be likely for the same situations, that pair gets a higher score.
  • Using those scores, they built a “tree” of emotions by connecting specific emotions to broader parent emotions. For instance, if “optimism” usually appears when “joy” appears (but not always the other way around), “joy” becomes the parent of “optimism.”

They tried this with different sizes of LLMs (from smaller to very large) to see how the emotion trees change with scale.
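If you're curious what the "confidence score" step could look like in code, here is a tiny, simplified sketch using an off-the-shelf model from Hugging Face. The model name, the short emotion list, and the first-token scoring shortcut are all simplifying assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration only; the paper studies other (larger) models.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

situation = "You studied hard and got a great grade."
prompt = f"{situation} The emotion in this sentence is"

with torch.no_grad():
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]   # logits for the next token
    next_token_probs = torch.softmax(next_token_logits, dim=-1)

# Tiny subset of the 135 emotion words, scored by the probability of their first token.
for word in ["joy", "pride", "fear", "anger", "sadness"]:
    token_id = tokenizer.encode(" " + word)[0]
    print(f"{word:8s} {next_token_probs[token_id].item():.4f}")
```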

They also tested for bias (a simplified code sketch follows this list):

  • They asked the LLM to identify emotions while “speaking as” different personas (like “As a woman…”, “As a Black person…”, “As a low-income person…”, etc.).
  • They looked at how often the model got the right emotion, and where it tended to make mistakes.
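A simplified sketch of that bias check could look like the function below, where classify_emotion is a hypothetical placeholder for whatever prompting-and-scoring routine is used:

```python
from collections import defaultdict

def persona_accuracy(classify_emotion, dataset, personas):
    """Tally emotion-recognition accuracy per persona.

    classify_emotion : hypothetical callable (persona_prefix, situation) -> predicted label
    dataset          : list of (situation_text, true_emotion) pairs
    personas         : dict mapping persona name -> prompt prefix,
                       e.g. {"woman": "As a woman, ", "low-income": "As a low-income person, "}
    """
    correct = defaultdict(int)
    for name, prefix in personas.items():
        for situation, true_emotion in dataset:
            if classify_emotion(prefix, situation) == true_emotion:
                correct[name] += 1
    return {name: correct[name] / max(len(dataset), 1) for name in personas}
```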

Finally, they ran a small human study and compared human mistakes to the LLM's mistakes.

What did they find?

Here are the main takeaways:

  • Bigger models build better emotion trees. Small models had weak or unclear emotion structures. As models got larger, the trees became deeper and more organized—closer to what psychologists expect (for example, emotions grouped under joy, sadness, anger, etc.).
  • The models’ emotion trees align with human psychology. The way larger models cluster related emotions looks surprisingly similar to classic emotion wheels used in psychology.
  • The structure predicts performance. Models that build richer emotion trees also do better at recognizing emotions in new situations. In simple terms: more organized “emotion maps” lead to more accurate emotion recognition.
  • The models show social biases. When the model “speaks as” different personas, it recognizes emotions more accurately for majority groups (for example, White, male, high-income, highly educated) than for underrepresented groups (for example, Black, female, low-income, less educated). These gaps get worse when multiple minority traits are combined (for instance, low-income Black female persona).
  • Specific misclassifications appear by group. Examples include:
    • For Asian personas, negative emotions often get labeled as “shame.”
    • For Hindu personas, negative emotions often get labeled as “guilt.”
    • For physically disabled personas, many emotions get mislabeled as “frustration.”
    • For Black personas, situations labeled as sadness are more often misclassified as anger.
    • For low-income female personas, other emotions are more often misclassified as fear.
  • Humans and LLMs make similar mistakes. In a small user study, people showed some of the same patterns as the LLM. For example, both Black participants and the LLM speaking as a Black persona tended to interpret fear scenarios as anger. Female participants and the LLM speaking as female tended to confuse anger with fear.

Why does this matter?

  • Understanding emotions is crucial for chatbots and digital assistants that interact with people. If these systems organize emotions similarly to humans, they may become better at empathy, counseling, and everyday conversation.
  • But biases are a serious concern. If an LLM misreads emotions more often for certain groups, it could lead to unfair treatment, misunderstandings, or harm—especially in sensitive areas like mental health or customer support.
  • The method itself is useful. The authors show a way to test models using ideas from psychology. This can help build better evaluations for LLMs—not just for emotions, but potentially for other kinds of human understanding too.

Conclusion and impact

In simple terms, the paper shows that LLMs don’t just label emotions—they seem to organize them in a human-like “family tree,” and this organization gets more refined as models get bigger. That’s exciting because it hints at deeper emotional reasoning. At the same time, the models reflect real-world social biases, especially for intersectional identities, which must be addressed before deploying these systems widely. The approach of using psychology-driven tests offers a promising path for building fairer, more trustworthy AI that understands people better—without forgetting who it might misunderstand.
