
HitEmotion: ToM-Based Affective Benchmark

Updated 8 February 2026
  • HitEmotion is a hierarchical benchmark that defines three levels—emotion perception, understanding, and cognition—to assess models’ ability to simulate beliefs, desires, and intentions.
  • It utilizes both closed-label and generative evaluation protocols, enabling precise measurement of cognitive depth and context-aware emotional reasoning.
  • Combining 20,114 test instances from 24 datasets, it employs process-level supervision and GRPO to diagnose cognitive breakpoints in AI models and improve reasoning beyond them.

HitEmotion is a Theory-of-Mind (ToM)-grounded hierarchical benchmark for multimodal affective intelligence, designed to diagnose cognitive depth limitations in state-of-the-art multimodal LLMs (MLLMs). Unlike traditional sentiment or emotion datasets, HitEmotion is explicitly structured to evaluate a model's capacity for simulating the mental states (beliefs, desires, and intentions) necessary for deep, contextually aware emotional reasoning. The benchmark formalizes emotion understanding tasks across three developmental ToM stages, provides both closed-label and generative evaluation protocols, and offers practical tools and process-level supervision methods for improving model faithfulness and rationale coherence (Luo et al., 1 Feb 2026).

1. Motivation and Theory-of-Mind Foundation

Affective intelligence in AI extends beyond cue-based pattern recognition and requires explicit modeling of Theory of Mind, the psychological construct denoting the ability to represent, simulate, and reason about the mental states of others. In this framework, genuine emotion analysis arises from recursive inference chains (simulating what a subject knows, feels, or intends) rather than from direct mappings of multimodal surface features. HitEmotion operationalizes this paradigm by embedding hierarchical ToM reasoning requirements into its benchmark, arranging tasks into levels that mirror developmental progressions in ToM, from first-order belief inference through second-order recursive mind modeling. This design enables systematic measurement of breakdown points in AI models' capacity for affective cognition. As the benchmark's results demonstrate, the absence of explicit ToM scaffolding reduces models to shallow retrievers susceptible to conflicting or misleading signals (Luo et al., 1 Feb 2026).

2. Hierarchical Benchmark Structure

HitEmotion structures its task suite into three levels, each defined by increasing ToM cognitive depth. Let $X = (T, A, V)$ represent the multimodal inputs (text, audio, video), with output $Y$ consisting of a reasoning chain $r$ and a final answer $o$, formalized as $f: X \to (r, o)$.

  • Level 1: Emotion Perception & Recognition (EPR). First-order mappings from perceptual signals (image, sound, text) to explicit emotion or sentiment classes. Representative tasks (10 total): Face Expression Sentiment Detection, Speech Emotion Recognition, Image Sentiment Analysis, Opinion Sentiment Analysis, etc.
  • Level 2: Emotion Understanding & Analysis (EUA). Relational and contextual mind modeling: inferring not only emotional states but also underlying intent, function, or social stance. Representative tasks (8 total): Persuasion Detection in Memes, Humor Understanding, Multiparty Dialogue Emotion Recognition, Multimodal Aspect-Based Sentiment Analysis.
  • Level 3: Emotion Cognition & Reasoning (ECR). Causal and second-order recursive reasoning: tasks demand explanation of causes, temporal and intentional dynamics, and nonliteral constructs (e.g., sarcasm, laughter). Representative tasks (6 total): Emotion Elicitation Reasoning, Sarcasm Detection, Sentiment Flip Analysis, Multimodal Emotion Cause Pair Extraction.

This stratification enables the identification of “cognitive breakpoints,” where model performance declines as ToM complexity increases.
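
For illustration only, the taxonomy can be captured as a simple mapping from level to its name and a few representative tasks (abridged from the list above; this structure is not part of the benchmark release):

    # Abridged sketch of the three-level HitEmotion taxonomy.
    HITEMOTION_LEVELS = {
        1: ("Emotion Perception & Recognition (EPR)",
            ["Face Expression Sentiment Detection", "Speech Emotion Recognition"]),
        2: ("Emotion Understanding & Analysis (EUA)",
            ["Humor Understanding", "Multiparty Dialogue Emotion Recognition"]),
        3: ("Emotion Cognition & Reasoning (ECR)",
            ["Sarcasm Detection", "Multimodal Emotion Cause Pair Extraction"]),
    }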

3. Dataset Aggregation and Annotation Protocol

HitEmotion consolidates and restructures 24 publicly released datasets, producing 20,114 standardized test instances in a unified closed-label Q&A format with multiple-choice or short generative answers. Modalities span static images, 16-frame video clips, audio segments, and text. Annotation for each sample consists of a “prompt–answer–context” triplet, maintaining the original labels while enforcing format consistency. For example, a face expression sentiment recognition prompt may request micro-expression and prosody decoding and offer fixed answer choices (e.g., {neutral, positive, negative}). One-third of each dataset undergoes dual-annotator cross-review and arbitration to ensure rigorous “prompt → chain → answer” alignment, and only official test splits are used to prevent data leakage, providing high-quality ground truth for both training and evaluation (Luo et al., 1 Feb 2026).
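
A sketch of what one standardized instance might look like as a record (all field names are illustrative, not the benchmark's released schema):

    # Hypothetical "prompt–answer–context" triplet for a Level 1 item;
    # field names and file names are invented for illustration.
    instance = {
        "task": "Face Expression Sentiment Detection",
        "level": 1,
        "modalities": {"video": "clip_0001.mp4", "audio": "clip_0001.wav"},
        "prompt": "Decode the subject's micro-expressions and prosody. "
                  "What is the overall sentiment?",
        "choices": ["neutral", "positive", "negative"],  # fixed label set
        "answer": "negative",                            # original gold label
        "context": "official test split",
    }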

4. Evaluation Metrics and Protocol

Each benchmark level employs adapted metrics to suit its cognitive demands:

| Level | Key Metrics | Output Types |
|-------|-------------|--------------|
| Level 1 | Accuracy (ACC), Weighted-Average F1 (WAF) | Closed-label |
| Level 2 | ACC, WAF, Micro F1 (MF) | Multi-label / structural |
| Level 3 | ACC/WAF (classification), Exact Match F1 (EMF), LLM Semantic Score (open-form) | Short- and open-form |

Definitions:

  • Accuracy:

$$ACC = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i^*\right]$$

  • Weighted-Average F1:

$$WAF = \sum_{c \in C} \frac{|S_c|}{N} \cdot F1_c, \qquad F1_c = 2 \cdot \frac{P_c \cdot R_c}{P_c + R_c}$$

  • Micro F1:

$$MF = 2 \cdot \frac{Precision_{micro} \cdot Recall_{micro}}{Precision_{micro} + Recall_{micro}}$$
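
As a concrete check of these definitions, a minimal sketch computing ACC and WAF from lists of predicted and gold labels (standard formulas, not code from the benchmark release):

    def accuracy(preds, golds):
        # ACC: fraction of exact matches between predictions and gold labels.
        return sum(p == g for p, g in zip(preds, golds)) / len(golds)

    def weighted_average_f1(preds, golds):
        # WAF: per-class F1 weighted by class support |S_c| / N.
        n, waf = len(golds), 0.0
        for c in set(golds):
            tp = sum(p == c and g == c for p, g in zip(preds, golds))
            fp = sum(p == c and g != c for p, g in zip(preds, golds))
            fn = sum(p != c and g == c for p, g in zip(preds, golds))
            prec = tp / (tp + fp) if tp + fp else 0.0
            rec = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
            waf += golds.count(c) / n * f1
        return waf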

Performance breakdown across model baselines (Gemini-2.5-Pro, GPT-4.1, open-source models) indicates progressive degradation in cognitive-emotional reasoning:

  • Level 1: ACC ≈ 65–78% (some tasks > 70%)
  • Level 2: ACC/MF ≈ 50–60% (only 2/8 tasks > 60%)
  • Level 3: 30–55% (no task above 60%)

This reveals clear bottlenecks as ToM requirements escalate (Luo et al., 1 Feb 2026).

5. ToM-Guided Reasoning Chain and TMPO Optimization

5.1 Reasoning Chain Generation

For each input $X$ and prompt $P$, reasoning is decomposed into a structured chain $r = [r_1, r_2, \ldots, r_K]$, with each step explicitly tied to ToM operations:

  1. Perceptual Simulation (signal decoding)
  2. Cognitive Empathy (hypothesis synthesis)
  3. Perspective Taking (viewpoint attribution)
  4. (When required) Causal Attribution / Recursive Mind Modeling

Pseudocode excerpt, cleaned into runnable Python form (`P.tom_steps` is a hypothetical accessor for the prompt-defined ToM steps; the `generate_step`/`generate_answer` interface follows the original excerpt):

    def reason(model, X, P):
        # X = (T, A, V) multimodal input; P = task prompt.
        r = []                                      # reasoning chain [r_1, ..., r_K]
        for step in P.tom_steps:                    # prompt-defined ToM steps
            r.append(model.generate_step(step, X))  # one ToM operation per step
        o = model.generate_answer(r)                # answer conditioned on the chain
        return r, o

5.2 ToM-Aligned Supervised Fine-Tuning (SFT)

During SFT, reasoning steps are encapsulated in XML-style reasoning tags and final labels in <answer>…</answer> tags, optimizing:

$$L_{SFT}(\theta) = -\,\mathbb{E}_{(P,X),\,y}\left[\log \pi_\theta(y \mid P, X)\right]$$
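
A minimal PyTorch-style sketch of this objective, assuming the model's logits over the tag-wrapped target sequence are already computed (illustrative only, not the paper's training code):

    import torch.nn.functional as F

    def sft_loss(logits, target_ids, pad_id=0):
        # Token-level negative log-likelihood of y given (P, X);
        # padding positions are excluded via ignore_index.
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # (B*T, vocab)
            target_ids.reshape(-1),               # (B*T,)
            ignore_index=pad_id,
        )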

5.3 ToM-Based Preference Optimization (TMPO) via GRPO

Group Relative Policy Optimization (GRPO) refines the model by sampling $N$ chain–answer candidates $\{y_i\}$ and computing a composite reward:

$$R(y) = \lambda_1 R_{structure}(y) + \lambda_2 R_{content}(y) + \lambda_3 R_{process}(y) + \lambda_4 R_{consistency}(y)$$

  • Structure reward assesses XML formatting and correct step sequence.
  • Content reward applies the primary task metric (e.g., ACC, MF, EMF).
  • Process reward favors ToM keyword usage (belief, intention, desire).
  • Consistency reward penalizes contradictions as judged by an LLM.
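
A minimal sketch of this weighted combination, with each component scorer left as a hypothetical callable (scorer internals such as XML checks or LLM judging are not specified here):

    def composite_reward(y, scorers, lambdas=(1.0, 1.0, 1.0, 1.0)):
        # R(y) = λ1·R_structure + λ2·R_content + λ3·R_process + λ4·R_consistency
        # `scorers` maps each component name to a float-returning callable.
        names = ("structure", "content", "process", "consistency")
        return sum(lam * scorers[n](y) for lam, n in zip(lambdas, names))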

The GRPO objective with a KL constraint is:

$$\max_\theta \; \mathbb{E}_{y \sim \pi_\theta}\left[A(y)\right] - \beta \, D_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)$$

where $A(y)$ denotes the group-normalized reward advantage and $\pi_{ref}$ is the SFT policy (Luo et al., 1 Feb 2026).
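
For concreteness, a minimal sketch of a group-normalized advantage, following the standard GRPO recipe of z-scoring rewards within each sampled group (assumed here; the paper's exact normalization may differ):

    import statistics

    def group_advantages(rewards, eps=1e-6):
        # A(y_i) = (R(y_i) - mean(R)) / (std(R) + eps) within one group
        # of N sampled chain–answer candidates.
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]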

6. Performance Insights and Research Impact

HitEmotion reveals systematic cognitive deficits in contemporary MLLMs: performance declines as ToM depth increases, with no baseline exceeding 60% on the most complex Level 3 tasks. ToM-structured prompting uniformly increases performance by 3–10 points on Levels 2–3, validating the utility of explicit reasoning scaffolds. TMPO optimization delivers consistent improvements, with an optimized Qwen2.5-Omni-7B model surpassing closed-source baselines on the majority of tasks and improving rationale coherence and output faithfulness.

The benchmark and associated methodologies provide a high-resolution “Cognitive Compass” to localize and quantify reasoning breakpoints in AI systems’ affective intelligence. Extensions to larger, more omnimodal model backbones, broader social reasoning domains (e.g., negotiation, deception), and further refinement of reward structures for subtle mental state inference are identified as prominent avenues for future research (Luo et al., 1 Feb 2026).
