HitEmotion: ToM-Based Affective Benchmark
- HitEmotion is a hierarchical benchmark that defines three levels—emotion perception, understanding, and cognition—to assess models’ ability to simulate beliefs, desires, and intentions.
- It utilizes both closed-label and generative evaluation protocols, enabling precise measurement of cognitive depth and context-aware emotional reasoning.
- Combining 20,114 test instances from 24 datasets, it employs process-level supervision and GRPO to diagnose and improve cognitive breakpoints in AI models.
HitEmotion is a Theory-of-Mind (ToM)-grounded hierarchical benchmark for multimodal affective intelligence, designed to diagnose cognitive depth limitations in state-of-the-art multimodal LLMs (MLLMs). Unlike traditional sentiment or emotion datasets, HitEmotion is explicitly structured to evaluate a model’s capacity for simulating the mental states (beliefs, desires, intentions) necessary for deep, contextually aware emotional reasoning. The benchmark formalizes emotion understanding tasks across three developmental ToM stages, provides both closed-label and generative evaluation protocols, and offers practical tools and process-level supervision methods for improving model faithfulness and rationale coherence (Luo et al., 1 Feb 2026).
1. Motivation and Theory-of-Mind Foundation
Affective intelligence in AI extends beyond cue-based pattern recognition and requires explicit modeling of Theory of Mind, the psychological construct denoting the ability to represent, simulate, and reason about the mental states of others. In this framework, genuine emotion analysis arises from recursive inference chains—simulating what a subject knows, feels, or intends—rather than mapping directly from multimodal surface features. HitEmotion operationalizes this paradigm by embedding hierarchical ToM reasoning requirements into its benchmark, arranging tasks into levels that mirror developmental progressions in ToM (from first-order belief inference through second-order recursive mind modeling). This design enables systematic measurement of breakdown points in AI models' capacity for affective cognition. The absence of explicit ToM scaffolding, as demonstrated, reduces models to shallow retrievers susceptible to conflicting or misleading signals (Luo et al., 1 Feb 2026).
2. Hierarchical Benchmark Structure
HitEmotion structures its task suite into three levels, each defined by increasing ToM cognitive depth. Let $X = (T, A, V)$ represent the multimodal inputs (Text, Audio, Video) and $P$ the task prompt; outputs consist of a reasoning chain $r$ and a final answer $o$, formalized as $f: (X, P) \mapsto (r, o)$.
- Level 1: Emotion Perception & Recognition (EPR). First-order mappings from perceptual signals (image, sound, text) to explicit emotion or sentiment classes. Representative tasks (10 total): Face Expression Sentiment Detection, Speech Emotion Recognition, Image Sentiment Analysis, Opinion Sentiment Analysis, etc.
- Level 2: Emotion Understanding & Analysis (EUA). Relational and contextual mind modeling: inferring not only emotional states but also underlying intent, function, or social stance. Representative tasks (8 total): Persuasion Detection in Memes, Humor Understanding, Multiparty Dialogue Emotion Recognition, Multimodal Aspect-Based Sentiment Analysis.
- Level 3: Emotion Cognition & Reasoning (ECR). Causal and second-order recursive reasoning: tasks demand explanation of causes, temporal and intentional dynamics, and nonliteral constructs (e.g., sarcasm, laughter). Representative tasks (6 total): Emotion Elicitation Reasoning, Sarcasm Detection, Sentiment Flip Analysis, Multimodal Emotion Cause Pair Extraction.
This stratification enables the identification of “cognitive breakpoints,” where model performance declines as ToM complexity increases.
3. Dataset Aggregation and Annotation Protocol
HitEmotion consolidates and restructures 24 publicly released datasets, producing 20,114 standardized test instances in a unified closed-label Q&A format with MCQ or short generative answers. Modalities span static images, 16-frame video clips, audio segments, and text. Annotation for each sample consists of a “prompt–answer–context” triplet, maintaining original labels while enforcing format consistency. Example: for face expression sentiment recognition, the prompt may request micro-expression and prosody decoding, offering fixed answer choices (e.g., {neutral, positive, negative}). One-third of each dataset undergoes dual-annotator cross-review and arbitration to ensure rigorous “prompt → chain → answer” alignment, and only official test splits are used to prevent data leakage. This ensures high-quality ground truth for both training and evaluation (Luo et al., 1 Feb 2026).
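To illustrate the unified format, a single standardized instance could look like the following sketch; the field names are illustrative assumptions rather than the official schema, which is described only as a prompt–answer–context triplet:

```python
# Hypothetical standardized HitEmotion test instance in the unified closed-label
# Q&A format; field names are illustrative, not the official release schema.
instance = {
    "task": "Face Expression Sentiment Detection",   # Level 1 (EPR) task
    "modalities": {"video": "clip_0001_16frames.mp4", "audio": "clip_0001.wav", "text": None},
    "prompt": ("Decode the subject's micro-expressions and vocal prosody. "
               "Which sentiment best describes their state?"),
    "choices": ["neutral", "positive", "negative"],   # fixed closed-label options
    "answer": "negative",                             # original dataset label, format-normalized
    "context": "Official test split; dual-annotator cross-reviewed.",
}
```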
4. Evaluation Metrics and Protocol
Each benchmark level employs adapted metrics to suit its cognitive demands:
| Level | Key Metrics | Output Types |
|---|---|---|
| Level 1 | Accuracy (ACC), Weighted-Average F1 (WAF) | Closed-label |
| Level 2 | ACC, WAF, Micro F1 (MF) | Multi-label/structural |
| Level 3 | ACC/WAF (classification), Exact Match F1 (EMF), LLM Semantic Score (open-form) | Short- and open-form |
Definitions:
- Accuracy: $\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right]$
- Weighted-Average F1: $\mathrm{WAF} = \sum_{c=1}^{C}\frac{n_c}{N}\,\mathrm{F1}_c$, where $\mathrm{F1}_c = \frac{2\,P_c R_c}{P_c + R_c}$ and $n_c$ is the support of class $c$
- Micro F1: $\mathrm{MF} = \frac{2\sum_{c}\mathrm{TP}_c}{2\sum_{c}\mathrm{TP}_c + \sum_{c}\mathrm{FP}_c + \sum_{c}\mathrm{FN}_c}$
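These metrics follow their standard definitions and can be computed directly; a minimal sketch using scikit-learn (the tooling choice is an assumption, not part of the benchmark release):

```python
# Minimal sketch: computing the closed-label metrics with scikit-learn.
# Library choice and example labels are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["negative", "neutral", "positive", "negative"]   # gold closed labels
y_pred = ["negative", "positive", "positive", "negative"]  # model predictions

acc = accuracy_score(y_true, y_pred)                 # ACC
waf = f1_score(y_true, y_pred, average="weighted")   # Weighted-Average F1 (WAF)
mf  = f1_score(y_true, y_pred, average="micro")      # Micro F1 (MF)

print(f"ACC={acc:.3f}  WAF={waf:.3f}  MF={mf:.3f}")
```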
Performance breakdown across model baselines (Gemini-2.5-Pro, GPT-4.1, open-source models) indicates progressive degradation in cognitive-emotional reasoning:
- Level 1: ACC ≈ 65–78% (some tasks > 70%)
- Level 2: ACC/MF ≈ 50–60% (only 2/8 tasks > 60%)
- Level 3: 30–55% (no task above 60%)
This reveals clear bottlenecks as ToM requirements escalate (Luo et al., 1 Feb 2026).
5. ToM-Guided Reasoning Chain and TMPO Optimization
5.1 Reasoning Chain Generation
For each input $X$ and prompt $P$, reasoning is decomposed into a structured chain $r = (r_1, \dots, r_K)$, with each step $r_k$ explicitly tied to ToM operations:
- Perceptual Simulation (signal decoding)
- Cognitive Empathy (hypothesis synthesis)
- Perspective Taking (viewpoint attribution)
- (When required) Causal Attribution / Recursive Mind Modeling
Pseudocode excerpt:
```
Given input X = (T, A, V) and task prompt P:
    r ← []
    for step in prompt-defined ToM steps:
        r.append(model.generate_step(step, X))
    o ← model.generate_answer(r)
    return (r, o)
```
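A minimal Python rendering of the same loop, assuming a hypothetical `mllm` wrapper with `generate_step` and `generate_answer` methods (this interface is an assumption for illustration, not the released code):

```python
# Hypothetical illustration of ToM-guided chain generation; the MLLM interface
# and the exact step list are assumed placeholders, not the HitEmotion release.
from typing import List, Tuple

TOM_STEPS = [
    "Perceptual Simulation",   # decode raw multimodal signals
    "Cognitive Empathy",       # synthesize hypotheses about the subject's state
    "Perspective Taking",      # attribute the viewpoint to the subject
    "Causal Attribution",      # (when required) causal / recursive mind modeling
]

def tom_guided_inference(mllm, X, prompt: str) -> Tuple[List[str], str]:
    """Generate a ToM-structured reasoning chain r, then a final answer o."""
    r: List[str] = []
    for step in TOM_STEPS:
        # Each step conditions on the input, the task prompt, and prior steps.
        r.append(mllm.generate_step(step=step, inputs=X, prompt=prompt, history=r))
    o = mllm.generate_answer(inputs=X, prompt=prompt, chain=r)
    return r, o
```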
5.2 ToM-Aligned Supervised Fine-Tuning (SFT)
During SFT, reasoning steps are encapsulated in dedicated reasoning tags and final labels in <answer>…</answer> tags, optimizing the standard token-level cross-entropy over the annotated chain and answer:

$$\mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(X, P, r^{*}, o^{*})}\left[\log \pi_{\theta}\left(r^{*}, o^{*} \mid X, P\right)\right]$$
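For concreteness, a single SFT training target might be serialized as in the sketch below; the `<think>` tag name is an illustrative assumption, since only the `<answer>…</answer>` tag is given explicitly:

```python
# Illustrative SFT target string; the <think> tag name is an assumption,
# only <answer>…</answer> is specified in the source description.
sft_target = (
    "<think>"
    "Perceptual Simulation: the speaker's voice trembles and pitch drops. "
    "Cognitive Empathy: she likely believes her request was ignored. "
    "Perspective Taking: from her viewpoint, the outcome feels unfair."
    "</think>"
    "<answer>negative</answer>"
)
```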
5.3 ToM-Based Preference Optimization (TMPO) via GRPO
Group Relative Policy Optimization (GRPO) refines the model by sampling groups of chain–answer candidates per prompt and computing a composite reward from four components (a minimal code sketch follows this list):
- Structure reward assesses XML formatting and correct step sequence.
- Content reward applies the primary task metric (e.g., ACC, MF, EMF).
- Process reward favors ToM keyword usage (belief, intention, desire).
- Consistency reward penalizes contradictions as judged by an LLM.
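A minimal sketch of how such a composite reward could be assembled, assuming equal component weights and hypothetical helpers for the task metric and the LLM judge (the weights, tag names, helper signatures, and keyword list are illustrative assumptions, not the paper's exact configuration):

```python
# Hypothetical composite reward for one sampled chain–answer candidate.
# Component weights, tag names, and helper functions are illustrative assumptions.
import re

TOM_KEYWORDS = ("belief", "believes", "intention", "intends", "desire", "wants")

def composite_reward(candidate: str, gold_answer: str, task_metric, llm_judge) -> float:
    # Structure reward: well-formed tags in the expected order.
    structured = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", candidate, re.S))
    r_structure = 1.0 if structured else 0.0

    # Content reward: primary task metric (e.g., ACC/MF/EMF) on the extracted answer.
    ans = re.search(r"<answer>(.*?)</answer>", candidate, re.S)
    pred = ans.group(1).strip() if ans else ""
    r_content = task_metric(pred, gold_answer)

    # Process reward: presence of explicit ToM vocabulary in the reasoning chain.
    chain = re.search(r"<think>(.*?)</think>", candidate, re.S)
    chain_text = chain.group(1).lower() if chain else ""
    r_process = min(1.0, sum(kw in chain_text for kw in TOM_KEYWORDS) / 3)

    # Consistency reward: an LLM judge scores chain–answer agreement in [0, 1].
    r_consistency = llm_judge(chain_text, pred)

    return (r_structure + r_content + r_process + r_consistency) / 4.0
```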
The GRPO objective with a KL constraint is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(\rho_i A_i,\ \mathrm{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\, A_i\Big)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\big\|\,\pi_{\mathrm{ref}}\right), \qquad \rho_i = \frac{\pi_{\theta}(r_i, o_i \mid X, P)}{\pi_{\theta_{\mathrm{old}}}(r_i, o_i \mid X, P)},$$

where $A_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$ denotes the group-normalized reward advantage and $\pi_{\mathrm{ref}}$ is the SFT policy (Luo et al., 1 Feb 2026).
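A short illustrative sketch of the group-normalized advantage used above (sampling and policy-update machinery are omitted; the function is a sketch, not the paper's implementation):

```python
# Illustrative group-normalized advantages for G sampled candidates of one prompt.
import statistics

def group_advantages(rewards: list) -> list:
    """A_i = (R_i - mean(R)) / std(R), computed within one sampled group."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean_r) / std_r for r in rewards]

# Example: composite rewards for G = 4 sampled chain–answer candidates.
print(group_advantages([0.9, 0.4, 0.6, 0.1]))
```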
6. Performance Insights and Research Impact
HitEmotion reveals systematic cognitive deficits in contemporary MLLMs: performance declines as ToM depth increases, with no baseline exceeding 60% on the most complex Level 3 tasks. ToM-structured prompting uniformly increases performance by 3–10 points on Levels 2–3, validating the utility of explicit reasoning scaffolds. TMPO optimization delivers consistent improvements, with an optimized Qwen2.5-Omni-7B model surpassing closed-source baselines on the majority of tasks and improving rationale coherence and output faithfulness.
The benchmark and associated methodologies provide a high-resolution “Cognitive Compass” to localize and quantify reasoning breakpoints in AI systems’ affective intelligence. Extensions to larger, more omnimodal model backbones, broader social reasoning domains (e.g., negotiation, deception), and further refinement of reward structures for subtle mental state inference are identified as prominent avenues for future research (Luo et al., 1 Feb 2026).