Open-Vocabulary Multimodal Emotion Recognition
- MER-OV is a computational paradigm that uses audio, visual, and textual data to generate open, context-rich emotion descriptors.
- It leverages state-of-the-art multimodal large language models and fusion networks to overcome the limitations of fixed-label emotion recognition.
- The framework employs hybrid LLM-human annotation pipelines and set-based evaluation metrics to ensure robust and scalable emotion analysis.
Open-Vocabulary Multimodal Emotion Recognition (MER-OV) refers to the computational task of inferring human emotion states from multimodal data—commonly visual (video/image), auditory (speech, environmental sounds), and textual (transcript)—and expressing predictions as arbitrary emotion descriptors taken from an open, essentially unlimited vocabulary, rather than a fixed set of emotion categories such as Ekman's six classes. This paradigm is motivated by the recognition that fixed-label sets cannot accommodate the complexity, subtlety, and multi-appraisal nature of actual human emotional experience, as revealed in both affective computing and contemporary cognitive science (Lian et al., 2024, Han et al., 24 Dec 2025). Modern MER-OV frameworks leverage multimodal LLMs (MLLMs), cross-modal fusion networks, or embedding-based architectures to achieve fine-grained, context-rich, and scalable emotion recognition, often incorporating explanation or justification for predictions.
1. Motivation and Formal Problem Definition
Traditional multimodal emotion recognition (MER) systems operate over predefined label sets (e.g., "happy", "sad", "angry") and are constrained by taxonomic rigidity, annotation bottlenecks, and poor coverage of minority or nuanced emotional states (Lian et al., 2024). The MER-OV paradigm formalizes the prediction task as learning a function

$$f : (x_a, x_v, x_t) \mapsto E \subseteq \mathcal{V},$$

where $x_a$, $x_v$, and $x_t$ denote the audio, visual, and text streams of a sample, $E$ is the predicted descriptor set, and $\mathcal{V}$ is the space of all possible emotion terms (words, phrases, sentences). Unlike one-hot or multi-label schemes, MER-OV requires generation (and justification) of any number and type of emotion descriptors, accommodating compound, subtle, and dynamic affect (Han et al., 24 Dec 2025, Lian et al., 2023).
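At the interface level, this amounts to mapping a multimodal sample to a variable-size set of free-form strings. The minimal Python sketch below fixes only that signature; the `Sample` fields and `OpenVocabEmotionRecognizer` protocol are illustrative names under assumed conventions, not the API of any cited framework.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Sample:
    audio: bytes  # raw or encoded waveform (x_a)
    video: bytes  # raw frames or encoded clip (x_v)
    text: str     # transcript (x_t)


class OpenVocabEmotionRecognizer(Protocol):
    """Any MER-OV model: multimodal sample in, open set of descriptors out."""

    def predict(self, sample: Sample) -> set[str]:
        """Return any number of free-form emotion descriptors (words, phrases, sentences)."""
        ...
```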
Key drivers include: (1) psychological theories estimating tens of thousands of distinct human emotions (Lian et al., 2024); (2) the need for models that adapt to cultural, situational, and subjective variance; (3) limitations of annotation protocols reliant on majority voting or closed taxonomies (Lian et al., 2024).
2. Datasets and Annotation Methodologies
MER-OV requires datasets annotated with open-vocabulary emotion labels, often relying on hybrid human–LLM procedures due to the infeasibility of exhaustive manual labeling. Notable datasets include OV-MERD (Lian et al., 2024, Han et al., 24 Dec 2025), MER2024-OV (Lian et al., 2024, Ge et al., 2024), and benchmarks derived from EMER (Lian et al., 2023). Annotation typically involves:
- Multi-stage LLM–human pipelines: Visual and acoustic clues are extracted by dedicated LLMs (e.g., GPT-4V for video, SALMONN for audio), validated or augmented by experts, and merged into unified, free-form multimodal descriptions (Lian et al., 2024).
- Label extraction: Final emotion sets are mined by LLMs from these descriptions, sometimes translated and synonym-grouped to ensure consistency (Han et al., 24 Dec 2025).
- Taxonomic mapping and grouping: Labels are grouped into equivalence classes using taxonomy-driven (e.g., Parrott's tree) or embedding-based clustering, permitting semantic evaluation (Wu et al., 26 Sep 2025); a minimal sketch of the embedding-based variant follows this list.
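To make the grouping step concrete, the sketch below greedily merges labels whose embeddings exceed a cosine-similarity threshold. The `embed` callable stands in for any sentence-embedding model, and the greedy strategy and threshold are illustrative assumptions rather than the procedure of any specific dataset pipeline.

```python
import numpy as np
from typing import Callable, Sequence


def group_labels(
    labels: Sequence[str],
    embed: Callable[[Sequence[str]], np.ndarray],  # any sentence-embedding model
    threshold: float = 0.8,
) -> list[set[str]]:
    """Greedily merge labels whose embeddings exceed a cosine-similarity threshold."""
    vecs = np.asarray(embed(labels), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalise rows
    groups: list[set[str]] = []
    reps: list[np.ndarray] = []  # one representative (first-seen) vector per group
    for label, vec in zip(labels, vecs):
        sims = [float(vec @ r) for r in reps]
        if sims and max(sims) >= threshold:
            groups[int(np.argmax(sims))].add(label)
        else:
            groups.append({label})
            reps.append(vec)
    return groups
```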
Dataset statistics:
| Dataset | Samples | Modalities | Unique labels | Avg. labels/sample |
|---|---|---|---|---|
| OV-MERD | 332 | A,V,T | 248 | ~3.3 |
| MER2024-OV | 332 | A,V,T | 301 | ~2.92 |
| EMER | 332 | A,V,T | 301 | ~3 |
Expert annotation, LLM verification, and synonym expansion are standard procedures, enabling robust, reproducible open-vocabulary label sets for benchmarking (Lian et al., 2023, Lian et al., 2024).
3. Model Architectures and Fusion Strategies
MER-OV models predominantly employ multimodal deep neural architectures, often leveraging LLMs for decoding free-form emotion descriptions or label sets. Prominent frameworks:
- MLLM Fusion: Modern benchmarks demonstrate the best results using a two-stage fusion methodology, extracting modality-specific "emotion clues" (visual gestures, audio pitch, textual keywords) via dedicated MLLMs (e.g., InternVL2.5, Qwen2-Audio) before projection and final label-generation by a strong LLM (Han et al., 24 Dec 2025, Ge et al., 2024). Trimodal approaches (A,V,T) yield Fâ‚› up to 61.0%, with video found to be the dominant modality.
- Label Encoder-Guided CLIP-based Models: MER-CLIP uses frozen CLIP text encoders as label embedding generators, supporting arbitrary textual descriptions as labels and enabling open-vocabulary cosine-similarity classification (Song et al., 1 Jun 2025); a minimal sketch of this scoring scheme appears after this list.
- Explainable EMER Models: EMER and variants generate evidence-based explanations from multimodal features, then mine open-vocabulary labels directly from these natural-language summaries (Lian et al., 2023, Zhang, 2024).
- Continuous-Valence Models: Some frameworks embed emotion states in continuous Valence–Arousal–Dominance (VAD) spaces, with nearest-neighbor retrieval for open-vocab label generation (Jia et al., 2024).
- Semi-supervised Fusion Backbones: Methods such as Conv-Attention combine convolutional and attention-based fusion branches, using pseudo-labeled data to increase coverage and robustness (Cheng et al., 2024).
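To illustrate the label-encoder idea from the CLIP-based bullet above, the PyTorch sketch below scores fused multimodal features against text embeddings of arbitrary candidate descriptions via cosine similarity. It is a minimal sketch, not MER-CLIP's architecture: the single linear projection, the dimensions, and the choice of frozen text encoder are assumptions.

```python
import torch
import torch.nn.functional as F


class LabelEncoderClassifier(torch.nn.Module):
    """Score fused multimodal features against embeddings of arbitrary textual labels."""

    def __init__(self, fused_dim: int, embed_dim: int = 512):
        super().__init__()
        # project fused audio-visual-text features into the label-embedding space
        self.proj = torch.nn.Linear(fused_dim, embed_dim)

    def forward(self, fused: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
        # fused:     (batch, fused_dim)    multimodal sample features
        # label_emb: (n_labels, embed_dim) frozen text-encoder embeddings of candidate labels
        z = F.normalize(self.proj(fused), dim=-1)
        labels = F.normalize(label_emb, dim=-1)
        return z @ labels.T  # cosine similarities, shape (batch, n_labels)
```

Because candidate labels enter only through their text embeddings, new descriptors can be scored at inference time without retraining a fixed output head.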
4. Evaluation Metrics and Benchmarking
MER-OV evaluation requires metrics sensitive to variable-length open sets and semantic similarity between predictions and ground truth. Core metrics include:
- Set-level Precision and Recall: After grouping labels via synonym expansion or taxonomy, precision and recall are defined as

$$\mathrm{Precision}_s = \frac{|G(\hat{Y}) \cap G(Y)|}{|G(\hat{Y})|}, \qquad \mathrm{Recall}_s = \frac{|G(\hat{Y}) \cap G(Y)|}{|G(Y)|},$$

where $G(Y)$ and $G(\hat{Y})$ are the sets of grouped ground-truth and predicted labels, respectively (Lian et al., 2023, Lian et al., 2024); a sketch of this computation follows the metric list.
- Emotion-Wheel (EW) Proximity: Some models (e.g., AffectGPT-R1) optimize and evaluate directly on EW-based semantic distance metrics, scoring predicted terms by their proximity in a continuous affective space (Lian, 2 Aug 2025).
- Open-vocabulary accuracy and recall: Exact match and recall of ground-truth descriptors (words, phrases, sentences), sometimes reported as P@k or mAP for ranked predictions (Ge et al., 2024, Cheng et al., 2024).
- Qualitative judgment tasks: Custom emotion statement judgment protocols (e.g., ESJ) evaluate interpretation, context, and subjectivity via human-guided correctness labels (Wu et al., 26 Sep 2025).
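The set-level metrics above reduce to simple set arithmetic once a label-to-group mapping is fixed. The sketch below assumes such a mapping has been precomputed (via synonym expansion, taxonomy, or clustering) and is illustrative rather than a reproduction of any benchmark's exact grouping rules.

```python
def set_level_scores(
    pred: set[str], gold: set[str], group: dict[str, str]
) -> tuple[float, float, float]:
    """Set-level precision, recall, and F over grouped labels (see formulas above).

    `group` maps each raw label to a group identifier (synonym set, taxonomy node,
    or cluster id); labels missing from the map fall back to themselves.
    """
    gp = {group.get(x, x) for x in pred}  # G(Y_hat): grouped predictions
    gg = {group.get(x, x) for x in gold}  # G(Y): grouped ground truth
    inter = len(gp & gg)
    precision = inter / len(gp) if gp else 0.0
    recall = inter / len(gg) if gg else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f
```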
Representative benchmark results:
| Model | Precisionâ‚› | Recallâ‚› | Fâ‚› / Avg |
|---|---|---|---|
| GPT-4V (multimodal) | 48.5 | 64.9 | 55.5 |
| Conv-Attention/Emotion-LLaMA | 69.61 | 62.59 | 66.10 |
| AffectGPT-R1 (EW metric) | — | — | 66.35 |
| EMER(Multi) (upper bound) | 80.05 | — | 80.05 |
State-of-the-art models consistently outperform chance-level baselines and closed-set classifiers, achieving robust performance on both primary and nuanced emotional states (Lian et al., 2024, Han et al., 24 Dec 2025).
5. Key Innovations and Qualitative Advances
MER-OV architectures exhibit several core advances:
- Open-vocabulary generalization: Label encoders (e.g., CLIP-based, LLM-driven) enable prediction on arbitrary descriptors (words, phrases, sentences), eliminating dependence on fixed output heads (Song et al., 1 Jun 2025, Lian et al., 2024).
- Context and subjectivity modeling: Automated pipelines (e.g., INSETS) generate statements factoring in context, roles/personas, and perception subjectivity, increasing the fidelity of emotion classification (Wu et al., 26 Sep 2025).
- Fine-grained multimodal cue extraction: Models like MicroEmo integrate global-local attention over facial regions and utterance-aware Q-Formers, improving temporal and contextual granularity in prediction (Zhang, 2024).
- Semi-supervised and reinforcement learning for metric optimization: Some frameworks (e.g., AffectGPT-R1) employ reinforcement learning to maximize non-differentiable EW metrics, while others use pseudo-labeling and sample weighting to expand annotation coverage (Lian, 2 Aug 2025, Cheng et al., 2024); an illustrative reward sketch follows this list.
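As an illustration of the reward-shaping idea in the item above, the sketch below wraps a generic descriptor-proximity function as a scalar reward suitable for policy-gradient fine-tuning. The symmetric best-match averaging and the `proximity` callable are illustrative assumptions; this is not the emotion-wheel metric or training loop of AffectGPT-R1.

```python
from typing import Callable, Sequence


def emotion_reward(
    predicted: Sequence[str],
    reference: Sequence[str],
    proximity: Callable[[str, str], float],  # descriptor similarity in [0, 1]
) -> float:
    """Symmetric best-match proximity between predicted and reference descriptors.

    Non-differentiable set-level scores like this are exposed to the model as a
    scalar reward during policy-gradient fine-tuning rather than as a loss.
    """
    if not predicted or not reference:
        return 0.0
    # how well each predicted descriptor is covered by some reference, and vice versa
    p_cover = sum(max(proximity(p, r) for r in reference) for p in predicted) / len(predicted)
    r_cover = sum(max(proximity(p, r) for p in predicted) for r in reference) / len(reference)
    return 0.5 * (p_cover + r_cover)
```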
Qualitative observations consistently report correct retrieval of minority or composite labels ("frustrated", "resigned", "exuberant", etc.) and appropriate alignment between model explanations and multimodal evidence (Lian et al., 2023, Ge et al., 2024).
6. Limitations, Open Problems, and Future Directions
While MER-OV frameworks surpass closed-set models in label diversity and expressiveness, several limitations persist:
- Dataset scale: Most benchmarks (e.g., OV-MERD, MER2024-OV) remain much smaller than required for comprehensive training and cross-cultural generalization (Lian et al., 2024, Lian et al., 2023).
- Reliance on proprietary LLM APIs: Annotator pipelines depend extensively on GPT-series or similar models, challenging reproducibility (Lian et al., 2024).
- Subjectivity and reliability: Current annotation and evaluation protocols inadequately capture inter-annotator consistency, personalized emotion profiles, and explanation faithfulness (Wu et al., 26 Sep 2025).
- Fusion and prompt bottlenecks: Optimal fusion strategies (e.g., two-stage, Conv-Attention) and prompt designs (e.g., self-consistency, few-shot) remain highly empirical and model-dependent (Han et al., 24 Dec 2025, Cheng et al., 2024).
- Metric alignment: Absence of differentiable metrics for open-vocabulary set similarity restricts end-to-end optimization, though RL-based approaches (AffectGPT-R1) begin to address this gap (Lian, 2 Aug 2025).
Future research will likely focus on: (1) scalable, multilingual, and cross-domain dataset expansion; (2) unsupervised or active annotation strategies; (3) integrated architectures enabling continual, open-set generalization; (4) advanced metric-learning for fine-grained label alignment; and (5) human-in-the-loop evaluation to optimize subjectivity and explanation quality (Lian et al., 2024, Wu et al., 26 Sep 2025).
7. Historical Trajectory and Impact
The MER-OV paradigm emerged in response to the limitations of majority-vote, fixed-class emotion recognition in audio-visual-textual corpora, bolstered by psychological evidence for emotion complexity, and the rise of generative MLLMs capable of unrestricted, contextually-justified label production (Lian et al., 2024, Lian et al., 2023). Early benchmarks formalized set-based precision/recall metrics and user-centered annotation pipelines; subsequent innovations generalized to metric-learning, RL optimization, and multimodal fusion (Song et al., 1 Jun 2025, Lian, 2 Aug 2025, Jia et al., 2024). MER-OV now serves as a critical foundation for affective computing, empathetic human-AI interfaces, and personalized emotion-based applications, demanding continual advances in scale, reliability, and interpretability.
MER-OV research is now at the forefront of multimodal understanding, establishing benchmarks and architectures for fine-grained, semantically rich, and context-sensitive emotion AI systems, with broad cross-disciplinary implications for cognitive science, HCI, and affective computing (Han et al., 24 Dec 2025, Lian et al., 2024, Lian et al., 2023).