Emotion Recognition in Conversation (ERC)
- Emotion Recognition in Conversation (ERC) is a research field that automates the identification of emotions in multi-turn dialogues using context-aware algorithms.
- It leverages modern NLP techniques—such as transformer-based models, speaker modeling, graph neural networks, and prompt-based learning—to capture both local and global context.
- ERC has significant implications for creating empathetic AI systems and dialogue platforms while addressing challenges like context dependence, class imbalance, and emotion ambiguity.
Emotion Recognition in Conversation (ERC) constitutes a central research domain in affective computing and natural language processing, focusing on the automatic identification of emotions expressed in the utterances of multi-turn, multi-party dialogues. The overarching objective is to assign an emotion label—drawn from a defined set such as Ekman’s six basic emotions plus neutral, or task-oriented extensions—to each utterance within a conversational context, leveraging not only the immediate text but also the preceding (and sometimes following) dialogue history. ERC plays a pivotal role in the deployment of empathetic conversational agents, dialogue systems, and broader emotion-aware AI applications, and has catalyzed the development of a heterogeneous methodological landscape encompassing pre-trained transformer encoders, recurrent and graph neural architectures, distributional learning frameworks, logic-based models, and prompt-based LLMs (Pereira et al., 2023, Kim et al., 2021, Wu et al., 2022, Yu et al., 2024, Bao et al., 2022, Gendron et al., 2024, Fu et al., 2024, Liu et al., 2023, Agarwal et al., 2021, Augustine et al., 2022, Li et al., 2020, Van et al., 2024, Jiao et al., 2020, Yang et al., 2021, Shi et al., 2024, Yi et al., 2022, Li et al., 2022, Ghosh et al., 2022, Pereira et al., 2022, Poria et al., 2019).
1. Formal Task Definition and Research Challenges
ERC is formalized as follows: given a dialogue D = (u_1, u_2, ..., u_N), where each u_i is an utterance produced by a speaker s_i, the task is to predict an emotion label y_i for each u_i such that y_i ∈ E, where E is a fixed emotion set. The predictor f: (u_i, dialogue context) → y_i aims to minimize a loss, typically cross-entropy, over utterances (Pereira et al., 2022, Pereira et al., 2023, Poria et al., 2019). Core challenges include:
- Context Dependence: Emotional interpretation often requires modeling long-range contextual dependencies; predictions based solely on the isolated utterance are inadequate (Pereira et al., 2023, Pereira et al., 2022).
- Speaker Dynamics: Interplay between intra-speaker emotional inertia (self-dependence) and inter-speaker contagion (other-dependence) (Liu et al., 2023, Bao et al., 2022, Pereira et al., 2022).
- Emotion Shift and Ambiguity: Emotion shifts are abrupt and difficult to detect; annotator subjectivity yields inherent label noise and uncertainty (Liu et al., 2023, Wu et al., 2022, Agarwal et al., 2021).
- Fine-grained and Imbalanced Categories: Emotion classes are unequally distributed; rare or semantically similar emotions (e.g., excitement versus happiness) are easily confused (Yu et al., 2024, Van et al., 2024).
- Multiparty and Multimodal Complexity: Multispeaker interactions and multimodal signals (audio, visual) compound context modeling complexity (Van et al., 2024, Agarwal et al., 2021, Li et al., 2022).
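The per-utterance cross-entropy objective from the formal definition above can be sketched in a few lines. This is a toy illustration (the emotion set and the helper names are assumptions for the example, not from any cited system):

```python
import math

# Toy illustration of the ERC objective: per-utterance cross-entropy over a
# fixed emotion set (here Ekman's six basic emotions plus neutral), averaged
# across the dialogue. Names are hypothetical, chosen for this sketch.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def dialogue_cross_entropy(predicted_dists, gold_labels):
    """predicted_dists: one dict per utterance mapping emotion -> probability;
    gold_labels: one gold emotion string per utterance."""
    total = 0.0
    for dist, gold in zip(predicted_dists, gold_labels):
        total += -math.log(dist[gold])  # negative log-likelihood of the gold label
    return total / len(gold_labels)

# A two-utterance toy dialogue: the model is confident on the first
# utterance and uncertain on the second.
preds = [
    {"joy": 0.9, "neutral": 0.1},
    {"neutral": 0.5, "sadness": 0.5},
]
loss = dialogue_cross_entropy(preds, ["joy", "neutral"])
```

Real systems compute this loss over encoder logits, but the quantity being minimized is the same average negative log-likelihood per utterance.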
2. Methodological Paradigms in ERC
The field encompasses a rich spectrum of architectures:
2.1 Transformer-based Utterance Encoding and Context Integration
Contemporary ERC models frequently employ pre-trained transformer encoders (e.g., RoBERTa, BERT) to produce utterance embeddings. Recent advances inject context directly into the encoder input by concatenating the k previous turns with the target utterance, separated by the encoder's separator token.
This approach yields a context-aware [CLS] embedding, eliminating the need for a separate context module and simplifying the architecture (Pereira et al., 2023). CD-ERC exemplifies this design, performing best with a small context window (k = 3) and outperforming both context-independent RoBERTa baselines and more elaborate graph/recurrent models on DailyDialog and EmoWOZ (Pereira et al., 2023).
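Context concatenation of this kind can be sketched as plain string assembly before tokenization. A minimal sketch, not the exact CD-ERC implementation (the separator string is a RoBERTa-style assumption):

```python
# Sketch of context-aware encoder input construction: concatenate up to k
# preceding turns before the target utterance, separated by the encoder's
# separator token. "</s>" is a RoBERTa-style assumption for illustration.
SEP = "</s>"

def build_context_input(utterances, target_idx, k=3):
    """Join the k utterances preceding the target with the target itself."""
    start = max(0, target_idx - k)
    context = utterances[start:target_idx]
    return f" {SEP} ".join(context + [utterances[target_idx]])

dialogue = ["Hi!", "How are you?", "Terrible, I lost my keys.", "Oh no!"]
x = build_context_input(dialogue, target_idx=3, k=3)
# x contains the three previous turns followed by the target "Oh no!"
```

Feeding such a string to the encoder lets the [CLS] position attend over the whole window, which is what makes a separate context module unnecessary.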
2.2 Speaker Modeling and Intra/Inter-Speaker Dependencies
Effective ERC requires explicit modeling of speaker dependencies. EmoBERTa operationalizes this by prepending speaker identities to utterances, thereby enabling the transformer to learn both intra- and inter-speaker relations via self-attention alone (Kim et al., 2021). SGED (Speaker-Guided Encoder-Decoder) introduces dynamic, attention-based speaker state tracking modules that compute distinct intra- and inter-speaker influences, which are fused and fed to a GRU-based decoder (Bao et al., 2022). Masking mechanisms in hierarchical transformers further support fine-grained separation of speaker-dependent context (Li et al., 2020).
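The EmoBERTa-style speaker prepending mentioned above is simple enough to show directly. A minimal sketch (the formatting convention is an assumption; the paper's exact template may differ):

```python
# Minimal sketch of EmoBERTa-style speaker prepending: each utterance is
# prefixed with its speaker's name, so a flat transformer can learn intra-
# and inter-speaker dependencies from self-attention alone.
def prepend_speakers(turns):
    """turns: list of (speaker, utterance) pairs."""
    return [f"{speaker}: {utterance}" for speaker, utterance in turns]

turns = [("Ross", "I'm fine."), ("Rachel", "You don't sound fine.")]
tagged = prepend_speakers(turns)
# tagged == ["Ross: I'm fine.", "Rachel: You don't sound fine."]
```

The design choice here is to push speaker structure into the token sequence itself rather than into dedicated architectural modules, in contrast with SGED's explicit speaker-state tracking.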
2.3 Graph Neural Networks, Hypergraphs, and Metric-Learning
ERC has seen significant innovation in graph-based representations: ConxGNN integrates a multi-scale heterogeneous graph with a hypergraph, capturing both local/global context and high-order cross-modal interactions, with a cross-modal attention fusion layer to align modalities (text, audio, visual). A class-balanced composite loss mitigates class imbalance and similar emotion confusion, achieving state-of-the-art weighted-F1 on IEMOCAP and MELD (Van et al., 2024). Separately, SentEmoContext frames ERC as a metric learning task, embedding context-aware utterances in a space structured by triplet loss, yielding the highest macro-F1 on DailyDialog with a lightweight, efficient pipeline (Gendron et al., 2024).
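The metric-learning framing used by SentEmoContext rests on the standard triplet loss. A self-contained sketch of that loss (toy 2-D embeddings, not the system's actual representations):

```python
import math

# Illustrative triplet loss as used in metric-learning ERC: pull an utterance
# embedding toward a same-emotion "positive" and push it away from a
# different-emotion "negative" by at least a margin.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge on (d(a, p) - d(a, n) + margin): zero once the triplet is separated.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.1, 0.0]   # same emotion, close by
negative = [2.0, 0.0]   # different emotion, far away
loss = triplet_loss(anchor, positive, negative)
# 0.1 - 2.0 + 1.0 < 0, so the loss is 0: this triplet is already separated
```

Training on many such triplets structures the embedding space so that nearest-neighbor or centroid classification recovers the emotion label.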
2.4 Uncertainty Estimation and Distributional Modeling
ERC models increasingly accommodate annotation subjectivity and label ambiguity by estimating the full distribution of possible emotions for each utterance. A Dirichlet prior network models the empirical distribution over annotator votes, optimized via a combination of negative log-likelihood and KL divergence, thus yielding well-calibrated uncertainty and improving not only accuracy but also the capacity to flag out-of-distribution or highly ambiguous utterances (Wu et al., 2022).
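The Dirichlet view can be made concrete: the network outputs concentration parameters α, the expected categorical distribution is α normalized by its sum, and low total evidence flags ambiguity. A minimal sketch under those assumptions (the uncertainty measure K/Σα is one common evidential choice, not necessarily the exact quantity in Wu et al., 2022):

```python
# Sketch of Dirichlet-based prediction plus uncertainty: given concentration
# parameters alpha (one per emotion class), the expected distribution is
# alpha / sum(alpha), and K / sum(alpha) serves as an evidence-based
# uncertainty score (higher = less evidence, i.e. more ambiguous input).
def dirichlet_summary(alpha):
    s = sum(alpha)
    expected = [a / s for a in alpha]
    uncertainty = len(alpha) / s  # in [0, 1] when every alpha_i >= 1
    return expected, uncertainty

# Confident utterance: sharp evidence for class 0.
probs_sharp, u_sharp = dirichlet_summary([20.0, 1.0, 1.0])
# Ambiguous utterance: flat, minimal evidence.
probs_flat, u_flat = dirichlet_summary([1.0, 1.0, 1.0])
```

The same α also reproduces the annotator vote distribution in expectation, which is what the NLL + KL objective in the paper fits.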
2.5 Contrastive, Curriculum, and Prompt-based Learning
EACL leverages emotion label encodings as geometric anchors in a contrastive representation space, equipped with auxiliary losses that push apart representations of similar emotions; a post-hoc adaptation of the anchor positions then fine-tunes the classifier, drastically reducing confusion between similar classes (e.g., happy/excited) (Yu et al., 2024). Hybrid curriculum learning schedules both conversations and labels for model training—first exposing the model to conversations with few emotion shifts and soft-label distributions before increasing task entropy—thus enhancing robustness on difficult and minority-class utterances (Yang et al., 2021). Prompt-based techniques, such as CISPER or LaERC-S, explicitly blend dialogue context and external commonsense into the input prompt for LLMs, recasting ERC as a cloze or generative task and achieving state-of-the-art results (Yi et al., 2022, Fu et al., 2024).
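The cloze recasting used by prompt-based methods is easy to illustrate. A hypothetical template in the spirit of CISPER-style prompting (the exact templates and verbalizers in the cited papers differ):

```python
# Hypothetical cloze-style ERC prompt: verbalize the dialogue history and ask
# the masked language model to fill in the emotion slot for the target
# utterance. The template wording is an assumption for illustration.
def build_cloze_prompt(turns, target_idx):
    history = " ".join(f'{s} says: "{u}"' for s, u in turns[: target_idx + 1])
    speaker, _ = turns[target_idx]
    return f"{history} In this utterance, {speaker} feels <mask>."

turns = [("A", "I got the job!"), ("B", "That is wonderful news!")]
prompt = build_cloze_prompt(turns, target_idx=1)
```

The model's score distribution over emotion words at the `<mask>` position then plays the role of the classifier logits.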
3. Multimodal and Multilingual ERC
State-of-the-art ERC systems increasingly operate in the multimodal domain. Examples include:
- Multimodal Fusion: JFM-based architectures employ iterative cross-modal vector updates for deep fusion of text and audio representations, supervised by inter-class contrastive losses to enhance class separation and mitigate sample imbalance (Shi et al., 2024).
- Emotion Capsules and Emoformer: EmoCaps fuses modality-specific emotion vectors with sentence embeddings into "emotion capsules," processed by context-aware modules (Bi-LSTM or DialogueRNN), achieving SOTA results on IEMOCAP and MELD (Li et al., 2022).
- Emotion Shifts and Multimodal Gating: Emotion-shift modules, such as that of Agarwal et al. (2021), interleave emotion-shift gating with context-aware recurrent backbones, explicitly modeling abrupt emotional transitions.
- Multilingual Corpora and Transfer: M-MELD extends MELD to multiple languages, accompanied by discursive graph-based LSTM backends that exploit XLM-RoBERTa for zero-shot transfer and discourse parsing, generalizing across diverse linguistic regimes (Ghosh et al., 2022).
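The cross-modal fusion step shared by several of these systems can be sketched generically. A toy attention-weighted fusion (an assumption for illustration, not any specific published architecture):

```python
import math

# Toy sketch of attention-weighted multimodal fusion: per-modality relevance
# scores are softmax-normalized and used to mix equal-length modality
# embeddings (text, audio, visual) into a single fused vector.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(modality_vectors, relevance_scores):
    """modality_vectors: list of equal-length embeddings;
    relevance_scores: one scalar per modality (e.g., from a learned scorer)."""
    weights = softmax(relevance_scores)
    dim = len(modality_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, modality_vectors))
            for i in range(dim)]

text, audio, visual = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
fused = fuse([text, audio, visual], relevance_scores=[2.0, 0.5, 0.5])
# text dominates the fused vector because its relevance score is highest
```

Published systems replace the scalar scores with learned cross-modal attention, but the normalize-then-mix pattern is the same.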
4. Evaluation Metrics, Benchmark Datasets, and Empirical Comparisons
ERC relies on macro-averaged and weighted F1 scores due to pronounced class imbalance. Datasets such as IEMOCAP, MELD, DailyDialog, and EmoryNLP span dyadic and multi-party dialogues, with 6–8 categorical emotion labels (see summary in (Pereira et al., 2022, Pereira et al., 2023, Van et al., 2024, Agarwal et al., 2021)). Strict experimental setups—5-fold cross-validation, multiple seeds, class-balanced sampling—are standard. State-of-the-art models (e.g., EACL, ConxGNN) report significant advances:
| Model | IEMOCAP wF1 | MELD wF1 | EmoryNLP wF1 | DailyDialog macro-F1 |
|---|---|---|---|---|
| DAG-ERC | 68.03 | 63.65 | 39.02 | 59.33 |
| EmotionIC | 69.61 | 66.32 | 40.25 | 54.19 |
| EACL | 70.41 | 67.12 | 40.24 | — |
| ConxGNN | 68.64 | 65.69 | — | — |
| CD-ERC (k=3) | — | — | — | 51.23 |
| SentEmoContext | — | — | — | 57.71 |
Empirical ablations highlight that contextualization (the context window size k), explicit speaker modeling, anchor-based contrastive learning, and curriculum design drive consistent gains across datasets (Pereira et al., 2023, Yu et al., 2024, Gendron et al., 2024, Yang et al., 2021, Bao et al., 2022).
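The distinction between the two headline metrics matters for reading the table above: macro-F1 averages per-class F1 equally (so rare emotions count fully), while weighted-F1 weights each class by its support (so frequent classes such as neutral dominate). A self-contained sketch of both, computed by hand:

```python
from collections import Counter

# Per-class F1 and the macro vs. weighted averages used in ERC evaluation.
def per_class_f1(gold, pred, label):
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_and_weighted_f1(gold, pred):
    support = Counter(gold)
    f1s = {lab: per_class_f1(gold, pred, lab) for lab in support}
    macro = sum(f1s.values()) / len(f1s)                     # classes weighted equally
    weighted = sum(f1s[lab] * n for lab, n in support.items()) / len(gold)
    return macro, weighted

# Imbalanced toy data: the model misses the single "joy" example entirely.
gold = ["neutral"] * 8 + ["joy", "anger"]
pred = ["neutral"] * 8 + ["neutral", "anger"]
macro, weighted = macro_and_weighted_f1(gold, pred)
# macro penalizes the missed "joy" class far more than weighted does
```

This is why DailyDialog results are conventionally reported as macro-F1 (neutral is overwhelmingly frequent there), while IEMOCAP and MELD results are reported as weighted-F1.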
5. Subjectivity, Annotation, and Distributional Grounding
Annotation subjectivity remains a pivotal concern. Modeling strategies include:
- Soft-Label and Distributional Learning: Directly fitting empirical annotation distributions or modeling uncertainty via Bayesian approaches (Dirichlet prior) yields improved calibration and retention of ambiguous cases, as well as higher AUPR for uncertainty detection (Wu et al., 2022).
- Logical/Relational Inference: Probabilistic Soft Logic (PSL) formulas propagate emotion labels over dialogue structure and text similarity, using neural embeddings as soft evidence; this collective inference corrects noisy or ambiguous human annotations and delivers large gains on class-imbalanced corpora (Augustine et al., 2022).
- Few-Shot and Minority-Class Robustness: Pre-training context-dependent encoders on unsupervised data—e.g., via the Conversation Completion (ConvCom) task—boosts minority-class recognition and generalization to low-resource settings (Jiao et al., 2020, Gendron et al., 2024).
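Soft-label supervision, the first strategy above, replaces the one-hot gold label with the empirical distribution of annotator votes. A minimal sketch of that training target (toy vote counts, illustrative only):

```python
import math

# Sketch of soft-label supervision: the training target is the empirical
# distribution of annotator votes, and the loss is cross-entropy between
# that distribution and the model's predicted distribution.
def soft_label_cross_entropy(pred_dist, annotator_votes):
    total = sum(annotator_votes.values())
    target = {lab: n / total for lab, n in annotator_votes.items()}
    return -sum(p * math.log(pred_dist[lab]) for lab, p in target.items())

# Three of five annotators said "joy", two said "surprise".
votes = {"joy": 3, "surprise": 2}
pred = {"joy": 0.6, "surprise": 0.4}
loss = soft_label_cross_entropy(pred, votes)
# Here pred matches the target exactly, so the loss equals the target's entropy
```

Unlike hard labels, this objective never forces the model to be certain about an utterance the annotators themselves disagreed on, which is the calibration benefit the distributional approaches exploit.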
6. Major Advances, Limitations, and Future Directions
Recent ERC models consistently:
- Leverage context-dependent utterance encoding, often by integrating conversational history directly into the transformer input (Pereira et al., 2023, Kim et al., 2021, Yi et al., 2022).
- Explicitly model speaker identity and dynamic intra-/inter-speaker dependencies for improved emotion prediction, especially in multiparty dialogues (Kim et al., 2021, Bao et al., 2022, Li et al., 2020).
- Address class imbalance, subjective annotation, and ambiguous utterances via contrastive/diversity-enhancing losses, distributional supervision, and logic-based relational models (Yu et al., 2024, Wu et al., 2022, Augustine et al., 2022).
- Achieve new state-of-the-art performance through anchor-guided contrastive learning, multi-scale GNN/hypergraph fusion, curriculum learning, and prompt-based LLM tuning (Yu et al., 2024, Van et al., 2024, Yang et al., 2021, Yi et al., 2022, Fu et al., 2024).
Open problems persist with multimodal fusion (especially for visual/facial cues), real-time and online ERC, continuous emotion representation, and scalable generalization to multilingual, code-mixed, or out-of-domain conversations (Van et al., 2024, Agarwal et al., 2021, Ghosh et al., 2022, Pereira et al., 2022).
Future research directions include hierarchical and dynamic graph construction, adaptive window/context scaling, cross-modal external knowledge infusion (e.g., commonsense graphs), and subjectivity-embracing architectures for more interpretable and uncertainty-aware emotion recognition (Pereira et al., 2022, Fu et al., 2024, Wu et al., 2022, Van et al., 2024, Yi et al., 2022, Yang et al., 2021).