DialogueCRN: Cognitive Emotion Recognition
- DialogueCRN is a neural architecture for Emotion Recognition in Conversations (ERC) that models dual-process cognition by alternating fast, intuitive retrieval with slower, logical reasoning.
- It employs multi-turn reasoning using attention mechanisms and LSTM-based context fusion to integrate both speaker and situation-level cues.
- Empirical evaluations show significant performance gains over sequential and graph-based baselines across IEMOCAP, SEMAINE, and MELD datasets.
DialogueCRN (Dialogue Contextual Reasoning Network) is a neural architecture for Emotion Recognition in Conversations (ERC) that operationalizes dual-process cognitive theory via iterative, multi-turn context reasoning. By explicitly modeling both fast, associative retrieval and slow, logical reasoning over dialogue context, DialogueCRN advances ERC performance, outperforming both sequential and graph-based baselines across diverse benchmarks by substantial margins (Hu et al., 2021). The model is motivated by the observation that effective emotion recognition requires simulating how humans retrieve and reason about emotional clues in conversational history, integrating both situation-level and speaker-level context into the classification process.
1. Cognitive Motivation and Problem Decomposition
DialogueCRN is inspired by the Cognitive Theory of Emotion (Schachter & Singer, 1962; Scherer et al., 2001), positing that emotional inference involves dual cognitive stages: (1) an “intuitive retrieving” phase, characterized by rapid, associative recall of contextually relevant cues, and (2) a “conscious reasoning” phase, where retrieved information is logically organized and updated within working memory. Existing ERC models, such as DialogueRNN and DialogueGCN, often attend once over static context or pool features via simple attention, lacking the iterative recall-and-reasoning trajectory found in human cognition. DialogueCRN addresses this gap by decomposing emotion recognition into alternating retrieval and reasoning steps for each utterance, imitating the human process of sequential cognitive appraisals (Hu et al., 2021).
2. Model Architecture
The DialogueCRN architecture consists of three main stages: (A) utterance-level feature extraction and context memory construction; (B) multi-turn reasoning comprising iterative intuitive and conscious phases; (C) final emotion classification via context fusion.
A. Utterance and Context Representation
- Each utterance is mapped to a sequence of 300-dimensional GloVe embeddings and encoded by a convolutional neural network (CNN) using several filter widths with $50$ feature maps each. Max-pooling and ReLU activation are applied before the features are concatenated and projected to a fixed-size utterance vector $u_i$.
- The sequence of utterance vectors is processed by a bi-directional LSTM, yielding a situation-level context representation $c_i^s$ and hidden state $h_i^s$ for each utterance.
- For each speaker $\lambda$, a separate bi-LSTM encodes the subsequence of utterances spoken by $\lambda$, yielding the speaker-level context $c_i^v$.
- Situation-level and speaker-level contexts are projected to form global memory vectors $g_i^s$ and $g_i^v$, respectively, stored as matrices $G^s$ and $G^v$.
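The speaker-level stream above depends on splitting a dialogue into per-speaker subsequences and then placing each speaker's encoded outputs back at their original positions. The following is a minimal sketch of that bookkeeping only. The helper names `split_by_speaker` and `scatter_back` are hypothetical, and a string-tagging stand-in replaces the bi-LSTM encoder.

```python
from collections import defaultdict


def split_by_speaker(speakers):
    """Map each speaker to the dialogue positions of their utterances, in order."""
    positions = defaultdict(list)
    for idx, speaker in enumerate(speakers):
        positions[speaker].append(idx)
    return dict(positions)


def scatter_back(per_speaker_outputs, positions, n_utterances):
    """Place per-speaker encoder outputs back at their original dialogue positions."""
    merged = [None] * n_utterances
    for speaker, idxs in positions.items():
        for out, idx in zip(per_speaker_outputs[speaker], idxs):
            merged[idx] = out
    return merged


speakers = ["A", "B", "A", "A", "B"]
positions = split_by_speaker(speakers)

# Stand-in for the speaker-level encoder (an assumption for illustration):
# it only tags each utterance with its speaker-local index.
encoded = {s: [f"{s}{j}" for j in range(len(idxs))] for s, idxs in positions.items()}
speaker_context = scatter_back(encoded, positions, len(speakers))

print(positions)        # {'A': [0, 2, 3], 'B': [1, 4]}
print(speaker_context)  # ['A0', 'B0', 'A1', 'A2', 'B1']
```

In a real model the stand-in would be replaced by a recurrent encoder run once per speaker subsequence; the index mapping stays the same.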
B. Multi-Turn Reasoning Module
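For each utterance and each context type (situation-level or speaker-level), DialogueCRN executes $T$ reasoning turns over the corresponding global memory $G = [g_1, \dots, g_N]$. Each turn $t$ repeats the following cycle:
- Intuitive Retrieving: Attention scores $e_{ij}^{(t-1)}$ measure the affinity between the current query state $\tilde{q}_i^{(t-1)}$ and memory slot $g_j$. Attention weights $\alpha_{ij}^{(t-1)}$ are computed via softmax, and the retrieval $r_i^{(t-1)} = \sum_j \alpha_{ij}^{(t-1)} g_j$ aggregates contextual clues.
- Conscious Reasoning: The working memory $h_i^{(t)}$ and query update $\tilde{q}_i^{(t-1)}$ are computed by an LSTM from the previous query $q_i^{(t-1)}$ and working memory $h_i^{(t-1)}$.
- Aggregation: The next query $q_i^{(t)} = [\tilde{q}_i^{(t-1)}; r_i^{(t-1)}]$ fuses reasoned and retrieved information.
The query $q_i^{(0)}$ and working memory $h_i^{(0)}$ are initialized before the first turn. After $T$ turns, the outputs $q_i^s$ and $q_i^v$ capture fused situation-level and speaker-level information, respectively.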
C. Final Classification
The outputs from both reasoning modules are concatenated, $o_i = [q_i^s; q_i^v]$, and passed through a softmax classifier: $\hat{y}_i = \mathrm{softmax}(W_o o_i + b_o)$. Training minimizes the cross-entropy loss over all utterances.
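The classification step can be illustrated with a small, dependency-free sketch: concatenate the two reasoning outputs, apply a linear layer with softmax, and score the prediction with cross-entropy. All vectors, weights, and function names below are made-up illustrative values, not parameters from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(o, W, b):
    """Linear layer followed by softmax: y_hat = softmax(W o + b)."""
    logits = [sum(w * x for w, x in zip(row, o)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


def cross_entropy(batch_probs, labels):
    """Mean negative log-likelihood of the gold labels over all utterances."""
    return -sum(math.log(p[y]) for p, y in zip(batch_probs, labels)) / len(labels)


# Hypothetical situation-level and speaker-level outputs for one utterance.
q_s = [0.5, -1.0]
q_v = [1.5, 0.0]
o = q_s + q_v  # list concatenation plays the role of [q_s; q_v]

# Illustrative classifier weights for three emotion classes.
W = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0], [-0.1, -0.2, -0.3, -0.4]]
b = [0.0, 0.0, 0.0]

probs = classify(o, W, b)
loss = cross_entropy([probs], [0])
print(probs, loss)
```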
3. Mathematical Formulation and Pseudocode
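Key operations for utterance $i$ at turn $t$, with memory slots $g_1, \dots, g_N$, are:
- Attention scoring: $e_{ij}^{(t-1)} = f(g_j, \tilde{q}_i^{(t-1)})$, where $f$ is a matching function such as the dot product
- Softmax weights: $\alpha_{ij}^{(t-1)} = \exp(e_{ij}^{(t-1)}) / \sum_{k=1}^{N} \exp(e_{ik}^{(t-1)})$
- Memory update: $\tilde{q}_i^{(t-1)}, h_i^{(t)} = \mathrm{LSTM}(q_i^{(t-1)}, h_i^{(t-1)})$
- Output and loss: $\hat{y}_i = \mathrm{softmax}(W_o o_i + b_o)$, $\mathcal{L} = -\sum_i \log \hat{y}_{i, y_i}$, normalized by the number of utterances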
All terms and update steps are as described in (Hu et al., 2021).
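One cognitive phase (reasoning with an LSTM, retrieving with attention, then fusing the two) can be sketched in plain Python as below. This is an illustrative outline under stated assumptions, not the authors' implementation: the `LSTMCell` class is a hand-written, bias-free cell, the matching function is taken to be a dot product, the working memory is zero-initialized, and the initial query is built by padding a memory slot. All names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMCell:
    """Minimal LSTM cell without biases (a simplification for this sketch)."""

    def __init__(self, input_dim, hidden_dim, rng):
        def mat():
            return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                    for _ in range(hidden_dim)]
        self.W_i, self.W_f, self.W_o, self.W_g = mat(), mat(), mat(), mat()

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        i = [sigmoid(dot(row, z)) for row in self.W_i]
        f = [sigmoid(dot(row, z)) for row in self.W_f]
        o = [sigmoid(dot(row, z)) for row in self.W_o]
        g = [math.tanh(dot(row, z)) for row in self.W_g]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def cognitive_phase(memory, q0, cell, turns):
    """Run `turns` >= 1 reasoning turns over `memory`; return final query and last weights."""
    d = len(memory[0])
    h, c = [0.0] * d, [0.0] * d  # working memory, zero-initialized (assumption)
    q, alpha = q0, None
    for _ in range(turns):
        # Conscious reasoning: the LSTM updates the working memory and emits q_tilde.
        q_tilde, c = cell.step(q, h, c)
        h = q_tilde
        # Intuitive retrieving: dot-product attention over the global memory.
        alpha = softmax([dot(g, q_tilde) for g in memory])
        r = [sum(a * g[k] for a, g in zip(alpha, memory)) for k in range(d)]
        # Aggregation: the next query is the concatenation [q_tilde; r].
        q = q_tilde + r
    return q, alpha


rng = random.Random(0)
d = 3
memory = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
cell = LSTMCell(input_dim=2 * d, hidden_dim=d, rng=rng)
q0 = memory[2] + [0.0] * d  # hypothetical initial query for utterance 2
q_final, alpha = cognitive_phase(memory, q0, cell, turns=2)
print(len(q_final), alpha)
```

Because the query is a concatenation of the LSTM output and the retrieved vector, the cell takes inputs of size `2 * d` and produces outputs of size `d`, which keeps the query dimension fixed across turns.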
4. Experimental Evaluation and Results
DialogueCRN is evaluated across three public ERC datasets: IEMOCAP (six discrete emotions; dialog length ≈50), SEMAINE (continuous attributes: valence, arousal, expectancy, power), and MELD (seven emotions; multi-party dialogue; average length ≈10). Baseline comparisons include TextCNN, MemNet, bc-LSTM+Att, CMN, ICON, DialogueRNN, and DialogueGCN.
Summary of empirical findings:
| Dataset | Metric | DialogueGCN | DialogueCRN |
|---|---|---|---|
| IEMOCAP | Accuracy | 64.02% | 66.05% |
| IEMOCAP | Weighted F₁ | 63.65% | 66.20% |
| IEMOCAP | Macro F₁ | 63.43% | 66.38% |
| SEMAINE | MAE (arousal) | 0.171 | 0.152 |
| MELD | Accuracy | 58.10% | 60.73% |
| MELD | Weighted F₁ | 55.56% | 58.39% |
| MELD | Macro F₁ | 32.98% | 35.51% |
On IEMOCAP, DialogueCRN improves macro-F₁ over DialogueGCN by 2.95 points (66.38% vs. 63.43%). On MELD, accuracy and weighted F₁ improve by 2.63 and 2.83 points, respectively. On SEMAINE, MAE decreases by 1.1–11.1% in relative terms, depending on the emotion attribute (Hu et al., 2021).
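The gains quoted above follow directly from the reported figures. A short script, using only the numbers in the table, recomputes the absolute point gains and the relative MAE reduction for arousal; the dictionary layout and variable names are arbitrary choices for this illustration.

```python
# (DialogueGCN, DialogueCRN) scores in percent, copied from the results table.
results = {
    ("IEMOCAP", "Accuracy"): (64.02, 66.05),
    ("IEMOCAP", "Weighted F1"): (63.65, 66.20),
    ("IEMOCAP", "Macro F1"): (63.43, 66.38),
    ("MELD", "Accuracy"): (58.10, 60.73),
    ("MELD", "Weighted F1"): (55.56, 58.39),
    ("MELD", "Macro F1"): (32.98, 35.51),
}

# Absolute gain in percentage points (higher is better).
gains = {key: round(crn - gcn, 2) for key, (gcn, crn) in results.items()}

# SEMAINE arousal MAE (lower is better): relative reduction in percent.
mae_gcn, mae_crn = 0.171, 0.152
relative_mae_reduction = round(100 * (mae_gcn - mae_crn) / mae_gcn, 1)

for key, gain in gains.items():
    print(key, gain)
print("SEMAINE arousal MAE reduction (%):", relative_mae_reduction)
```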
5. Ablation Analysis and Model Insights
Systematic ablations establish that:
- Removing the cognitive phase (i.e., disabling multi-turn reasoning) reduces IEMOCAP performance by 4–6% F₁ and significantly worsens SEMAINE MAE.
- Omitting either situation-level or speaker-level cognition lowers accuracy, indicating that both context types contribute.
- Deleting the BiLSTM context memory drops the model to simple TextCNN accuracy, underscoring the necessity for global memory.
- Multi-turn but shallow reasoning performs best: the optimal settings use only a few reasoning turns on both IEMOCAP and SEMAINE, and greater reasoning depth offers no further improvement (Hu et al., 2021).
The iterative design, cycling between fast retrieval and slow reasoning, allows DialogueCRN to extract logical cause-effect chains within conversational context. The model’s dual-stream architecture directly models speaker- and situation-specific influences. Empirical evidence shows that the architecture dynamically selects and aggregates emotional clues, outperforming static and graph-based approaches.
6. Comparative Impact and Theoretical Significance
DialogueCRN advances ERC by integrating cognitive principles of emotion appraisal into neural architectures. By explicitly modeling both intuitive recall and logical reasoning processes, DialogueCRN overcomes the limitations of single-shot attention and static context pooling found in prior models—such as DialogueRNN and DialogueGCN—which do not iteratively refine context or memory. The model demonstrates robust generalization by exceeding the performance of graph-based and sequential baselines across tasks involving both categorical and continuous emotion prediction. This suggests that cognitive-inspired, multi-turn reasoning approaches are effective for extracting causally and contextually grounded emotional triggers within dialogue.
7. Implementation Details and Training Protocol
Utterance encoders use a CNN over GloVe-300 embeddings with three filter widths. BiLSTM layers for perception-phase context construction are two-layer on IEMOCAP/SEMAINE and one-layer on MELD. The cognitive LSTM is single-layer for all datasets. Adam is used for optimization with learning rates of 1e-4 (IEMOCAP) and 1e-3 (SEMAINE, MELD), batch size 32, dropout 0.2, L2 weight decay in the range 2e-4 to 5e-4, and early stopping after 20 epochs without improvement in validation loss. All design choices and their effects are detailed in (Hu et al., 2021).
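The reported settings can be collected into a small configuration object, together with a patience-based early-stopping rule. This is a sketch of one way to organize them: the `TrainConfig` and `EarlyStopping` names are hypothetical, and the L2 weight decay is omitted because only a range is reported, not a per-dataset value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float
    context_lstm_layers: int        # BiLSTM depth for context construction
    cognitive_lstm_layers: int = 1  # single-layer for all datasets
    batch_size: int = 32
    dropout: float = 0.2
    patience: int = 20              # epochs without validation-loss improvement


CONFIGS = {
    "IEMOCAP": TrainConfig(learning_rate=1e-4, context_lstm_layers=2),
    "SEMAINE": TrainConfig(learning_rate=1e-3, context_lstm_layers=2),
    "MELD": TrainConfig(learning_rate=1e-3, context_lstm_layers=1),
}


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Synthetic loss curve: two improving epochs, then a plateau.
stopper = EarlyStopping(CONFIGS["MELD"].patience)
losses = [1.0, 0.9] + [0.95] * 20
stop_epoch = next(i for i, loss in enumerate(losses) if stopper.update(loss))
print(stop_epoch)  # 21
```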