
MDRE for Speech Emotion Recognition

Updated 20 January 2026
  • The paper introduces MDRE, demonstrating that dual GRU encoders leveraging MFCCs, prosodic cues, and word embeddings significantly enhance emotion recognition performance.
  • The architecture fuses audio and text features by processing MFCCs and prosodic vectors alongside pre-trained GloVe embeddings, capturing multimodal emotional cues.
  • Empirical results on IEMOCAP show MDRE achieving 71.8% accuracy, outperforming unimodal baselines and remaining robust even under moderate ASR errors.

The Multimodal Dual Recurrent Encoder (MDRE) is a deep neural architecture for speech emotion recognition, designed to integrate and jointly process both acoustic and linguistic modalities. Developed to address the limitations of unimodal speech emotion recognition systems that rely either on audio or text alone, MDRE leverages dual gated recurrent unit (GRU) networks to extract temporal representations from Mel-frequency cepstral coefficient (MFCC) features, prosodic cues, and transcript-based word embeddings. The fusion of signal-level and language-level features enhances emotion classification, as demonstrated by state-of-the-art results on the IEMOCAP benchmark for four-way emotion labeling (angry, happy, sad, neutral) (Yoon et al., 2018).

1. Architecture and Workflow

MDRE adopts a dual-stream architecture:

  • The audio stream receives input features $x^\mathrm{a} = (x_1^\mathrm{a}, \ldots, x_{T_a}^\mathrm{a})$, including 39-dimensional MFCC vectors (12 static coefficients, log-energy, $\Delta$, $\Delta\Delta$) per 25 ms window (10 ms hop), supplemented by a 35-dimensional prosodic vector $p$ (including $F_0$, voicing probability, and loudness contours). Sequence length varies per utterance up to a maximum of $T_a = 750$ frames.
  • The text stream processes word tokens $w = (w_1, \ldots, w_{T_t})$, embedded via pretrained 300-dimensional GloVe vectors. The vocabulary size is approximately 3,747, including the special tokens _PAD_ and _UNK_, with utterances padded or truncated to $T_t = 128$.
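The fixed-length batching described above can be sketched as a minimal padding/truncation helper. This is an illustrative numpy sketch, not the paper's code; the utterance lengths (412 frames, 17 tokens) are hypothetical.

```python
import numpy as np

T_A, T_T = 750, 128   # max audio frames / max text tokens, per the paper
N_MFCC = 39           # 12 static coefficients + log-energy, plus deltas and delta-deltas

def pad_or_truncate(seq: np.ndarray, max_len: int) -> tuple[np.ndarray, int]:
    """Zero-pad (or truncate) along the time axis; also return the true length."""
    true_len = min(len(seq), max_len)
    out = np.zeros((max_len,) + seq.shape[1:], dtype=seq.dtype)
    out[:true_len] = seq[:true_len]
    return out, true_len

# Hypothetical utterance: 412 MFCC frames and 17 word tokens
mfcc = np.random.randn(412, N_MFCC)
tokens = np.random.randint(2, 3747, size=(17,))

mfcc_padded, n_frames = pad_or_truncate(mfcc, T_A)
tok_padded, n_tokens = pad_or_truncate(tokens, T_T)
print(mfcc_padded.shape, n_frames)   # (750, 39) 412
print(tok_padded.shape, n_tokens)    # (128,) 17
```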

Each stream is encoded by an independent unidirectional GRU:

$$h_t^\mathrm{a} = \mathrm{GRU}_\mathrm{a}(x_t^\mathrm{a}, h_{t-1}^\mathrm{a}), \quad 1 \leq t \leq T_a,\ h_0^\mathrm{a} = 0$$

$$h_t^\mathrm{t} = \mathrm{GRU}_\mathrm{t}(x_t^\mathrm{t}, h_{t-1}^\mathrm{t}), \quad 1 \leq t \leq T_t,\ h_0^\mathrm{t} = 0$$

The final states $h_{T_a}^\mathrm{a}$ and $h_{T_t}^\mathrm{t}$ are extracted. For the audio stream, the final state is concatenated with the prosodic vector $p$ to yield $e = [h_{T_a}^\mathrm{a}; p]$.
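The recurrence above can be made concrete with an explicit GRU cell. The following numpy sketch uses random weights and illustrative dimensions (hidden size 128, sequence length 20 are assumptions, not values from the paper); a real implementation would use a framework's GRU layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b each stack the update/reset/candidate parameters."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

# Encode a toy audio sequence: d_in = 39 (MFCC dim), d_h = 128 (illustrative)
rng = np.random.default_rng(0)
d_in, d_h, T = 39, 128, 20
W = [rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(3)]
U = [rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(3)]
b = [np.zeros(d_h) for _ in range(3)]

h = np.zeros(d_h)                       # h_0 = 0, as in the equations above
for x_t in rng.normal(size=(T, d_in)):
    h = gru_cell(x_t, h, W, U, b)
print(h.shape)  # (128,) -- the final state h_T used downstream
```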

A single-layer feed-forward transformation is applied to each representation:

$$A = g_a(e) = \tanh(W_a e + b_a)$$

$$T = g_t(h_{T_t}^\mathrm{t}) = \tanh(W_t h_{T_t}^\mathrm{t} + b_t)$$

These are concatenated to form the fused feature:

$$h^{\mathrm{fusion}} = [A; T] \in \mathbb{R}^{2d}$$

A classification head applies a fully connected layer and softmax:

$$\hat{y} = \mathrm{softmax}\left( (h^{\mathrm{fusion}})^\top M + b \right), \quad M \in \mathbb{R}^{2d \times C},\ C = 4$$
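Putting the projection, fusion, and classification equations together, a forward-pass sketch with random weights looks as follows. Only $p \in \mathbb{R}^{35}$ and $C = 4$ come from the paper; the GRU state size (200) and projection width $d = 128$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128                                      # width of each projected representation
h_a = rng.normal(size=200)                   # final audio GRU state h_{T_a}^a
p   = rng.normal(size=35)                    # utterance-level prosodic vector
h_t = rng.normal(size=200)                   # final text GRU state h_{T_t}^t

e = np.concatenate([h_a, p])                 # e = [h_{T_a}^a ; p]

W_a = rng.normal(scale=0.1, size=(d, e.size))
W_t = rng.normal(scale=0.1, size=(d, h_t.size))
b_a, b_t = np.zeros(d), np.zeros(d)

A = np.tanh(W_a @ e + b_a)                   # audio projection g_a(e)
T = np.tanh(W_t @ h_t + b_t)                 # text projection g_t(h_{T_t}^t)
h_fusion = np.concatenate([A, T])            # [A; T] in R^{2d}

C = 4                                        # angry / happy / sad / neutral
M = rng.normal(scale=0.1, size=(2 * d, C))
b = np.zeros(C)
logits = h_fusion @ M + b
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                         # softmax over the 4 emotion classes
print(y_hat.shape, round(y_hat.sum(), 6))    # (4,) 1.0
```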

2. Audio and Text Feature Processing

Audio feature extraction combines spectral (MFCC) and prosodic information, capturing both short-term spectral dynamics and longer-term suprasegmental features. MFCCs are computed with a 25 ms window and 10 ms hop, yielding up to 750 frames per utterance. The prosodic vector is computed per utterance, incorporating fundamental frequency ($F_0$), voicing probability, and loudness contours. At each time step, $x_t^\mathrm{a} \in \mathbb{R}^{39}$; the prosodic vector $p \in \mathbb{R}^{35}$ is appended at the sequence level.
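The frame budget follows from simple windowing arithmetic. A 16 kHz sampling rate is assumed here for illustration (this summary does not specify it):

```python
# Frame count for a 25 ms window and 10 ms hop at an assumed 16 kHz rate
sr = 16_000
win = int(0.025 * sr)       # 400 samples per analysis window
hop = int(0.010 * sr)       # 160 samples between successive windows

def n_frames(n_samples: int) -> int:
    """Number of full analysis windows that fit in the signal."""
    return 1 + max(0, (n_samples - win) // hop)

# Roughly 7.5 s of speech approaches the T_a = 750 frame cap:
print(n_frames(int(7.5 * sr)))  # 748
```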

Text input is tokenized (NLTK), mapped to GloVe embeddings, and processed via a GRU with matched dimensionality to the audio encoder. The embedding dimension is fixed at 300. Padding or truncation ensures fixed-length input per batch.
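A minimal token-to-id pipeline with the _PAD_/_UNK_ conventions can be sketched as below. The six-word vocabulary is a toy stand-in (the actual model uses NLTK tokenization and a ~3,747-word vocabulary; plain `str.split` is used here to stay self-contained).

```python
# Toy vocabulary including the special tokens described above
vocab = {"_PAD_": 0, "_UNK_": 1, "i": 2, "feel": 3, "so": 4, "happy": 5}
T_T = 128  # fixed text length per the paper

def encode(utterance: str) -> list[int]:
    """Map words to ids (unknowns -> _UNK_), then pad/truncate to T_T."""
    ids = [vocab.get(w, vocab["_UNK_"]) for w in utterance.lower().split()]
    ids = ids[:T_T]
    return ids + [vocab["_PAD_"]] * (T_T - len(ids))

ids = encode("I feel so incredibly happy")
print(ids[:6], len(ids))  # [2, 3, 4, 1, 5, 0] 128
```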

3. Fusion Strategy and Classifier

MDRE's fusion mechanism concatenates the transformed outputs from audio and text encoders. Each stream's representation passes through its respective feed-forward (tanh) nonlinearity, allowing task-specific feature shaping before fusion. Fused vectors are classified by a softmax layer corresponding to four emotion categories. The model is regularized by dropout (0.3–0.5) at both GRU and feed-forward layers and by weight decay ($\leq 10^{-5}$). The loss function is cross-entropy:

$$\mathcal{L} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}$$
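The loss can be checked numerically on toy one-hot targets (the two example distributions are invented for illustration):

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Summed cross-entropy over N examples and C classes, one-hot targets."""
    return float(-(y_true * np.log(y_pred + eps)).sum())

# Two toy utterances, C = 4 emotion classes
y_true = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])

# -ln(0.7) - ln(0.5)
print(round(cross_entropy(y_true, y_pred), 4))  # 1.0498
```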

4. Training Protocol and Experimental Details

Training employs the IEMOCAP dataset, consisting of 5,531 utterances across 5 sessions and 10 speakers, annotated as angry (1,636), happy (1,103), sad (1,084), or neutral (1,708). The standard five-fold cross-validation protocol uses an 80:5:15 train:dev:test split per fold. Model optimization is performed with Adam ($\mathrm{lr} \approx 10^{-3}$), batch size around 32, for approximately 30–50 epochs. GRU weights are orthogonally initialized; GloVe embeddings are used for the text stream.
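The 80:5:15 ratio translates to the per-fold counts below. This is index shuffling only; the paper's actual fold assignment over sessions and speakers is not reproduced here.

```python
# Per-fold 80:5:15 train/dev/test split over 5,531 IEMOCAP utterances
import random

n = 5531
idx = list(range(n))
random.Random(42).shuffle(idx)   # seed is arbitrary, for reproducibility

n_train = int(0.80 * n)          # 4424 utterances
n_dev = int(0.05 * n)            # 276 utterances
train = idx[:n_train]
dev = idx[n_train:n_train + n_dev]
test = idx[n_train + n_dev:]     # remaining 831 utterances (~15%)
print(len(train), len(dev), len(test))  # 4424 276 831
```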

5. Empirical Results and Ablation Studies

Performance is assessed on IEMOCAP using both accuracy and weighted average precision (WAP). MDRE achieves a WAP of $0.718 \pm 0.019$ (71.8%), surpassing the prior state of the art (68.8%). Comparisons with single-modality baselines show:

| System | WAP / Accuracy |
| --- | --- |
| ARE (audio only) | $0.546 \pm 0.009$ |
| TRE (text only) | $0.635 \pm 0.018$ |
| MDRE | $0.718 \pm 0.019$ |

With ASR-generated transcripts ($\mathrm{WER} \approx 5.53\%$), the MDRE-ASR variant yields $0.691 \pm 0.019$. Analysis of confusion matrices indicates that audio-only (ARE) models tend to confuse happy and neutral, while text-only (TRE) models sometimes conflate sad and happy (~16%). MDRE reduces sad-to-happy misclassification to ~9% and improves all diagonal entries of the confusion matrix.

6. Significance and Implications

MDRE demonstrates that fusing low-level acoustic and high-level textual features with dual GRU encoders can comprehensively represent the multimodal nature of emotional expression in speech. Its superior performance over unimodal models indicates the complementary value of prosodic and lexical information. The model generalizes well even with moderate ASR transcription error, as evidenced by strong MDRE-ASR results.

A plausible implication is that encoding and fusing heteromodal time-series via parallel sequence models can benefit other multimodal sequence classification tasks beyond speech emotion recognition, wherever both signal and language structure are essential for robust inference.

7. Limitations and Future Directions

All findings and configurations are established on IEMOCAP, leaving generalization to more spontaneous, in-the-wild corpora as an open question. The current architecture employs simple concatenation fusion; more sophisticated attention or gating mechanisms could potentially enhance cross-modal integration. As MDRE relies on accurate transcript alignment, high word-error rates in ASR may limit robustness. Extending to multi-language, larger-vocabulary, or real-time deployment scenarios will require further empirical validation.


For further details, see "Multimodal Speech Emotion Recognition Using Audio and Text" (Yoon et al., 2018).
