Emotion Profile Refinery (EPR)
- EPR is an iterative segment-level soft-label refinement approach that captures the temporal dynamics and mixtures of emotions in speech.
- It uses pseudo-one-hot regularization and iterative retraining, achieving significant accuracy improvements over standard hard-label methods.
- The method leverages VGG-style CNNs on log-Mel spectrogram segments, enabling detailed analysis of intra-utterance emotion fluctuations.
The Emotion Profile Refinery (EPR) is an iterative, segment-level soft-label refinement procedure designed to address the intrinsic ambiguity and impurity of human emotions in speech. Rather than constraining emotion recognition to a single, hard categorical label per utterance, EPR models the temporal evolution and mixtures of emotions by producing probabilistic emotional profiles across speech segments. By iteratively re-training segment-level classifiers with dynamically updated soft labels—optionally regularized with hard targets—EPR achieves significant accuracy improvements in speech emotion classification, as demonstrated on several benchmark corpora (Mao et al., 2020).
1. Emotional Profiles: Definition and Motivation
The EPR framework is motivated by the recognition that human emotions in speech rarely present as pure, temporally homogeneous states. Emotional impurity refers to the phenomenon where utterances contain mixtures of basic emotions, varying within and across segments. Hard-label annotation, the prevailing paradigm in most speech emotion recognition (SER) systems, fails to capture this intra-utterance variability. EPR instead employs Emotional Profiles (EPs), which consist of a sequence of segment-level probability vectors, each reflecting the soft assignment over emotion classes for a given segment.
For an utterance $u$ segmented into $T$ contiguous segments $s_1, \dots, s_T$, the emotional profile is constructed as:

$$\mathrm{EP}(u) = \big[\, p_1, p_2, \dots, p_T \,\big] \in \mathbb{R}^{K \times T}, \qquad p_t = f_\theta(s_t),$$

where $f_\theta$ is the segment-level classifier, $K$ is the number of emotion classes, and $p_t$ is the softmax posterior for segment $s_t$. Each row of $\mathrm{EP}(u)$ represents the temporal trajectory of the probability assigned to a particular emotion class throughout the utterance. Using EPs enables the system to:
- Capture intra-utterance emotion dynamics,
- Represent soft mixtures rather than enforcing hard categorical choices,
- Provide richer features for downstream utterance-level classification via statistics such as mean, variance, and percentiles of the emotion probability trajectories.
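To make the construction and aggregation concrete, the NumPy sketch below stacks segment posteriors into an EP matrix and reduces it to a fixed-length vector; the function names and the particular percentiles are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def build_ep(segment_posteriors):
    """Stack T segment-level softmax vectors into a K x T emotional profile."""
    return np.asarray(segment_posteriors).T  # rows: classes, columns: segments

def ep_features(ep, percentiles=(25, 50, 75)):
    """Reduce a K x T EP to per-class statistics: mean, std, and percentiles."""
    feats = [ep.mean(axis=1), ep.std(axis=1)]
    feats += [np.percentile(ep, p, axis=1) for p in percentiles]
    return np.concatenate(feats)  # length K * (2 + len(percentiles))

# Example: a 4-class utterance with 3 segments.
ep = build_ep([[0.7, 0.1, 0.1, 0.1],
               [0.5, 0.3, 0.1, 0.1],
               [0.2, 0.6, 0.1, 0.1]])
x = ep_features(ep)  # 20-dimensional utterance-level feature vector
# Such vectors can then be fed to, e.g., sklearn.ensemble.RandomForestClassifier.
```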
2. Emotion Profile Refinery (EPR): Algorithmic Procedure
EPR is formulated as an iterative learning scheme over segment-level classifiers. Each refinement round (a "refinery stage") leverages the predictions from the previous round to update the target labels for each segment, mitigating the limitations of relying solely on hard utterance-level labels.
Algorithmic Steps
The EPR process is specified as follows:
- Initialization: Each utterance $u_i$ with hard label $y_i$ (a one-hot vector over $K$ classes) is segmented into segments $s_{i,1}, \dots, s_{i,T_i}$, and each segment is initially labeled with the pseudo one-hot utterance label $q^{(0)}_{i,t} = y_i$.
- Iterative Refinement ($r = 0$ to $R$):
  a. Train segment-level classifier $f_{\theta^{(r)}}$ on all segments using targets $q^{(r)}_{i,t}$.
  - For $r = 0$: $q^{(0)}_{i,t} = y_i$ (hard label).
  - For $r \geq 1$:
    - Standard EPR (sEPR): $q^{(r)}_{i,t} = f_{\theta^{(r-1)}}(s_{i,t})$ (pure soft label).
    - Pseudo-one-hot EPR (pEPR): $q^{(r)}_{i,t} = \tfrac{1}{2}\big(f_{\theta^{(r-1)}}(s_{i,t}) + y_i\big)$, i.e., summing the prior round's soft output with the original label and renormalizing.
  b. Optimize by minimizing the cross-entropy loss:

  $$\mathcal{L}^{(r)} = -\sum_{i} \sum_{t=1}^{T_i} \sum_{k=1}^{K} q^{(r)}_{i,t,k} \, \log f_{\theta^{(r)}}(s_{i,t})_k$$

  c. After each stage, update all segment predictions $f_{\theta^{(r)}}(s_{i,t})$.
- Output: The final emotional profile for $u_i$ is assembled from the last-stage predictions: $\mathrm{EP}(u_i) = \big[f_{\theta^{(R)}}(s_{i,1}), \dots, f_{\theta^{(R)}}(s_{i,T_i})\big]$.
- Utterance-level Aggregation: EPs are reduced to fixed-length feature vectors using statistics (mean, std, percentiles), subsequently used to train a Random Forest classifier for utterance-level prediction.
This design allows segment-level distributions to evolve over multiple stages, with the pEPR variant providing regularization against label collapse.
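The refinement loop itself is compact. The sketch below assumes a generic `train_classifier(segments, targets)` that minimizes cross-entropy against (possibly soft) targets and exposes `predict_proba`; both are placeholder interfaces, not a documented API.

```python
import numpy as np

def epr_targets(prev_probs, hard_labels, variant="pEPR"):
    """Compute stage-r segment targets from stage-(r-1) predictions.

    prev_probs:  (N, K) softmax outputs of the previous stage.
    hard_labels: (N, K) pseudo one-hot utterance labels, repeated per segment.
    """
    if variant == "sEPR":
        return prev_probs                              # pure soft labels
    mixed = prev_probs + hard_labels                   # pEPR: soft + hard
    return mixed / mixed.sum(axis=1, keepdims=True)    # renormalize (equals /2)

def run_epr(segments, hard_labels, n_stages, train_classifier, variant="pEPR"):
    """Iterate EPR; stage 0 uses hard labels, later stages use refined targets."""
    targets = hard_labels
    for r in range(n_stages + 1):
        clf = train_classifier(segments, targets)      # minimize cross-entropy
        probs = clf.predict_proba(segments)            # update segment predictions
        targets = epr_targets(probs, hard_labels, variant)
    return clf, targets
```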
3. Segment-Level Classifier Architecture
The segment-level classifier operates on short-term log-Mel spectrogram representations ("images") constructed from 32 frames (approximately 335 ms at the 25 ms window and 10 ms hop specified below) and 64 Mel bins. Each segment is processed through a VGG-style convolutional architecture ("configuration E"):
- Conv(64, 3×3) – Conv(64, 3×3) – Pool(2×2)
- Conv(128, 3×3) – Conv(128, 3×3) – Pool(2×2)
- Conv(256, 3×3)×3 – Pool(2×2)
- Conv(512, 3×3)×3 – Pool(2×2)
- Conv(512, 3×3)×3 – Pool(2×2)
- Fully connected (4096) – Fully connected (4096) – Fully connected ($K$, the number of emotion classes) – Softmax
Training employs Adam optimization (lr = 0.001, decay 0.8 every 2 epochs), batch size 128, for up to 20 epochs with early stopping (patience = 3). The cross-entropy loss is reused for segment-level training at each refinement stage.
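The PyTorch sketch below approximates this topology for 1×64×32 inputs (Mel bins × frames); the input orientation, padding, activation placement, and absence of dropout or batch normalization are assumptions where the description above is silent.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 convolutions (padding 1), each with ReLU, then 2x2 pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return layers + [nn.MaxPool2d(2)]

class SegmentVGG(nn.Module):
    """VGG configuration-E style classifier for 1 x 64 x 32 log-Mel segments."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(1, 64, 2),      # 64x32 -> 32x16
            *conv_block(64, 128, 2),    # -> 16x8
            *conv_block(128, 256, 3),   # -> 8x4
            *conv_block(256, 512, 3),   # -> 4x2
            *conv_block(512, 512, 3),   # -> 2x1
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 2 * 1, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # softmax is folded into the loss
        )

    def forward(self, x):                  # x: (batch, 1, 64, 32)
        return self.classifier(torch.flatten(self.features(x), 1))

model = SegmentVGG(num_classes=6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.8)
loss_fn = nn.CrossEntropyLoss()  # accepts probability targets in PyTorch >= 1.10
```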
4. Experimental Protocol and Datasets
EPR was evaluated on three public benchmark datasets:
| Corpus | Language | Utterances | Emotion Classes | Speakers |
|---|---|---|---|---|
| CASIA | Mandarin | 7,200 | 6 | 4 |
| Emo-DB | German | 535 | 7 | 10 |
| SAVEE | English | 480 | 7 | 4 |
Frames were extracted using the STFT (25 ms window, 10 ms hop, 512-point FFT), mapped to 64 log-Mel bins, and grouped into 32-frame segments. Both segment-level training and utterance-level classification used 10-fold cross-validation. Evaluation metrics were Weighted Accuracy (WA) and Unweighted Accuracy (UA); WA = UA on CASIA, whose classes are balanced.
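This front-end can be sketched with librosa as follows; the 16 kHz sampling rate and the non-overlapping 32-frame segmentation are assumptions not pinned down by the protocol above.

```python
import numpy as np
import librosa

def logmel_segments(path, n_mels=64, seg_frames=32, sr=16000):
    """Compute a 64-bin log-Mel spectrogram (25 ms window, 10 ms hop,
    512-point FFT) and split it into non-overlapping 32-frame segments."""
    y, sr = librosa.load(path, sr=sr)             # assumed 16 kHz sampling rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512,
        win_length=int(0.025 * sr),               # 25 ms window
        hop_length=int(0.010 * sr),               # 10 ms hop
        n_mels=n_mels)
    logmel = librosa.power_to_db(mel)             # shape: (64, n_frames)
    n_segs = logmel.shape[1] // seg_frames
    return np.stack([logmel[:, t * seg_frames:(t + 1) * seg_frames]
                     for t in range(n_segs)])     # shape: (n_segs, 64, 32)
```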
5. Quantitative and Qualitative Results
Quantitative Findings
| Model | CASIA (WA/UA) | Emo-DB (WA) | Emo-DB (UA) | SAVEE (WA) | SAVEE (UA) |
|---|---|---|---|---|---|
| Baseline VGG | 93.10% | 83.00% | 82.36% | 70.63% | 69.88% |
| sEPR (soft labels) | 93.67% | — | — | — | — |
| pEPR (hard+soft labels) | 94.83% (2nd) | 88.04% (5th) | — | 77.08% (4th) | — |

Parenthesized ordinals denote the refinement stage at which the best result was achieved; dashes mark values not reported.
- sEPR: Small gains in a single iteration, followed by performance collapse (predictions drift toward uniformity).
- pEPR: Larger, sustained gains; best performance generally achieved at early-intermediate iteration (CASIA, 2nd; Emo-DB, 5th; SAVEE, 4th).
Ablation analysis found that the pseudo-one-hot-assisted variant (pEPR) provides the largest gain, whereas dynamic argmax labels yielded only minor improvement and static soft labels degraded performance. No formal statistical significance testing was reported.
Qualitative Observations
- After two sEPR refinements, class probabilities collapse to $1/K$ (loss of discriminative power).
- pEPR maintains sharp peaks correlated with the ground truth and exhibits plausible temporal emotion fluctuations across segments.
- Post-pEPR, confusion matrices display reduced off-diagonal confusions, e.g., between anger and frustration.
6. Discussion: Advantages, Limitations, and Extensions
EPR leverages fine-grained, segment-level probabilistic supervision, encoding temporal and mixture uncertainty in speech emotion classification. By introducing an iterative learning procedure with hard-label regularization (pseudo-one-hot), EPR prevents over-smoothing and overfitting associated with refining on pure soft labels alone.
Primary limitations include computational overhead from repeated segment-level training, as well as a dependency on 1:1 mixing of hard and soft labels (fixed heuristic weighting). Evaluation is currently limited to acted, small-scale corpora; extension to naturalistic and larger datasets may require further adaptation.
Proposed extensions include:
- Learning the mixing weight between soft and hard labels adaptively per iteration (see the sketch after this list),
- Applying temporal smoothing using sequence models (e.g., RNNs, Transformers) across EP rows,
- End-to-end optimization with utterance-level loss backpropagated to the segment level,
- Generalization to continuous emotion dimensions (arousal/valence) or multi-label targets.
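As one pointer for the first extension above, the pEPR update generalizes to a tunable convex combination; choosing the weight per stage (e.g., on held-out data) goes beyond the cited work and is sketched here only as an assumption.

```python
def mixed_targets(prev_probs, hard_labels, alpha):
    """Generalized pEPR update with mixing weight alpha in [0, 1]:
    alpha = 0.5 recovers the 1:1 pseudo-one-hot rule; alpha = 1 recovers sEPR."""
    return alpha * prev_probs + (1.0 - alpha) * hard_labels  # already sums to 1
```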
7. Connections to Prior Work and Future Research
EPR builds on the emotion profile modeling framework of Mower et al. (2010) and expands earlier end-to-end EP methods (Mao et al., 2019) by introducing supervised, iterative refinement. It distinguishes itself by offering a principled, generalizable approach to capturing emotion uncertainty and impurity at a segmental level. Future research directions include robust training from sparse or noisy annotation, sequence-level modeling, and systematic treatment of emotion mixtures in more complex and ecologically valid speech settings, as well as application to broader affective computing tasks (Mao et al., 2020).