Emotion Profile Refinery (EPR)
- EPR is an iterative segment-level soft-label refinement approach that captures the temporal dynamics and mixtures of emotions in speech.
- It uses pseudo-one-hot regularization and iterative retraining, achieving significant accuracy improvements over standard hard-label methods.
- The method leverages VGG-style CNNs on log-Mel spectrogram segments, enabling detailed analysis of intra-utterance emotion fluctuations.
The Emotion Profile Refinery (EPR) is an iterative, segment-level soft-label refinement procedure designed to address the intrinsic ambiguity and impurity of human emotions in speech. Rather than constraining emotion recognition to a single, hard categorical label per utterance, EPR models the temporal evolution and mixtures of emotions by producing probabilistic emotional profiles across speech segments. By iteratively re-training segment-level classifiers with dynamically updated soft labels—optionally regularized with hard targets—EPR achieves significant accuracy improvements in speech emotion classification, as demonstrated on several benchmark corpora (Mao et al., 2020).
1. Emotional Profiles: Definition and Motivation
The EPR framework is motivated by the recognition that human emotions in speech rarely present as pure, temporally homogeneous states. Emotional impurity refers to the phenomenon where utterances contain mixtures of basic emotions, varying within and across segments. Hard-label annotation, the prevailing paradigm in most speech emotion recognition (SER) systems, fails to capture this intra-utterance variability. EPR instead employs Emotional Profiles (EPs), which consist of a sequence of segment-level probability vectors, each reflecting the soft assignment over emotion classes for a given segment.
For an utterance $u$ segmented into $T$ contiguous segments $s_1, \dots, s_T$, the emotional profile is constructed as:

$$\mathrm{EP}(u) = \big[\, p_1, p_2, \dots, p_T \,\big] \in \mathbb{R}^{K \times T}, \qquad p_t = f_\theta(s_t),$$

where $f_\theta$ is the segment-level classifier, $K$ is the number of emotion classes, and $p_t$ is the softmax posterior for segment $s_t$. Each row of $\mathrm{EP}(u)$ represents the temporal trajectory of the probability assigned to a particular emotion class throughout the utterance. Using EPs enables the system to:
- Capture intra-utterance emotion dynamics,
- Represent soft mixtures rather than enforcing hard categorical choices,
- Provide richer features for downstream utterance-level classification via statistics such as mean, variance, and percentiles of the emotion probability trajectories.
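To make the construction and aggregation concrete, the NumPy sketch below stacks segment posteriors into an EP matrix and reduces it to a fixed-length vector; the function names and the particular percentiles are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def build_ep(segment_posteriors):
    """Stack T segment-level softmax vectors into a K x T emotional profile."""
    return np.asarray(segment_posteriors).T  # rows: classes, columns: segments

def ep_features(ep, percentiles=(25, 50, 75)):
    """Reduce a K x T EP to per-class statistics: mean, std, and percentiles."""
    feats = [ep.mean(axis=1), ep.std(axis=1)]
    feats += [np.percentile(ep, p, axis=1) for p in percentiles]
    return np.concatenate(feats)  # length K * (2 + len(percentiles))

# Example: a 4-class utterance with 3 segments.
ep = build_ep([[0.7, 0.1, 0.1, 0.1],
               [0.5, 0.3, 0.1, 0.1],
               [0.2, 0.6, 0.1, 0.1]])
x = ep_features(ep)  # 20-dimensional utterance-level feature vector
# Such vectors can then be fed to, e.g., sklearn.ensemble.RandomForestClassifier.
```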
2. Emotion Profile Refinery (EPR): Algorithmic Procedure
EPR is formulated as an iterative learning scheme over segment-level classifiers. Each refinement round (a "refinery stage") leverages the predictions from the previous round to update the target labels for each segment, mitigating the limitations of relying solely on hard utterance-level labels.
Algorithmic Steps
The EPR process is specified as follows:
- Initialization: Each utterance $u_i$ with hard label $y_i$ (a one-hot vector over $K$ classes) is segmented into segments $s_{i,1}, \dots, s_{i,T_i}$, and each segment is initially labeled with the pseudo one-hot utterance label $q^{(0)}_{i,t} = y_i$.
- Iterative Refinement ($r = 0$ to $R$):
  a. Train segment-level classifier $f_{\theta^{(r)}}$ on all segments using targets $q^{(r)}_{i,t}$.
  - For $r = 0$: $q^{(0)}_{i,t} = y_i$ (hard label).
  - For $r \geq 1$:
    - Standard EPR (sEPR): $q^{(r)}_{i,t} = f_{\theta^{(r-1)}}(s_{i,t})$ (pure soft label).
    - Pseudo-one-hot EPR (pEPR): $q^{(r)}_{i,t} = \tfrac{1}{2}\big(f_{\theta^{(r-1)}}(s_{i,t}) + y_i\big)$, i.e., summing the prior round's soft output with the original label and renormalizing.
  b. Optimize by minimizing the cross-entropy loss:

  $$\mathcal{L}^{(r)} = -\sum_{i} \sum_{t=1}^{T_i} \sum_{k=1}^{K} q^{(r)}_{i,t,k} \, \log f_{\theta^{(r)}}(s_{i,t})_k$$

  c. After each stage, update all segment predictions $f_{\theta^{(r)}}(s_{i,t})$.
- Output: The final emotional profile for $u_i$ is assembled from the last-stage predictions: $\mathrm{EP}(u_i) = \big[f_{\theta^{(R)}}(s_{i,1}), \dots, f_{\theta^{(R)}}(s_{i,T_i})\big]$.
- Utterance-level Aggregation: EPs are reduced to fixed-length feature vectors using statistics (mean, std, percentiles), subsequently used to train a Random Forest classifier for utterance-level prediction.
This design allows segment-level distributions to evolve over multiple stages, with the pEPR variant providing regularization against label collapse.
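The refinement loop itself is compact. The sketch below assumes a generic `train_classifier(segments, targets)` that minimizes cross-entropy against (possibly soft) targets and exposes `predict_proba`; both are placeholder interfaces, not a documented API.

```python
import numpy as np

def epr_targets(prev_probs, hard_labels, variant="pEPR"):
    """Compute stage-r segment targets from stage-(r-1) predictions.

    prev_probs:  (N, K) softmax outputs of the previous stage.
    hard_labels: (N, K) pseudo one-hot utterance labels, repeated per segment.
    """
    if variant == "sEPR":
        return prev_probs                              # pure soft labels
    mixed = prev_probs + hard_labels                   # pEPR: soft + hard
    return mixed / mixed.sum(axis=1, keepdims=True)    # renormalize (equals /2)

def run_epr(segments, hard_labels, n_stages, train_classifier, variant="pEPR"):
    """Iterate EPR; stage 0 uses hard labels, later stages use refined targets."""
    targets = hard_labels
    for r in range(n_stages + 1):
        clf = train_classifier(segments, targets)      # minimize cross-entropy
        probs = clf.predict_proba(segments)            # update segment predictions
        targets = epr_targets(probs, hard_labels, variant)
    return clf, targets
```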
3. Segment-Level Classifier Architecture
The segment-level classifier operates on short-term log-Mel spectrogram representations ("images") constructed from 32 frames (approximately 335 ms at the 25 ms window and 10 ms hop specified below) and 64 Mel bins. Each segment is processed through a VGG-style convolutional architecture ("configuration E"):
- Conv(64, 3×3) – Conv(64, 3×3) – Pool(2×2)
- Conv(128, 3×3) – Conv(128, 3×3) – Pool(2×2)
- Conv(256, 3×3)×3 – Pool(2×2)
- Conv(512, 3×3)×3 – Pool(2×2)
- Conv(512, 3×3)×3 – Pool(2×2)
- Fully connected (4096) – Fully connected (4096) – Fully connected ($K$, the number of emotion classes) – Softmax
Training employs Adam optimization (lr = 0.001, decay 0.8 every 2 epochs), batch size 128, for up to 20 epochs with early stopping (patience = 3). The cross-entropy loss is reused for segment-level training at each refinement stage.
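The PyTorch sketch below approximates this topology for 1×64×32 inputs (Mel bins × frames); the input orientation, padding, activation placement, and absence of dropout or batch normalization are assumptions where the description above is silent.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 convolutions (padding 1), each with ReLU, then 2x2 pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return layers + [nn.MaxPool2d(2)]

class SegmentVGG(nn.Module):
    """VGG configuration-E style classifier for 1 x 64 x 32 log-Mel segments."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(1, 64, 2),      # 64x32 -> 32x16
            *conv_block(64, 128, 2),    # -> 16x8
            *conv_block(128, 256, 3),   # -> 8x4
            *conv_block(256, 512, 3),   # -> 4x2
            *conv_block(512, 512, 3),   # -> 2x1
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 2 * 1, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # softmax is folded into the loss
        )

    def forward(self, x):                  # x: (batch, 1, 64, 32)
        return self.classifier(torch.flatten(self.features(x), 1))

model = SegmentVGG(num_classes=6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.8)
loss_fn = nn.CrossEntropyLoss()  # accepts probability targets in PyTorch >= 1.10
```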
4. Experimental Protocol and Datasets
EPR was evaluated on three public benchmark datasets:
| Corpus | Language | Utterances | Emotion Classes | Speakers |
|---|---|---|---|---|
| CASIA | Mandarin | 7,200 | 6 | 4 |
| Emo-DB | German | 535 | 7 | 10 |
| SAVEE | English | 480 | 7 | 4 |
Frames were extracted using the STFT (25 ms window, 10 ms hop, 512-point FFT), mapped to 64 log-Mel bins, and grouped into 32-frame segments. Both segment-level training and utterance-level classification used 10-fold cross-validation. Evaluation metrics were Weighted Accuracy (WA) and Unweighted Accuracy (UA); WA = UA on CASIA, whose classes are balanced.
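This front-end can be sketched with librosa as follows; the 16 kHz sampling rate and the non-overlapping 32-frame segmentation are assumptions not pinned down by the protocol above.

```python
import numpy as np
import librosa

def logmel_segments(path, n_mels=64, seg_frames=32, sr=16000):
    """Compute a 64-bin log-Mel spectrogram (25 ms window, 10 ms hop,
    512-point FFT) and split it into non-overlapping 32-frame segments."""
    y, sr = librosa.load(path, sr=sr)             # assumed 16 kHz sampling rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512,
        win_length=int(0.025 * sr),               # 25 ms window
        hop_length=int(0.010 * sr),               # 10 ms hop
        n_mels=n_mels)
    logmel = librosa.power_to_db(mel)             # shape: (64, n_frames)
    n_segs = logmel.shape[1] // seg_frames
    return np.stack([logmel[:, t * seg_frames:(t + 1) * seg_frames]
                     for t in range(n_segs)])     # shape: (n_segs, 64, 32)
```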
5. Quantitative and Qualitative Results
Quantitative Findings
| Model | CASIA (WA/UA) | Emo-DB (WA) | Emo-DB (UA) | SAVEE (WA) | SAVEE (UA) |
|---|---|---|---|---|---|
| Baseline VGG | 93.10% | 83.00% | 82.36% | 70.63% | 69.88% |
| sEPR (soft labels) | 93.67% | — | — | — | — |
| pEPR (hard+soft labels) | 94.83% (2nd) | 88.04% (5th) | — | 77.08% (4th) | — |

Parenthesized ordinals denote the refinement stage at which the best result was achieved; dashes mark values not reported.
- sEPR: Small gains in a single iteration, followed by performance collapse (predictions drift toward uniformity).
- pEPR: Larger, sustained gains; best performance generally achieved at early-intermediate iteration (CASIA, 2nd; Emo-DB, 5th; SAVEE, 4th).
Ablation analysis found that the pseudo-one-hot-assisted variant (pEPR) provides the largest gain, whereas dynamic argmax labels yielded only minor improvement and static soft labels degraded performance. No formal statistical significance testing was reported.
Qualitative Observations
- After two sEPR refinements, class probabilities collapse to $1/K$ (loss of discriminative power).
- pEPR maintains sharp peaks correlated with the ground truth and exhibits plausible temporal emotion fluctuations across segments.
- Post-pEPR, confusion matrices display reduced off-diagonal confusions, e.g., between anger and frustration.
6. Discussion: Advantages, Limitations, and Extensions
EPR leverages fine-grained, segment-level probabilistic supervision, encoding temporal and mixture uncertainty in speech emotion classification. By introducing an iterative learning procedure with hard-label regularization (pseudo-one-hot), EPR prevents over-smoothing and overfitting associated with refining on pure soft labels alone.
Primary limitations include computational overhead from repeated segment-level training, as well as a dependency on 1:1 mixing of hard and soft labels (fixed heuristic weighting). Evaluation is currently limited to acted, small-scale corpora; extension to naturalistic and larger datasets may require further adaptation.
Proposed extensions include:
- Learning the mixing weight between soft and hard labels adaptively per iteration (see the sketch after this list),
- Applying temporal smoothing using sequence models (e.g., RNNs, Transformers) across EP rows,
- End-to-end optimization with utterance-level loss backpropagated to the segment level,
- Generalization to continuous emotion dimensions (arousal/valence) or multi-label targets.
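As one pointer for the first extension above, the pEPR update generalizes to a tunable convex combination; choosing the weight per stage (e.g., on held-out data) goes beyond the cited work and is sketched here only as an assumption.

```python
def mixed_targets(prev_probs, hard_labels, alpha):
    """Generalized pEPR update with mixing weight alpha in [0, 1]:
    alpha = 0.5 recovers the 1:1 pseudo-one-hot rule; alpha = 1 recovers sEPR."""
    return alpha * prev_probs + (1.0 - alpha) * hard_labels  # already sums to 1
```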
7. Connections to Prior Work and Future Research
EPR builds on the emotion profile modeling framework of Mower et al. (2010) and expands earlier end-to-end EP methods (Mao et al., 2019) by introducing supervised, iterative refinement. It distinguishes itself by offering a principled, generalizable approach to capturing emotion uncertainty and impurity at a segmental level. Future research directions include robust training from sparse or noisy annotation, sequence-level modeling, and systematic treatment of emotion mixtures in more complex and ecologically valid speech settings, as well as application to broader affective computing tasks (Mao et al., 2020).