Switchboard-Affect (SWB-Affect)
- SWB-Affect is an annotated speech emotion dataset derived from the Switchboard corpus, capturing both categorical and dimensional emotion cues in natural conversations.
- It integrates dual emotion representation by combining Ekman’s categorical model with dimensional valence, activation, and dominance scales for nuanced analysis.
- The dataset employs lexical and paralinguistic cue analysis and benchmarks SER models, advancing research in real-world human-computer interaction.
Switchboard-Affect (SWB-Affect) is an annotated speech emotion dataset derived from the Switchboard-1 Release 2 corpus, capturing both categorical and dimensional emotion perception in naturalistic telephone conversations. The SWB-Affect framework serves as a benchmark and resource for research in speech emotion recognition (SER), explicitly addressing challenges arising from the use of spontaneous, non-acted conversational speech and providing carefully curated label sets oriented toward real-world applications.
1. Dataset Origins and Annotation Protocol
SWB-Affect is based on the Switchboard-1 Release 2 corpus, comprising 260 hours of speech from 543 speakers, representative of unconstrained conversational English. From this source, 10,000 segments (approx. 25 hours) were selected under strict criteria: segments with poor audio quality were removed, very short utterances (under 5 seconds or 5 words) were excluded, and segments longer than 15 seconds were avoided because emotion tends to shift over longer spans. To keep neutral speech from dominating the sample, selection was stratified over bins derived from model-inferred activation, valence, and dominance levels.
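The bin-based stratification described above can be sketched as follows; the segment records, bin granularity, and per-bin quota here are hypothetical, not the project's actual parameters.

```python
import random
from collections import defaultdict

def stratified_select(segments, per_bin=5, seed=0):
    """Group segments by model-inferred (valence, activation, dominance)
    bins and sample up to `per_bin` segments from each bin, limiting
    the dominance of any single region of the affect space."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for seg in segments:
        # Discretize each 1-5 dimensional score into low/mid/high bins.
        key = tuple(min(int(seg[dim]) // 2, 2) for dim in ("val", "act", "dom"))
        bins[key].append(seg)
    selected = []
    for key, members in sorted(bins.items()):
        rng.shuffle(members)
        selected.extend(members[:per_bin])
    return selected
```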
Annotations were performed by trained crowd graders certified via agreement with a gold standard set of 20 segments and guided by a detailed protocol. Each audio segment received ratings from six independent annotators for both categorical emotion selection and dimensional Likert-scale judgments. The categorical scheme incorporates Ekman’s seven universal emotions (anger, disgust, fear, happiness, sadness, surprise, contempt), “tenderness,” “calmness,” and “neutral.” Dimensional annotation targets activation, valence, and dominance, each on a 5-point scale:
$\begin{array}{c|c|c} \textbf{Dimension} & \textbf{Low (1)} & \textbf{High (5)} \\ \hline \text{Valence} & \text{Negative} & \text{Positive} \\ \text{Activation} & \text{Drained} & \text{Energized} \\ \text{Dominance} & \text{Weak} & \text{Strong} \end{array}$
Annotators were instructed to consider both lexical content (“what” is said) and paralinguistic features (“how” it is said), with voice descriptors supplied for each category.
Krippendorff’s alpha was the principal metric for inter-annotator agreement, with values around 0.25 for primary categorical selections and 0.40–0.53 for dimensional ratings, underscoring the inherent complexity and subjectivity of perceptual emotion labeling.
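For reference, nominal Krippendorff's alpha over per-unit label lists can be computed as below; this is a minimal sketch, and the project's exact treatment of multi-label selections and missing ratings may differ.

```python
from collections import Counter, defaultdict

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha from a list of units, each a list of
    the labels assigned by its annotators (missing ratings simply omitted).
    alpha = 1 - observed_disagreement / expected_disagreement."""
    coincidence = defaultdict(float)   # (label_c, label_k) -> coincidence count
    totals = Counter()                 # label -> marginal count n_c
    for labels in units:
        m = len(labels)
        if m < 2:
            continue                   # unpaired units contribute nothing
        for i, c in enumerate(labels):
            for j, k in enumerate(labels):
                if i != j:
                    coincidence[(c, k)] += 1.0 / (m - 1)
    for (c, _), v in coincidence.items():
        totals[c] += v
    n = sum(totals.values())
    observed = sum(v for (c, k), v in coincidence.items() if c != k)
    expected = sum(totals[c] * totals[k] / (n - 1)
                   for c in totals for k in totals if c != k)
    return 1.0 - observed / expected
```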
2. Dimensional and Categorical Model Integration
The SWB-Affect framework, consistent with foundational work on affect representation (Lyons, 2017), leverages the compatibility between dimensional and categorical models. Dimensional representations, such as valence-arousal-dominance spaces, permit fine-grained modeling of affect:
- Each input $x$ (e.g., a speech segment or facial image) is mapped via $f: x \mapsto \mathbf{v} \in \mathbb{R}^d$ (or $\mathbb{R}^3$ for VAD), where $\mathbf{v}$ encodes the affect dimensions.
- Categorical assignment is achieved by clustering within this space: given cluster centers $\{\mathbf{c}_k\}$, assignment uses $\hat{k} = \arg\min_k \lVert \mathbf{v} - \mathbf{c}_k \rVert$.
This dual representation informs both the creation of interpretable affective labels and the development of robust recognition systems. Dimensionality reduction applied to perceptual data facilitates the emergence of recognizable emotion categories, allowing systems to operate in both regression and classification regimes.
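A nearest-centroid assignment of this kind can be illustrated as follows; the centroid coordinates are invented for illustration and are not derived from SWB-Affect.

```python
import numpy as np

# Illustrative centroids in (valence, activation, dominance) space, on 1-5 scales.
CENTROIDS = {
    "happiness": np.array([4.5, 4.0, 3.5]),
    "sadness":   np.array([1.5, 1.5, 2.0]),
    "anger":     np.array([1.5, 4.5, 4.5]),
    "neutral":   np.array([3.0, 3.0, 3.0]),
}

def assign_category(vad):
    """Assign the emotion category whose centroid is nearest (Euclidean)
    to the segment's (valence, activation, dominance) coordinates."""
    vad = np.asarray(vad, dtype=float)
    return min(CENTROIDS, key=lambda c: np.linalg.norm(vad - CENTROIDS[c]))
```

This supports both regimes mentioned above: a regressor predicts the continuous VAD point, and the centroid rule converts it to a categorical label.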
3. Lexical and Paralinguistic Cue Analysis
The SWB-Affect project systematically analyzed the role of lexical and paralinguistic information in affect perception.
Lexical cues were evaluated by prompting the GPT-4o model with transcript-based tasks, yielding average detection probabilities across emotions. Fear achieved the highest average probability, while calmness, tenderness, and happiness registered substantially lower values, indicating only a modest lexical contribution for these classes.
Paralinguistic cues were extracted using features such as pitch, loudness, rhythm (word and pause rate), and spectral centroid via the Librosa toolkit. Happy and surprised expressions generally exhibited increased pitch and loudness, while sadness correlated with reduced pitch and energy. Anger, contempt, and disgust, though commonly confused, showed elevated loudness; only anger and contempt reliably correlated with raised pitch and speaking rate, while contempt and disgust were marked by more frequent pausing. Statistical significance was assessed by Wilcoxon testing and multiple comparison correction.
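Two of the features named above, loudness (RMS) and spectral centroid, can be computed directly with numpy as in this sketch; SWB-Affect used the Librosa toolkit's equivalents, and the frame parameters here are arbitrary choices.

```python
import numpy as np

def frame_features(y, sr, frame_len=1024, hop=512):
    """Per-frame RMS loudness and spectral centroid of a mono signal `y`
    at sample rate `sr`, computed with a Hann window and rFFT."""
    rms, centroid = [], []
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    window = np.hanning(frame_len)
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * window
        rms.append(np.sqrt(np.mean(frame ** 2)))
        mag = np.abs(np.fft.rfft(frame))
        # Magnitude-weighted mean frequency; epsilon guards silent frames.
        centroid.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    return np.array(rms), np.array(centroid)
```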
This suggests that successful SER systems must capture the differential contributions of acoustic and lexical cues, particularly for subtle affective states.
4. Speech Emotion Recognition Model Evaluation
Several state-of-the-art SER models were benchmarked against SWB-Affect:
- Emotion2Vec: Universal representation, pre-trained on unlabeled emotional data.
- Audeering W2V2: Based on wav2vec 2.0, fine-tuned on MSP-Podcast labels.
- Odyssey: Built on WavLM, trained on MSP-Podcast consensus labels.
- Whisper-GRU: GRU network over frozen Whisper embeddings.
- GPT-4o: Zero-shot, audio-based inference via multitask prompting.
In the categorical task, GPT-4o achieved the highest overall F1 average (0.391), with especially strong performance for anger, sadness, and fear. Odyssey delivered the highest recall but underpredicted neutral samples. A consistent challenge emerged for recognition of anger, with marked declines in F1 and recall compared with performance on acted datasets such as MSP-Podcast. This poor generalization is attributed to attenuated expression of anger cues in natural conversation.
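The macro-averaged F1 used to summarize categorical performance can be computed as in this minimal sketch; the label names are illustrative.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores; a class with no true or
    predicted instances contributes an F1 of 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for lab in labels:
        tp = np.sum((y_pred == lab) & (y_true == lab))
        fp = np.sum((y_pred == lab) & (y_true != lab))
        fn = np.sum((y_pred != lab) & (y_true == lab))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```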
Dimensional prediction was assessed using Lin’s Concordance Correlation Coefficient (CCC); Odyssey and Audeering W2V2 performed best on valence and activation/dominance, respectively.
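Lin's CCC has a simple closed form over predicted and reference ratings; a minimal numpy sketch:

```python
import numpy as np

def ccc(x, y):
    """Lin's Concordance Correlation Coefficient between two rating
    sequences: 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()              # population variances
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson correlation, CCC penalizes scale and location bias, so a model whose predictions are perfectly correlated but systematically shifted still scores below 1.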
5. Applications in Human-Computer Interaction and Dialog Systems
The SWB-Affect resource informs the development of affect-sensitive dialog and interaction systems by facilitating nuanced analysis and affect-aware synthesis. In dialog generation, techniques such as AffectON (Bucinca et al., 2020) steer LLM output toward target affect vectors using affective lexicons, continuous affective spaces, and probabilistic fusion schemes:
Given candidate word coordinates $\mathbf{a}_w$ and a target affect $\mathbf{a}^*$, selection is shaped by the distance
$$d(w) = \lVert \mathbf{a}_w - \mathbf{a}^* \rVert$$
and fusion of model probability distributions:
$$P(w) \propto (1 - \lambda)\, P_{\text{LM}}(w) + \lambda\, P_{\text{aff}}(w),$$
where $P_{\text{LM}}$ is the LLM’s output, $P_{\text{aff}}$ is the affect-derived probability, and $\lambda$ tunes affect strength.
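A minimal sketch of this kind of affect-steered fusion, assuming a linear interpolation between the LM distribution and a distance-derived affect distribution; the exponential distance weighting and the temperature `tau` are assumptions for illustration, not AffectON's exact scheme.

```python
import numpy as np

def affect_steered_probs(p_lm, word_vad, target_vad, lam=0.5, tau=1.0):
    """Blend the LM's next-word distribution with an affect-derived
    distribution that favors words whose VAD coordinates lie close to
    the target point. `lam` tunes affect strength (0 = pure LM)."""
    p_lm = np.asarray(p_lm, float)
    dists = np.linalg.norm(
        np.asarray(word_vad, float) - np.asarray(target_vad, float), axis=1)
    p_aff = np.exp(-dists / tau)       # nearer words get higher mass
    p_aff /= p_aff.sum()
    p = (1 - lam) * p_lm + lam * p_aff
    return p / p.sum()
```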
Evaluation protocols for subjective affect, employing 0–100 sliders with dimension-specific presentation, provide a rigorous means to assess generated dialog against human perception. The empirical finding that affective modulation shifts raters’ perception significantly toward the targeted affect values substantiates the utility of SWB-Affect-derived affect representations in interactive NLP systems.
6. Implications, Research Directions, and Resource Importance
The findings associated with SWB-Affect have several implications for SER research. Spontaneous, naturalistic speech datasets reveal shortcomings in existing models, particularly in the recognition of emotions such as anger, which are more subtly expressed than in acted corpora. This suggests that model architectures should explicitly integrate both lexical and paralinguistic modalities, and that multimodal fusion strategies may advance the state of the art.
The granular release of annotations—including consensus, individual, and guideline-derived labels—positions SWB-Affect as a critical benchmark for evaluating SER in authentic communicative contexts. Transparency in annotation and documented grader training, coupled with publicly available labels, invites robust comparative studies, model refinement, and exploration of demographic modulations (e.g., effects of speaker sex, age, and conversational context on emotion perception).
A plausible implication is that the deployment of affect-sensitive dialog systems in real-world applications should involve ongoing calibration against datasets such as SWB-Affect, which more accurately reflect the complexities and ambiguities of natural expression.
7. Visualizations and Statistical Metrics
The dataset is accompanied by distributional analyses and visualizations:
- Label distributions highlight the frequency of primary and secondary emotion assignments.
- Co-occurrence matrices reveal inter-rater confusion, with notable overlap between anger and contempt.
- Acoustic feature comparisons statistically differentiate emotion classes using Wilcoxon tests and Benjamini–Hochberg correction.
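The Benjamini–Hochberg step-up procedure used for multiple-comparison correction can be sketched as follows:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected under the
    Benjamini-Hochberg false discovery rate procedure: reject all
    p-values up to the largest rank i with p_(i) <= alpha * i / m."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[:k + 1]] = True
    return reject
```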
Agreement statistics underscore the subjective variability of emotion labeling, with moderate concordance for dimensional ratings and lower agreement for categorical selections. These metrics inform best practice both in dataset curation and in downstream evaluation, promoting rigorous standards for SER benchmarking in academic and applied settings.