
AffectSpeech Dataset Overview

Updated 3 June 2026
  • AffectSpeech is a large-scale, multi-corpus dataset of 253,799 utterances annotated along six affective dimensions for speech emotion captioning and synthesis.
  • It employs a human–LLM collaborative annotation pipeline, combining algorithmic pre-labeling with expert verification to ensure high-quality, multi-style descriptive labels.
  • Complementary variants, including an EMG corpus for silent speech and iMiGUE-Speech for spontaneous interviews, extend its applicability in affective research.

AffectSpeech refers to several distinct resources in affective speech research. The three primary datasets associated with the name share a focus on fine-grained annotation and robust evaluation for affective and emotional speech modeling, but differ substantially in scale, modality, annotation scheme, and target applications. The principal AffectSpeech corpus, introduced in 2026, is a large-scale dataset for speech emotion captioning and synthesis (Qi et al., 5 Apr 2026). The term has also been applied to (i) a surface EMG corpus for silent and phonated affective speech (Pistrosch et al., 12 Mar 2026), and (ii) the iMiGUE-Speech dataset, a spontaneous, naturalistic resource for speech-driven affective analysis (Kakouros et al., 25 Feb 2026). This article details these resources, with an emphasis on the principal large-scale corpus, and provides technical comparisons and context.

1. Large-Scale AffectSpeech Corpus: Composition and Annotation Schema

The principal AffectSpeech dataset is a large-scale corpus unifying 253,799 English-language utterances with detailed, multi-dimensional natural language annotations (Qi et al., 5 Apr 2026). Its construction aggregates speech from nine public corpora (SAVEE, RAVDESS, eNTERFACE, TESS, CREMA-D, ESD, MEAD, MESS, subsets of VCTK-Corpus and LibriTTS), resulting in a balanced gender split (approximately 50% female/male). Utterances span adult speakers across acted and spontaneous settings.

Each utterance is annotated along six complementary dimensions:

  • Sentiment Polarity: {positive, negative, neutral}; serves as a coarse-grained valence anchor.
  • Open-Vocabulary Emotion Captions: Free-form descriptions, e.g., “She speaks with trembling urgency, betraying her underlying fear,” enabling nuanced affective labeling.
  • Emotion Intensity: Discrete scale (1–5) reflecting expressivity strength.
  • Prosodic Attributes: Quantized F₀ (fundamental frequency), speaking rate, and short-time energy, each labeled as {low, normal, high} based on corpus-wide statistics.
  • Prominent Segment: Temporal interval within the utterance (beginning, medial, final, entire, unclear) where affect is most salient.
  • Semantic Content: Top emotion-related keywords per utterance.
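The quantization of prosodic attributes into {low, normal, high} can be illustrated with a minimal sketch. The source only states that labels are based on corpus-wide statistics, so the use of tertile cut points here is an assumption, as are the function name `make_quantizer` and the toy F₀ values. The actual thresholds (and whether they are computed per gender) are not specified.

```python
from statistics import quantiles


def make_quantizer(corpus_values):
    """Return a function mapping a raw value to 'low', 'normal', or 'high'.

    Assumption: bin edges are corpus-wide tertiles. The source does not
    state which statistics define the three levels.
    """
    lower, upper = quantiles(corpus_values, n=3)  # two tertile cut points

    def quantize(value):
        if value < lower:
            return "low"
        if value > upper:
            return "high"
        return "normal"

    return quantize


# Hypothetical mean-F0 values (Hz) standing in for a whole corpus.
f0_values = [110, 125, 140, 160, 175, 190, 210, 230, 250]
quantize_f0 = make_quantizer(f0_values)

labels = [quantize_f0(v) for v in (115, 175, 245)]
print(labels)
```

The same helper could be applied to speaking rate and short-time energy by building one quantizer per attribute.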

A distinguishing feature is the systematic reformulation of each core annotation into six descriptive styles: narrative, profiling, synopsis, bullet points, technical prose, and hierarchical (“structural”).

2. Human–LLM Collaborative Annotation Pipeline

Annotation employs a human–LLM collaborative protocol that maximizes quality and diversity while maintaining scalability (Qi et al., 5 Apr 2026):

  1. Algorithmic Pre-Labeling: Extraction of prosodic features (pitch, tempo, energy), preliminary emotion coordinates (valence, arousal, dominance via pretrained SER model), and ingestion of metadata.
  2. Multi-LLM Description Generation: Prompting ensemble LLMs (e.g., GPT-based and Qwen-based) to generate full-dimension captions and attribute values for each utterance.
  3. Adjudication and Human Verification: An LLM-based adjudicator extracts “keypoints” (e.g., sentiment, segment, prosody) from each candidate and checks them for consistency. If the candidates agree, the more fluent one is retained; if they disagree, expert raters select the better candidate, or correct or regenerate the caption when both fail. Verification outcomes show 64.9% direct consensus, 28.9% manual selection, and 6.2% manual correction over 253,799 utterances.
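The routing logic of the adjudication step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Candidate` class, the `fluency` score, the exact-match comparison of keypoints, and the decision names are all hypothetical. The source does not describe how agreement or fluency is measured.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    caption: str
    keypoints: dict   # e.g. {"sentiment": ..., "segment": ...}; extracted upstream
    fluency: float    # hypothetical fluency score, higher is better


def adjudicate(a, b):
    """Route a pair of LLM candidates.

    Returns (decision, caption). The caption is None when the pair must go
    to an expert rater for selection or correction.
    Assumption: "agreement" means the extracted keypoints match exactly.
    """
    if a.keypoints == b.keypoints:
        best = a if a.fluency >= b.fluency else b
        return "direct_consensus", best.caption
    return "human_review", None


a = Candidate("She sounds tense and hurried.",
              {"sentiment": "negative", "segment": "beginning"}, 0.82)
b = Candidate("Her voice is tense, hurried.",
              {"sentiment": "negative", "segment": "beginning"}, 0.74)
c = Candidate("She sounds calm.",
              {"sentiment": "neutral", "segment": "entire"}, 0.90)

agree = adjudicate(a, b)
disagree = adjudicate(a, c)
print(agree, disagree)
```

Under this reading, the "direct consensus" share reported above corresponds to pairs that never reach a human rater.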

3. Statistical Properties and Coverage

AffectSpeech offers broad coverage of affective dimensions:

| Dimension | Statistics / Key Properties |
| --- | --- |
| Utterances | 253,799 |
| Speakers | Balanced ≈50% female/male; adult; multiple corpora |
| Emotion Labels | 9 categories: angry, disgust, fear, happy, neutral, sad, surprise, contempt, calm |
| Sentiment | P(positive) ≈ 0.35; P(negative) ≈ 0.30; P(neutral) ≈ 0.35 |
| Intensity | Clustered at levels 2–3 (“subtle” expression) |
| Prosody | F₀ ~175 Hz (male), ~250 Hz (female); tempo mean ≈3.2 words/s |
| Segment Salience | Beginning 28.2%; medial 24.7%; final 22.4%; entire 21.4% |
| Utterance Length | 95% < 15.2 s; median ≈3.8 s |
| Annotations | 1,522,794 caption variants (6 per utterance) |

These distributions suggest robust sampling of both coarse and fine-grained affective variables. Multi-style annotation (narrative, technical, etc.) is designed to discourage stylistic overfitting in downstream models.
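A few of the reported figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this article; the variable names are illustrative. Note that the four listed segment-salience shares do not sum to 100%. The remainder is not broken out in the table, and attributing it to the "unclear" segment value would be an assumption.

```python
N_UTTERANCES = 253_799
N_STYLES = 6

# Six caption styles per utterance.
caption_variants = N_UTTERANCES * N_STYLES

# Segment salience shares (percent) as listed in the statistics table.
segment_share = {"beginning": 28.2, "medial": 24.7, "final": 22.4, "entire": 21.4}
listed_total = round(sum(segment_share.values()), 1)
unlisted = round(100.0 - listed_total, 1)

# Verification outcomes (percent) from the annotation pipeline.
verification = {"direct_consensus": 64.9, "manual_selection": 28.9,
                "manual_correction": 6.2}
verification_total = round(sum(verification.values()), 1)

print(caption_variants, listed_total, unlisted, verification_total)
```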

4. Data Format, Licensing, and Example Entries

All data are stored in JSON per utterance. Each record encodes audio path, speaker metadata, categorical and continuous affective attributes, prosody, segmental prominence, keyword lexicon, and enriched annotation texts in six styles. Sample schema (abbreviated):

{
  "id": "utt_000123",
  "audio_path": ".../utt_000123.wav",
  "speaker_id": "spk_045",
  "gender": "female",
  "emotion_label": "fear",
  "sentiment": "negative",
  "intensity": 4,
  "prosody": { "pitch": "high", "tempo": "normal", "energy": "high" },
  "prominent_segment": "beginning",
  "keywords": ["trembling", "urgency"],
  "captions": {
    "narrative": "...",
    "profile": "...",
    "synopsis": "...",
    "bullet": [ ... ],
    "technical": "...",
    "structural": "..."
  }
}
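A record following the abbreviated schema above could be checked with a small validator. This is a minimal sketch: the allowed values are taken from the annotation dimensions described in this article, while the function name `validate_record`, the placeholder caption strings, and the assumption that all six caption keys are always present are illustrative.

```python
import json

ALLOWED = {
    "sentiment": {"positive", "negative", "neutral"},
    "prominent_segment": {"beginning", "medial", "final", "entire", "unclear"},
}
PROSODY_LEVELS = {"low", "normal", "high"}
STYLES = {"narrative", "profile", "synopsis", "bullet", "technical", "structural"}


def validate_record(record):
    """Return a list of problems found in one utterance record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED.items():
        if record.get(field) not in allowed:
            problems.append(f"{field}: unexpected value {record.get(field)!r}")
    intensity = record.get("intensity")
    if not (isinstance(intensity, int) and 1 <= intensity <= 5):
        problems.append("intensity: expected an integer from 1 to 5")
    for name, level in record.get("prosody", {}).items():
        if level not in PROSODY_LEVELS:
            problems.append(f"prosody.{name}: unexpected level {level!r}")
    missing = STYLES - set(record.get("captions", {}))
    if missing:
        problems.append(f"captions: missing styles {sorted(missing)}")
    return problems


# Placeholder record; caption texts are dummies, not real dataset content.
sample = json.loads("""
{
  "id": "utt_000123", "sentiment": "negative", "intensity": 4,
  "prosody": {"pitch": "high", "tempo": "normal", "energy": "high"},
  "prominent_segment": "beginning",
  "captions": {"narrative": "x", "profile": "x", "synopsis": "x",
               "bullet": ["x"], "technical": "x", "structural": "x"}
}
""")
ok_problems = validate_record(sample)
bad_problems = validate_record({**sample, "intensity": 7, "sentiment": "mixed"})
print(ok_problems, bad_problems)
```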

Audio licensing follows the terms of each original corpus (e.g., CC BY-NC for CREMA-D, CC BY for LibriTTS); the annotations are redistributed under CC BY 4.0.

5. Evaluation Protocols and Use Cases

Recommended tasks and evaluation protocols include:

  • Speech Emotion Captioning (SEC): Supervised fine-tuning of multimodal LLMs to generate free-form text from speech input.
  • Controllable Emotional Speech Synthesis (ESS): Text-based conditioning for expressive TTS.
  • Affective Analysis: Studies of prosodic–semantic interaction via multi-granular annotation.
  • Transfer Learning: Source for rich paralinguistic supervision in downstream SER, TTS, and retrieval tasks (Qi et al., 5 Apr 2026).
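For the SEC task, one plausible way to turn records into supervised fine-tuning pairs is to sample a caption style per example, so that a model sees all six styles. The sketch below is hypothetical: the prompt wording, the function name `make_sec_example`, and the output fields are assumptions, and the source does not describe its training data format.

```python
import random


def make_sec_example(record, rng):
    """Build one (prompt, target) pair from a record, sampling a caption style.

    Assumption: the prompt names the requested style; the real prompt
    template is not given in the source.
    """
    style = rng.choice(sorted(record["captions"]))
    caption = record["captions"][style]
    if isinstance(caption, list):  # the "bullet" style is stored as a list
        caption = "\n".join(f"- {item}" for item in caption)
    return {
        "audio_path": record["audio_path"],
        "prompt": f"Describe the speaker's emotion in {style} style.",
        "target": caption,
        "style": style,
    }


# Placeholder record with dummy captions.
record = {
    "audio_path": "utt_demo.wav",
    "captions": {"narrative": "A tense, hurried delivery.",
                 "bullet": ["negative sentiment", "high pitch"]},
}
rng = random.Random(0)
examples = [make_sec_example(record, rng) for _ in range(20)]
styles_seen = {ex["style"] for ex in examples}
print(styles_seen)
```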

6. Alternative Datasets Named "AffectSpeech"

6.1. Surface EMG AffectSpeech Dataset

The SAIL-TUM Corpus on Affective Speech EMG (“AffectSpeech”/ST-CASE) (Pistrosch et al., 12 Mar 2026) addresses affect in articulatory muscle signals during both phonated and silent (non-vocalized) speech.

  • Participants: 12 adults (9 female, 3 male) with C2 English proficiency; 2,780 utterances (phonated and silent).
  • Modalities: 8-channel facial/neck surface EMG (10 kHz), synchronized audio (for phonated trials).
  • Affect Conditions: Politeness, frustration, neutrality; labeled by task design (Tasks 1 and 3) or expert rater (Task 2; Krippendorff’s α ≈ 0.65–0.75).
  • Features: Handcrafted summary statistics (rectified, RMS, spectral), time-domain bands, and learned 128-D per-channel embeddings.
  • Evaluation: Intra-subject 5-fold cross-validation; inter-subject LOSO; various articulation-mode transfer tasks.
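The two evaluation regimes listed above can be sketched with plain index splitting. This is an illustration only: the helper names, the toy subject IDs, and the unshuffled fold assignment are assumptions, and the source does not specify how folds were drawn.

```python
def loso_splits(subject_ids):
    """Yield (held_out, train_idx, test_idx) for leave-one-subject-out."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def kfold_splits(indices, k=5):
    """Yield (train_idx, test_idx) for k folds over one subject's trials.

    Assumption: folds are interleaved without shuffling.
    """
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Toy setup: 12 subjects with 10 trials each (trial counts are made up).
subject_ids = [f"s{n:02d}" for n in range(1, 13) for _ in range(10)]

inter_subject = list(loso_splits(subject_ids))
s01_trials = [i for i, s in enumerate(subject_ids) if s == "s01"]
intra_subject = list(kfold_splits(s01_trials, k=5))
print(len(inter_subject), len(intra_subject))
```

With 12 participants, LOSO yields 12 train/test splits, each testing on a speaker never seen in training.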

The primary applications are affect-aware silent speech interfaces, multimodal emotion recognition integrating EMG, and study of neuromuscular correlates of paralinguistic cues.

6.2. iMiGUE-Speech (AffectSpeech) Dataset

iMiGUE-Speech (Kakouros et al., 25 Feb 2026), also referred to as AffectSpeech, is the first public corpus of spontaneous press-conference speech for affective analysis:

  • Composition: 359 press interviews (5–15 min), 72 athletes from 28 nations.
  • Affect Labels: Derived weakly from tennis match outcome (win/lose); no manual emotion annotation.
  • Modalities: Single-channel audio, ASR (Whisper-Large), diarization (pyannote.audio), word/phone alignment (Montreal Forced Aligner), speaker-role attribution (athlete/journalist).
  • Benchmarks: Speech emotion recognition (SER) using Wav2Vec 2.0 and WavLM; transcript-based sentiment analysis (RoBERTa); comparison of group-level affect statistics (arousal, dominance, valence).
  • Multimodality: Linkable to the iMiGUE gesture dataset for synchronized micro-gesture analysis.
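The weak-labeling scheme described above (affect derived from match outcome, with speaker roles from diarization) can be sketched as follows. The field names, the `win`/`lose` to `positive`/`negative` mapping labels, and the toy segments are assumptions for illustration; the dataset's actual file layout is not described here.

```python
# Assumed label names; the source only says labels derive from win/lose outcome.
OUTCOME_TO_VALENCE = {"win": "positive", "lose": "negative"}


def weak_label_athlete_segments(segments, match_outcomes):
    """Attach an outcome-derived valence label to each athlete segment.

    Journalist segments are skipped, since the match outcome describes
    the athlete's situation rather than the interviewer's.
    """
    labeled = []
    for seg in segments:
        if seg["role"] != "athlete":
            continue
        outcome = match_outcomes[seg["interview_id"]]
        labeled.append({**seg, "valence": OUTCOME_TO_VALENCE[outcome]})
    return labeled


# Toy data; texts and IDs are placeholders.
segments = [
    {"interview_id": "iv1", "role": "journalist", "text": "How do you feel?"},
    {"interview_id": "iv1", "role": "athlete", "text": "It was a great match."},
    {"interview_id": "iv2", "role": "athlete", "text": "I struggled today."},
]
match_outcomes = {"iv1": "win", "iv2": "lose"}

labeled = weak_label_athlete_segments(segments, match_outcomes)
print([(s["interview_id"], s["valence"]) for s in labeled])
```

Every segment in an interview inherits the same label, which is the coarse granularity noted in this section.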

A key advantage is spontaneous, unscripted emotional expressivity in naturalistic high-stakes settings; however, label granularity remains coarse (valence by match outcome), and there are domain and transcriptional limitations.

7. Comparative Considerations and Research Impact

AffectSpeech (large-scale) stands out as the first corpus unifying large-scale emotional speech with free-form, multi-style, multi-attribute captioning anchored in verified human–LLM annotation (Qi et al., 5 Apr 2026). The EMG and iMiGUE-Speech variants serve more specialized domains: sensor-based affect analysis (EMG) and spontaneous sport interview affect (iMiGUE-Speech), each promoting research into affective expression beyond acted datasets (cf. IEMOCAP, EMO-DB).

AffectSpeech resources collectively facilitate advances in:

  • SEC/ESS, by enabling open-vocabulary, style-diversified modeling;
  • Multimodal and sensor-vocal affect fusion;
  • Robust paralinguistics research, including silent speech and spontaneous affect contexts;
  • Transfer learning, due to coverage of diverse demographics, affective states, annotation styles, and modalities.

The integration of high-quality, fine-grained, and stylistically diverse descriptive annotations with large-scale, multi-source audio positions AffectSpeech as a cornerstone for the next generation of affective speech technologies and research directions.
