Spontaneous Conversational Speech Corpus

Updated 23 December 2025
  • Spontaneous conversational speech corpora are collections of unscripted dialogue recordings and transcriptions capturing natural conversational phenomena such as turn-taking, disfluencies, and overlaps.
  • They employ diverse methodologies like in-the-wild sampling, multi-channel far-field recordings, and open licensing to ensure wide speaker, dialect, and environmental representation.
  • These corpora underpin advancements in ASR, TTS, dialogue systems, and sociolinguistic research by providing real-world data with precise annotations and reproducible evaluation metrics.

A spontaneous conversational speech corpus is a systematically collected and annotated set of audio recordings and transcriptions that capture unscripted, interactive spoken language as it naturally occurs in dialogue or multi-party conversation. Unlike read or scripted speech datasets, these corpora are specifically designed to represent the complexities of real-world conversational phenomena: disfluencies, overlaps, turn-taking, prosodic variation, code-switching, and adaptation to social or environmental context. Such resources are foundational for the development and benchmarking of robust automatic speech recognition (ASR), speech synthesis (TTS), dialogue modeling systems, and sociolinguistic research.

1. Corpus Design Principles and Methodologies

Spontaneous conversational speech corpora are distinguished by protocols that deliberately avoid artificiality in both elicitation and recording. Collection methodologies include:

  • Unscripted Dialogue Elicitation: Conversation prompts are typically open-ended topics rather than word-for-word scripts (e.g., meeting discussions in LOTUSDIS (Tipaksorn et al., 23 Sep 2025), game-based negotiation in ding-01 (Kang et al., 18 Aug 2025), dyadic talk in CASPER (Xiao et al., 30 May 2025)).
  • Naturalistic and In-the-wild Sampling: Some projects ingest entire podcast and YouTube domains or radio talk shows to maximize speaker, topic, and dialect diversity (e.g., J-CHAT (Nakata et al., 22 Jul 2024), SwissGPC v1.0 (Stucki et al., 24 Sep 2025)).
  • Multi-channel and Far-field Recording: To capture environmental robustness, corpora like LOTUSDIS use multiple microphones of varying types and positions, preserving device coloration, reverberation, and noise without array processing (Tipaksorn et al., 23 Sep 2025).
  • Task-oriented and Multi-modal Scenarios: For studying task-oriented dialogue, corpora may incorporate multi-modal signals (audio, video, eye-gaze) and collaborative tasks (Spot the Difference corpus (Lopes et al., 2018)).
  • Prompt Diversification: Strategies include image-elicited description (SPIRE-SIES (Singh et al., 2023)), scenario role-play, and environmental variation (SaSLaW (Take et al., 13 Aug 2024)).

Key components in corpus construction are precise speaker metadata, explicit annotation of overlaps/disfluencies, and, where possible, environmental factors.
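To make these components concrete, the following is a minimal sketch of what a per-utterance record might look like. The field names are illustrative only and are not drawn from any particular corpus's release format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Utterance:
    """One transcribed turn in a hypothetical conversational corpus.

    Field names are illustrative, not taken from any specific corpus.
    """
    utterance_id: str
    speaker_id: str
    start_sec: float                    # onset within the session recording
    end_sec: float                      # offset within the session recording
    text: str                           # verbatim transcript, fillers included
    overlaps_with: list[str] = field(default_factory=list)  # IDs of overlapping utterances
    mic_channel: Optional[str] = None   # e.g., "lapel", "far-field-2m"
    speaker_age: Optional[int] = None
    speaker_region: Optional[str] = None
    environment: Optional[str] = None   # e.g., "meeting room", "outdoor"
```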

2. Annotation Schemes and Transcription Conventions

Annotation protocols for spontaneous conversational speech are more complex than those for read speech due to the need to capture:

  • Disfluencies and Fillers: Verbatim transcription of hesitations, fillers (“uh,” “um”), repetitions, repairs, and interrupted starts is standard (LOTUSDIS, SPIRE-SIES, SaSLaW, CORAA).
  • Overlap and Turn-taking: Overlapping speech is annotated via explicit overlap masks (LOTUSDIS: <td> tags, dual audio tracks in (Zhou et al., 4 Sep 2025)), turn-aligned transcripts, and speaker change markers.
  • Non-verbal and Paralinguistic Events: Laughter, sighs, coughs, backchannels, and even visible uncertainty may be coded, especially in multimodal corpora (Lopes et al., 2018, Zhou et al., 4 Sep 2025).
  • Sociolinguistic and Contextual Metadata: Detailed speaker demographics (age, gender, region), topic domains, recording devices, and environmental context are included (SwissGPC, Lahjoita puhetta, SPIRE-SIES).
  • Specialized Tags: Corpora may use corpus-specific tagging schemas—e.g., SGML-style tags for disfluencies in SPIRE-SIES (〈fp〉…〈/fp〉), “tone box” for tone annotation in Isan (Na-Thalang et al., 26 Nov 2025), <sc> for speaker changes in BEA-Dialogue (Gedeon et al., 17 Nov 2025).

Transcription may be orthographic, phonemic, or multimodal (e.g., plain text + IPA + video).
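As an illustration of how such markup is consumed downstream, the sketch below strips or extracts filled-pause spans from a transcript line. The tag names (`<fp>`, `<sc>`) loosely follow the conventions cited above, but they are assumptions for the example; each corpus defines its own tag inventory and escaping rules.

```python
import re

# Hypothetical SGML-style markup loosely modelled on the conventions above.
sample = "so <sc> uh <fp>um</fp> we we met on <fp>uh</fp> Tuesday"

def strip_disfluency_tags(line: str) -> str:
    """Produce a 'clean' transcript by removing filled-pause spans."""
    line = re.sub(r"<fp>.*?</fp>", "", line)   # drop filler content
    line = re.sub(r"</?\w+>", "", line)        # drop remaining bare tags like <sc>
    return re.sub(r"\s+", " ", line).strip()

def extract_fillers(line: str) -> list[str]:
    """Collect filler tokens for disfluency-rate statistics."""
    return re.findall(r"<fp>(.*?)</fp>", line)

print(strip_disfluency_tags(sample))  # "so uh we we met on Tuesday"
print(extract_fillers(sample))        # ["um", "uh"]
```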

3. Representative Corpora and Key Statistics

A broad array of spontaneous conversational speech corpora covers diverse languages, domains, and technical designs:

| Corpus | Language(s) | Size (h) | Speakers | Domains | Annotation Features |
|--------|-------------|----------|----------|---------|---------------------|
| J-CHAT (Nakata et al., 22 Jul 2024) | Japanese | 68,892 | ~20,000 | YouTube, podcasts | Speaker diarization; large-scale automated pipeline |
| CASPER (Xiao et al., 30 May 2025) | English (global) | 158 | 208 | Peer dialogues | Disfluencies, overlaps, topic metadata |
| SwissGPC v1.0 (Stucki et al., 24 Sep 2025) | Swiss German (+ others) | 4,979 | N/A | Podcasts, talk shows | Dialect ID; phoneme-level weak annotation |
| LOTUSDIS (Tipaksorn et al., 23 Sep 2025) | Thai | 114 | 86 | Meetings (far-field) | Mic-diverse, overlap-rich; ASR benchmark |
| SaSLaW (Take et al., 13 Aug 2024) | Japanese | 2 | 8 | Audio-visual dialogue | Hearing noise, audio-visual, prosodic/statistical features |
| Lahjoita puhetta (Moisio et al., 2022) | Finnish | 3,600 | 20,269 | Donate Speech app | Sociolinguistic metadata; disfluency tags |
| SPIRE-SIES (Singh et al., 2023) | Indian English | 162+ | 1,607 | Image prompts | Disfluency, nativity, VAD, code-switch metadata |
| BEA-Large/Dialogue (Gedeon et al., 17 Nov 2025) | Hungarian | 255/85 | 433/188 | Interview, dialogue | Overlap, speaker role, SOT (speaker order), metadata |

Each corpus contributes a distinctive design: dual-track full-duplex recording (Zhou et al., 4 Sep 2025), environmental adaptation (SaSLaW), code-switch handling (ASCEND (Lovenia et al., 2021), Isan (Na-Thalang et al., 26 Nov 2025)), spontaneous TTS (Guo et al., 2020, Zhou et al., 4 Sep 2025), or multimodal annotation (Spot the Difference (Lopes et al., 2018)).

4. Benchmarking and Evaluation Metrics

Corpus releases now routinely include reproducible ASR and TTS baselines, with detailed error metrics:

  • Word Error Rate: $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = reference words (a minimal implementation follows this list). Reported WERs on spontaneous speech are substantially higher than on read speech: on LOTUSDIS (Thai, far-field), zero-shot WER is 81.57%, dropping to 49.54% after fine-tuning on far-field microphones (Tipaksorn et al., 23 Sep 2025).
  • Disfluency Robustness: Baseline systems are stress-tested on utterances dense in overlaps or disfluencies (LOTUSDIS, BEA-Dialogue).
  • Diarization Error Rate (DER): $\mathrm{DER} = \frac{\text{false alarm} + \text{missed speech} + \text{speaker confusion}}{\text{total reference speech time}}$; used in corpora with multi-party conversations and explicit speaker-attribution labels (SwissGPC, BEA-Dialogue).
  • TTS Metrics: Objective metrics such as Spectrum_l2, F0 Wasserstein distance, energy and zero-crossing rates, as well as mean opinion score (MOS) from subjective listening tests, are now standard (Zhou et al., 4 Sep 2025, Take et al., 13 Aug 2024).
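The WER definition above reduces to a word-level Levenshtein alignment. Below is a minimal reference implementation, assuming whitespace tokenization and no text normalization (casing, punctuation, and filler handling, which real benchmarks treat explicitly):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])      # substitution / match
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("on" -> "in") and one deletion ("uh") out of 5 reference words:
print(word_error_rate("we met on uh tuesday", "we met in tuesday"))  # 0.4
```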

Corpus documentation emphasizes the systematic degradation of ASR/TTS performance under spontaneous and far-field conditions, motivating targeted domain-adaptive training and data augmentation.
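One common augmentation strategy for closing the far-field gap is to reverberate and add noise to close-talk recordings. The toy sketch below uses a synthetic exponentially decaying impulse response and white noise, both assumptions made for illustration; production pipelines typically use measured room impulse responses (or room simulators) and recorded noise instead.

```python
import numpy as np

def simulate_far_field(speech: np.ndarray, sr: int = 16000,
                       rt60: float = 0.4, snr_db: float = 10.0) -> np.ndarray:
    """Toy far-field augmentation: convolve speech with a synthetic
    decaying impulse response, then add white noise at a target SNR."""
    rng = np.random.default_rng(0)
    # Exponentially decaying noise burst as a crude room impulse response:
    # amplitude falls by ~60 dB at t = rt60 (exp(-6.9) ~ 1e-3).
    ir_len = int(rt60 * sr)
    t = np.arange(ir_len) / sr
    ir = rng.standard_normal(ir_len) * np.exp(-6.9 * t / rt60)
    reverberant = np.convolve(speech, ir)[: len(speech)]
    # Scale white noise to the requested SNR relative to the reverberant signal.
    sig_power = np.mean(reverberant ** 2)
    noise = rng.standard_normal(len(reverberant))
    noise *= np.sqrt(sig_power / (10 ** (snr_db / 10)) / np.mean(noise ** 2))
    return reverberant + noise
```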

5. Licensing, Access, and Reproducibility

Most modern spontaneous conversational corpora are released under open licenses, maximizing their utility for academic and commercial research:

  • Open Licenses: Variants of Creative Commons (CC-BY, CC-BY-SA, CC-NC, etc.) are common (LOTUSDIS: CC-BY-SA 4.0; J-CHAT: MIT-style; BEA-Large/Dialogue: research license; SwissGPC: links to source audio only).
  • Data and Code Availability: Corpora typically provide audio, detailed transcriptions, metadata, baseline system code, and evaluation scripts (LOTUSDIS (Tipaksorn et al., 23 Sep 2025), NURC-SP (Lima et al., 10 Sep 2024), Lahjoita puhetta (Moisio et al., 2022)).
  • Reproducibility: ASR/TTS recipes, data splits, and evaluation sets are released, often with explicit speaker-independent splits and detailed documentation (a toy split function follows this list).
  • Limitations: Some corpora provide only download links to source audio (SwissGPC, due to distribution restrictions), or are limited to non-commercial use.
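The sketch below illustrates the speaker-independent convention mentioned above: whole speakers, never individual utterances, are assigned to one side of the split. It is a minimal example; released splits typically also balance duration, dialect, and recording condition.

```python
import random
from collections import defaultdict

def speaker_independent_split(utterances, test_fraction=0.1, seed=0):
    """Assign whole speakers to train/test so no speaker appears on both sides.

    `utterances` is an iterable of (utterance_id, speaker_id) pairs.
    """
    by_speaker = defaultdict(list)
    for utt_id, spk_id in utterances:
        by_speaker[spk_id].append(utt_id)
    speakers = sorted(by_speaker)              # sort first for reproducibility
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for s in speakers[n_test:] for u in by_speaker[s]]
    test = [u for s in test_speakers for u in by_speaker[s]]
    return train, test
```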

Corpus infrastructure may include semi-automated pipelines for data cleaning, diarization, and multi-modal alignment.

6. Applications, Impact, and Challenges

Spontaneous conversational speech corpora underpin a wide array of research and engineering tasks:

  • Automatic Speech Recognition (ASR): Training and benchmarking ASR for spontaneous, low-resource, code-switched, or dialect-rich settings (LOTUSDIS, J-CHAT, NURC-SP, CORAA, BEA-Large/Dialogue).
  • Text-to-Speech Synthesis (TTS): Enabling prosodically and contextually appropriate TTS models capable of producing spontaneous, conversational output (Zhou et al., 4 Sep 2025; Guo et al., 2020; SaSLaW).
  • Sociolinguistics and Dialectometry: Large-scale studies of regional variation, dialect identification, and adaptation (SwissGPC, Isan, Lahjoita puhetta).
  • Speech Enhancement and Far-field Robustness: Analyzing the impact of distance, device coloration, and reverberation, and informing robust model design (LOTUSDIS).
  • Dialogue Modeling and Human-Machine Interaction: Providing training and evaluation data for spoken dialogue systems, turn-taking models, and interactional grounding.
  • Cross-lingual and Low-Resource Research: Addressing data sparsity in languages and settings with minimal pre-existing resources (BEA-Dialogue, Isan, SwissGPC, SPIRE-SIES).
  • Multi-modal and Environmental Adaptation: Advancing joint audio-visual systems and environment-aware speech technology (Spot the Difference, SaSLaW).

Continuing challenges include annotation consistency (requiring high inter-annotator agreement), scalability to new languages and environments, speaker privacy, and detailed documentation of device and environmental metadata.

7. Future Directions and Expanding the Resource Landscape

Emerging trends and research imperatives include:

  • Corpus Scale and Diversity: Expansion to tens of thousands of hours (J-CHAT, SwissGPC), broadening speaker, dialect, and demographic representation.
  • Semi-supervised Annotation Pipelines: Weak annotation at scale via ASR and diarization, with targeted manual quality assurance (SwissGPC, J-CHAT, CASPER).
  • Fine-grained Phenomenon Annotation: Explicit encoding of prosody, disfluency, environmental changes (SaSLaW), conversational structure (ding-01), and code-switch events (ASCEND, Isan).
  • Downstream Robustness: Development of ASR/TTS systems that are more resistant to spontaneous, far-field, code-switched, and overlapped speech; evaluation on increasingly challenging benchmarks.
  • Open, Reproducible Infrastructure: Release of not only data and code, but also detailed recipes, multi-modal assets (audio, video, sensor streams), and full metadata to ensure sustainable and extensible research impact.

A plausible implication is that as spontaneous conversational corpora become larger and more richly annotated, they will drive the next generation of robust, context-aware speech and dialogue systems capable of operating in unconstrained, real-world environments.
