
LibriCSS Dataset for Realistic CSS

Updated 27 September 2025
  • LibriCSS is a dataset that simulates realistic meeting conditions, featuring varied overlapping speech and far-field recordings to benchmark continuous speech separation algorithms.
  • It is constructed by concatenating utterances from multiple speakers and replaying them through a circular microphone array to capture natural reverberation and spatial cues.
  • The evaluation protocols integrate ASR accuracy with signal fidelity measures, enabling robust assessment of CSS models under diverse overlap conditions.

LibriCSS is a publicly available dataset constructed to advance continuous speech separation (CSS) research by providing a testbed that more accurately reflects the complexities of real-world conversational speech in meeting environments. The dataset is specifically designed to simulate natural conversations—including both overlapped and overlap-free speech—recorded in realistic far-field conditions with a circular microphone array. Rather than relying on artificially mixed, fully overlapped, and pre-segmented utterances, LibriCSS supports rigorous algorithmic evaluation by tightly coupling separation quality not just to signal fidelity metrics but also to automatic speech recognition (ASR) accuracy.

1. Dataset Construction and Conversation Simulation

LibriCSS is derived from the LibriSpeech corpus by randomly selecting utterances from the pool of 40 "test-clean" speakers, simulating conversations by concatenating utterances from (typically) eight speakers. The constructed audio sessions each last approximately one hour and are partitioned into six mini sessions of ten minutes. Each mini session is engineered to exhibit a specific overlap ratio (OVR), which quantifies the degree of concurrent speech as

\mathrm{OVR} = L_{\mathrm{ovl}} / L_{\mathrm{all}}

where L_{\mathrm{ovl}} is the total duration of speech overlap and L_{\mathrm{all}} is the total speech duration. Overlap ratios span from 0% (strict turn-taking) up to 40% to model the spectrum of interactional complexity encountered in meetings. The 0% overlap condition includes versions with both short (0.1–0.5 s) and long (2.9–3.0 s) inter-utterance silences to replicate rapid and slow speaker transitions, respectively.
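The definition above can be made concrete with a short sketch that computes the overlap ratio from per-utterance time stamps. The segment list and scan resolution here are illustrative choices, not LibriCSS tooling:

```python
def overlap_ratio(segments, step=0.01):
    """OVR = L_ovl / L_all: L_ovl is the total duration during which two
    or more speakers are active, L_all the total duration during which at
    least one speaker is active. `segments` is a list of (start, end)
    times in seconds; `step` is the scan resolution in seconds."""
    if not segments:
        return 0.0
    end = max(e for _, e in segments)
    counts = [0] * int(round(end / step))
    for s, e in segments:
        for i in range(int(round(s / step)), int(round(e / step))):
            counts[i] += 1
    l_all = sum(1 for c in counts if c >= 1) * step
    l_ovl = sum(1 for c in counts if c >= 2) * step
    return l_ovl / l_all if l_all else 0.0

# Two speakers: speech spans 0-4 s (L_all = 4 s), they overlap during
# 2-3 s (L_ovl = 1 s), so OVR = 0.25.
segs = [(0.0, 3.0), (2.0, 4.0)]
print(overlap_ratio(segs))  # -> 0.25
```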

2. Realistic Far-Field Microphone Array Recordings

Uniquely, LibriCSS eschews software-based mixture generation in favor of acoustic replay in a physical meeting room using a circular seven-channel microphone array. Loudspeaker positions are randomized for each session at varied distances (33 cm–409 cm) from the microphones, introducing natural room impulse responses, reverberation, and spatial cues, thereby presenting a challenging separation scenario for CSS algorithms. This approach ensures that the dataset embodies the kind of signal distortions, microphone variability, and spatial information that characterize real recording environments.
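The spatial cues such a setup captures can be illustrated with a small geometric sketch: for a seven-channel circular array (here assumed to be six microphones on a ring plus one at the centre, a common configuration; the ring radius used below is a hypothetical value, not the documented LibriCSS geometry), a far-field source at a given azimuth reaches each microphone with a slightly different delay.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def circular_array_delays(azimuth_deg, radius=0.0425, n_mics=6):
    """Relative arrival delays (seconds) of a far-field plane wave at the
    given azimuth, for `n_mics` microphones on a ring of `radius` metres
    plus one centre mic (index 0). Positive delay means the wavefront
    arrives later than at the array centre."""
    theta = np.deg2rad(azimuth_deg)
    angles = 2 * np.pi * np.arange(n_mics) / n_mics
    # mic positions: centre mic at the origin, then the ring
    xy = np.vstack([[0.0, 0.0],
                    radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)])
    direction = np.array([np.cos(theta), np.sin(theta)])
    # delay is the negated projection onto the propagation direction
    return -(xy @ direction) / C

delays = circular_array_delays(0.0)  # source at 0 degrees azimuth
```

The mic facing the source leads the centre by roughly 120 microseconds for a ~4 cm ring, and the opposite mic lags by the same amount; it is exactly these inter-channel delay patterns that spatial separation and beamforming methods exploit.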

3. CSS Task Definition and Algorithmic Evaluation Protocols

Continuous speech separation (CSS) algorithms under LibriCSS must operate on unsegmented, continuous audio, generating a set of clean, non-overlapped speech signals—optimally one per speaker. The evaluation protocol is centered on ASR accuracy rather than pure signal fidelity, based on findings that signal-level metrics such as SDR and SISNR exhibit weak correlation with downstream ASR performance.

A Kaldi-based ASR protocol is adopted:

  • The acoustic model is a multi-conditional BLSTM (three layers, 512 cells in each direction per layer), trained on the 960-hour LibriSpeech corpus (clean and noisy conditions).
  • Initial cross-entropy training is followed by MMI sequence training.
  • A standard 4-gram language model is used for decoding.
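The protocol above scores hypotheses by word error rate (WER). Independent of any Kaldi tooling, WER is simply word-level edit distance (substitutions, insertions, deletions) normalized by reference length; a minimal reference implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance between the word sequences,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / len(ref)

# one substitution out of four reference words -> WER = 0.25
print(wer("the cat sat down", "the cat sat sown"))
```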

LibriCSS provides two evaluation modalities:

Utterance-wise evaluation: Each utterance is isolated via cross-correlation alignment with close-talking reference signals; separated streams are fed into the ASR, and the lowest WER output per utterance is reported.
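The alignment step can be sketched with NumPy: locate an utterance inside a longer recording by cross-correlating it with the close-talking reference and taking the lag with the highest correlation. The synthetic signals below are illustrative; real use would operate on the actual recordings:

```python
import numpy as np

def align_offset(far_field, reference):
    """Sample offset at which `reference` best matches `far_field`,
    found via cross-correlation over all valid lags."""
    corr = np.correlate(far_field, reference, mode="valid")
    return int(np.argmax(corr))

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
# embed the reference at sample 500 of an otherwise silent stream
mix = np.concatenate([np.zeros(500), ref, np.zeros(500)])
print(align_offset(mix, ref))  # -> 500
```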

Continuous input evaluation: Long streams (60–120 s) are split during periods of silence, CSS outputs are segmented by VAD, and the best match to reference transcripts is scored using tools like Asclite for WER computation.
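The VAD-based segmentation step can be sketched with a simple energy detector; the frame length and threshold here are illustrative choices, not the LibriCSS tooling:

```python
import numpy as np

def vad_segments(signal, sr=16000, frame=0.02, threshold=1e-3):
    """Return (start, end) times in seconds of contiguous voiced regions,
    where a frame is voiced if its mean energy exceeds `threshold`."""
    hop = int(sr * frame)
    n = len(signal) // hop
    voiced = [np.mean(signal[i * hop:(i + 1) * hop] ** 2) > threshold
              for i in range(n)]
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # voiced region begins
        elif not v and start is not None:
            segments.append((start * frame, i * frame))
            start = None                   # voiced region ends
    if start is not None:
        segments.append((start * frame, n * frame))
    return segments
```

A real pipeline would use a trained neural VAD, but the output format (a list of speech segments per separated stream) is the same.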

This dual-mode evaluation enables robust measurement of both the separation algorithm's handling of overlapped speech and its tendency to degrade non-overlapped segments—a critical criterion for conversational transcription systems.

4. Characteristics and Technical Dimensions

LibriCSS is characterized by:

  • Continuous, far-field, multi-channel audio spanning diverse overlap ratios.
  • Real reverberant room simulations via acoustic replay rather than digital mixing.
  • Variable silence durations and overlap structures designed to probe algorithmic performance under dynamic conversational conditions.
  • Seven-channel circular microphone array recordings offering rich spatial cues pertinent for spatial and beamforming separation methods.
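How the spatial cues in these bullets get exploited can be shown with the simplest beamformer, delay-and-sum: shift each channel by its (here integer-sample, hypothetical) steering delay so the target source adds coherently across microphones, then average. This is a sketch of the principle, not any particular LibriCSS baseline:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """channels: (n_mics, n_samples) array; delays: per-mic integer
    sample delays toward the steering direction. Advances each channel
    by its delay so the target adds coherently, then averages."""
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)
    return out / len(channels)

# toy check: three copies of one source, delayed by 0, 3, and 7 samples,
# are realigned into the original signal
rng = np.random.default_rng(1)
base = rng.standard_normal(256)
chans = np.array([np.roll(base, d) for d in (0, 3, 7)])
steered = delay_and_sum(chans, [0, 3, 7])
```

Real systems use fractional delays and data-dependent weights (e.g. MVDR), but the underlying use of inter-channel delay is the same.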

These features enable fine-grained benchmarking across the spectrum of meeting transcription challenges.

5. Interaction with CSS Algorithms and ASR

LibriCSS has stimulated the development of advanced separation and recognition models:

  • Conformer-based CSS architectures have demonstrated strong WER reductions relative to BLSTM baselines by better exploiting both local and global acoustic context (Chen et al., 2020).
  • Multi-microphone complex spectral mapping frameworks—combining DNN-based spectral mapping with MVDR beamforming and speaker counting—achieve very low WERs (e.g. 5.8%–7.7% depending on overlap), outperforming prior DNN and conventional methods (Wang et al., 2020).
  • End-to-end speaker-attributed ASR using transformer models achieves state-of-the-art cpWER (11.9% with target speaker profiles), demonstrating robust monaural attribution in highly overlapped conditions (Kanda et al., 2021).
  • Joint dereverberation and separation techniques within independence-based frameworks show strong generalization from synthetic mixtures, achieving WERs as low as 10.2% with three-channel input (Saijo et al., 2021).
  • Recursive models (RSAN) and block-wise dependency mechanisms improve dynamic speaker adaptation and separation correctness, notably outperforming PIT-based methods (Zhang et al., 2021).
  • Unsupervised neural separators leveraging spatial loss based on DOA–beamforming alignment achieve near supervised performance with carefully clustered DOA estimation (Saijo et al., 2022).
  • Separator–transducer–segmenter (STS) models yield substantial gains in turn counting accuracy and WER by integrating segmentation and emission regularization (Sklyar et al., 2022).
  • TF-GridNet combined with mixture encoding provides near-oracle separation quality, with mixture encoders helping to mitigate separation artifacts, though the effect diminishes as separator strength increases (Vieting et al., 2023).
  • TF-CorrNet models leverage spatial correlation features and PHAT-β transforms to further reduce WER under utterance-wise and continuous evaluation schemes (Shin et al., 2025).

6. Impact on Evaluation and Benchmarking of CSS and Meeting Transcription

The LibriCSS dataset is pivotal in aligning the evaluation of CSS algorithms with real meeting transcription requirements, moving beyond simplistic mixture and signal-level metric paradigms. By directly coupling source separation performance to ASR outcomes in genuine continuous, multi-speaker settings, it provides a meaningful benchmark for measuring progress toward deployable meeting transcription and conversational ASR systems. LibriCSS has become the standard testbed for recent advances in spatially-informed source separation, robust beamforming, and end-to-end multi-party speech recognition.

7. Limitations and Future Prospects

While LibriCSS significantly advances the evaluation of CSS methods, several challenges persist:

  • The fixed array geometry and simulated conversational structure, while realistic, may not capture the full diversity of spontaneous turn-taking and unstructured meeting interactions.
  • Certain protocols (e.g., speaker profile availability in ASR evaluation) may not generalize to all real scenarios.
  • Over-separation, permutation alignment, and single-speaker stream suppression remain active research areas.

A plausible implication is that future iterations may explore adaptation to dynamic array geometries, in-domain training data generation, and more nuanced diarization integration. The continued use and evolution of LibriCSS are likely to further shape algorithm development for robust speech separation and recognition in unconstrained conversational environments.
