PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings (2403.02288v2)
Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, which leads to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with overseparation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) with MixIT for SSep. At the small extra cost of requiring SD labels, it solves the overseparation problem and allows stitching of locally separated sources by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT improves the performance of various ASR systems across two meeting corpora, in terms of both speaker-attributed and utterance-based word error rates, without requiring any fine-tuning.
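The abstract's core idea, a permutation-invariant diarization loss combined with a mixture-invariant separation loss under a joint objective, can be illustrated with a toy sketch. Everything below is an illustrative assumption rather than the paper's implementation: the function names, the plain MSE criterion (the paper uses neural networks with SNR-style separation losses and a binary-cross-entropy-style diarization loss), and the `alpha` weighting are all hypothetical.

```python
from itertools import permutations, product


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(est_sources, mixtures):
    """MixIT: assign each estimated source to one of the reference
    mixtures, sum the sources per mixture, and keep the assignment
    with the lowest reconstruction error."""
    best = float("inf")
    n_mix, n_samples = len(mixtures), len(mixtures[0])
    for assign in product(range(n_mix), repeat=len(est_sources)):
        sums = [[0.0] * n_samples for _ in range(n_mix)]
        for src, m in zip(est_sources, assign):
            for t, v in enumerate(src):
                sums[m][t] += v
        loss = sum(mse(s, m) for s, m in zip(sums, mixtures)) / n_mix
        best = min(best, loss)
    return best


def pit_loss(est_act, ref_act):
    """PIT: score every permutation of predicted speaker-activity
    tracks against the reference labels and keep the best one."""
    best = float("inf")
    for perm in permutations(range(len(est_act))):
        loss = sum(mse(est_act[p], ref_act[i])
                   for i, p in enumerate(perm)) / len(ref_act)
        best = min(best, loss)
    return best


def pixit_loss(est_act, ref_act, est_sources, mixtures, alpha=1.0):
    """Joint objective: diarization PIT loss plus a weighted
    MixIT separation loss (weighting scheme is an assumption)."""
    return pit_loss(est_act, ref_act) + alpha * mixit_loss(est_sources, mixtures)
```

For example, if the three estimated sources can be grouped to reconstruct the two reference mixtures exactly, `mixit_loss` is zero regardless of output ordering, and `pit_loss` is zero whenever some permutation of the predicted activity tracks matches the references.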
- Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- “Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation,” in ICASSP, 2020.
- “Continuous speech separation: Dataset and analysis,” in ICASSP, 2020.
- “Unsupervised sound separation using mixture invariant training,” in NeurIPS, 2020.
- “Adapting speech separation to real-world meetings using mixture invariant training,” in ICASSP, 2022.
- “Self-supervised learning-based source separation for meeting data,” in ICASSP, 2023.
- “Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks,” Computer Speech and Language, vol. 71, p. 101254, 2022.
- “End-to-end neural speaker diarization with permutation-free objectives,” in Interspeech, 2019.
- “End-to-end neural speaker diarization with self-attention,” in ASRU, 2019.
- “Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech,” in Interspeech, 2021.
- “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in ICASSP, 2021.
- “A deep analysis of speech separation guided diarization under realistic conditions,” in APSIPA ASC, 2021.
- “Low-latency speech separation guided diarization for telephone conversations,” in SLT, 2022.
- “TS-SEP: Joint diarization and separation conditioned on estimated speaker embeddings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1185–1197, 2024.
- “GPU-accelerated guided source separation for meeting transcription,” in Interspeech, 2023.
- “All-neural online source separation, counting, and diarization for meeting analysis,” in ICASSP, 2019.
- “EEND-SS: Joint end-to-end neural speaker diarization and speech separation for flexible number of speakers,” in SLT, 2022.
- “The AMI meeting corpus: A pre-announcement,” in ICMI, 2005.
- “M2Met: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in ICASSP, 2022.
- Yi Luo and Nima Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in ICASSP, 2018.
- “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- “SDR – half-baked or well done?,” in ICASSP, 2019.
- Hervé Bredin, “pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark, and recipe,” in Interspeech, 2023.
- “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Interspeech, 2020.
- “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
- “On word error rate definitions and their efficient computation for multi-speaker speech recognition systems,” in ICASSP, 2023.
- “CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in CHiME Workshop, 2020.
- “The Kaldi speech recognition toolkit,” in ASRU, 2011.
- “Multiple dimension Levenshtein edit distance calculations for evaluating automatic speech recognition systems during simultaneous speech,” in LREC, 2006.
- “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
- “NeMo: a toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
- “Open automatic speech recognition leaderboard,” https://huggingface.co/spaces/open-asr-leaderboard/leaderboard, 2023.
- “WhisperX: Time-accurate speech transcription of long-form audio,” in Interspeech, 2023.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NeurIPS, 2020.
- “Powerset multi-class cross entropy loss for neural speaker diarization,” in Interspeech, 2023.
- “Asteroid: the PyTorch-based audio source separation toolkit for researchers,” in Interspeech, 2020.
- “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- “Overlap-aware end-to-end supervised hierarchical graph clustering for speaker diarization,” arXiv preprint arXiv:2401.12850, 2024.