
End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations (2303.12002v3)

Published 21 Mar 2023 in eess.AS, cs.LG, and cs.SD

Abstract: Recent work shows that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in both online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and the Fisher Corpus (Parts 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
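
The two-stage idea described in the abstract (separate the speakers, then run VAD on each separated stream) can be illustrated with a minimal sketch. It is not the paper's implementation: the separation step is assumed to have already happened, and a toy energy-threshold VAD stands in for the neural VAD. The names `energy_vad` and `ssgd_diarize`, the frame length, and the threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def energy_vad(
    stream: Sequence[float],
    sample_rate: int,
    frame_s: float = 0.1,      # hypothetical frame length, illustration only
    threshold: float = 1e-3,   # hypothetical mean-power threshold
) -> List[Segment]:
    """Toy VAD: a frame is 'speech' if its mean power exceeds the threshold."""
    frame_len = max(1, int(round(frame_s * sample_rate)))
    n_frames = (len(stream) + frame_len - 1) // frame_len
    segments: List[Segment] = []
    start = None
    for i in range(n_frames):
        frame = stream[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / len(frame)
        t = i * frame_len / sample_rate
        if power > threshold and start is None:
            start = t                      # speech onset
        elif power <= threshold and start is not None:
            segments.append((start, t))    # speech offset
            start = None
    if start is not None:                  # stream ended while still active
        segments.append((start, len(stream) / sample_rate))
    return segments


def ssgd_diarize(
    separated_streams: Sequence[Sequence[float]],
    sample_rate: int,
) -> Dict[str, List[Segment]]:
    """Apply VAD independently to each separated stream.

    Assumes an upstream separator has already produced one stream per
    speaker; each stream's active segments become that speaker's turns.
    """
    return {
        f"spk{k}": energy_vad(s, sample_rate)
        for k, s in enumerate(separated_streams)
    }
```

Because each stream is processed on its own, overlapped speech is handled naturally: two streams can be active at the same time. The sketch also shows why residual leakage between separated streams matters, since any leaked energy above the threshold would register as a false alarm for the wrong speaker.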
