Speaker Diarization with LSTM: A Comprehensive Overview
The paper "Speaker Diarization with LSTM" by Quan Wang et al. advances the field of speaker diarization by leveraging LSTM-based d-vector embeddings, a modern departure from traditional i-vector methodologies. Neural networks have increasingly dominated speech and audio processing tasks, and this work solidifies d-vector embeddings as a viable alternative for text-independent speaker diarization.
Overview
Speaker diarization assigns speaker labels to segments of audio, answering "who spoke when." It enhances applications like multimedia retrieval and automatic speech recognition (ASR). Traditional systems represent speech segments with i-vector embeddings, but deep learning models, particularly those producing d-vectors, have demonstrated superior ability to capture speaker characteristics from audio signals.
This paper extends the d-vector model, emphasizing LSTM architectures' suitability for sequential data processing, which aligns well with the temporal nature of speech. The authors pair these embeddings with a refined spectral clustering algorithm, achieving state-of-the-art diarization performance.
Methodology
The diarization system consists of several phases:
- Audio Preprocessing: Audio is divided into overlapping frames, and each frame is converted to log-mel filterbank energies. These features are fed into an LSTM network to generate frame-level d-vectors.
- Embedding Extraction: Window-level d-vectors are L2-normalized, then averaged to produce a fixed-length embedding for each speech segment (a sketch of this pipeline follows the list).
- Clustering: Segment embeddings are grouped by speaker. Four algorithms are compared: naive online clustering, Links online clustering, k-means offline clustering, and spectral offline clustering. The last applies a novel sequence of affinity-matrix refinement operations, the paper's main clustering contribution (a clustering sketch follows the refinement discussion below).
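The paper does not publish pipeline code; the following is a minimal Python sketch of the preprocessing and embedding steps above. The LSTM here is an untrained stand-in for the speaker-discriminative network the authors train separately, and the dimensions (40 mels, 24-frame windows, 256-dim d-vectors) are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch of the d-vector pipeline; the LSTM weights are untrained
# stand-ins, and all dimensions below are illustrative assumptions.
import numpy as np
import librosa
import torch

N_MELS = 40          # log-mel filterbank channels (assumed)
WINDOW_FRAMES = 24   # frames per d-vector inference window (assumed)

def log_mel_frames(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Frame the waveform and compute log-mel filterbank energies (frames x mels)."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=N_MELS)
    return np.log(mel + 1e-6).T

class DVectorLSTM(torch.nn.Module):
    """Stand-in LSTM speaker encoder; the final hidden state serves as the d-vector."""
    def __init__(self, n_mels: int = N_MELS, dim: int = 256):
        super().__init__()
        self.lstm = torch.nn.LSTM(n_mels, dim, num_layers=3, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(x)   # h: (num_layers, batch, dim)
        return h[-1]               # last layer's final hidden state

def segment_embedding(frames: np.ndarray, model: DVectorLSTM) -> np.ndarray:
    """L2-normalize window-level d-vectors, then average them over the segment."""
    windows = [frames[i:i + WINDOW_FRAMES]
               for i in range(0, len(frames) - WINDOW_FRAMES + 1, WINDOW_FRAMES)]
    batch = torch.tensor(np.stack(windows), dtype=torch.float32)
    with torch.no_grad():
        d = model(batch).numpy()
    d /= np.linalg.norm(d, axis=1, keepdims=True)   # per-window L2 normalization
    return d.mean(axis=0)                           # fixed-length segment embedding
```

A trained encoder would replace the random weights; the resulting segment embeddings feed directly into the clustering stage.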
Spectral Clustering Refinement: The proposed spectral clustering algorithm applies a sequence of refinement operations to the affinity matrix: Gaussian blur, row-wise thresholding, symmetrization, diffusion, and row-wise max normalization. These operations denoise the matrix and sharpen inter-speaker distinctions, considerably boosting diarization accuracy.
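Below is a minimal NumPy/scikit-learn sketch of these refinements followed by the spectral clustering step; the blur width, thresholding percentile, and the soft multiplier for sub-threshold entries are illustrative choices, not values prescribed by the paper.

```python
# Sketch of affinity-matrix refinement and spectral clustering; sigma, the
# percentile p, and the 0.01 soft-threshold multiplier are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans

def refine_affinity(A: np.ndarray, sigma: float = 1.0, p: float = 95.0) -> np.ndarray:
    A = gaussian_filter(A, sigma)                   # Gaussian blur (denoising)
    row_thresh = np.percentile(A, p, axis=1, keepdims=True)
    A = np.where(A < row_thresh, A * 0.01, A)       # row-wise (soft) thresholding
    A = np.maximum(A, A.T)                          # symmetrization
    A = A @ A.T                                     # diffusion
    return A / A.max(axis=1, keepdims=True)         # row-wise max normalization

def spectral_diarize(embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """Refine the cosine affinity of segment embeddings, then eigen-embed + k-means."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = refine_affinity(X @ X.T)       # cosine affinity matrix
    vals, vecs = np.linalg.eig(A)      # A is no longer symmetric after normalization
    top = np.argsort(vals.real)[::-1][:n_speakers]
    V = vecs[:, top].real              # spectral coordinates of each segment
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(V)
```

In the paper, the number of speakers is itself estimated from the largest eigen-gap of the refined affinity matrix; here it is passed in explicitly for brevity.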
Experimental Results
The authors tested their system on multiple public datasets: CALLHOME American English, NIST RT-03 CTS, and NIST SRE 2000. Across these, the d-vector system with spectral offline clustering performed best. Notably, it achieved a 12.0% diarization error rate (DER) on the NIST SRE 2000 CALLHOME dataset, a significant improvement over i-vector systems and other published benchmarks. The combination of LSTM-based d-vectors with spectral clustering proved instrumental in pushing the performance envelope in speaker diarization.
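For reference, DER sums three error types over the scored speech time (NIST scoring conventionally applies a short forgiveness collar around reference speaker boundaries):

$$
\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total speech}}}
$$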
Implications and Future Work
The findings of this paper open new avenues for diarization, suggesting broader applications where text-independent speaker identification is necessary. The reliance on LSTM architectures for d-vector extraction underscores a shift toward recurrent models in audio processing, which could steer future research trajectories.
Potential future work includes using in-domain training data and integrating resegmentation strategies for improved precision. Enhancing the robustness of online clustering methods to approach offline performance remains another open challenge.
In conclusion, the methodology introduced in this paper exemplifies an effective synthesis of LSTM-based d-vector embeddings and advanced non-parametric clustering techniques, marking a pivotal evolution from legacy i-vector methods. As research continues, such advances are likely to shape broader systems for speaker verification and recognition.