Speaker Recognition in the Wild

Published 5 May 2022 in cs.SD, cs.CL, and eess.AS | (2205.02475v1)

Abstract: In this paper, we propose a pipeline to find the number of speakers, as well as audios belonging to each of these now identified speakers in a source of audio data where number of speakers or speaker labels are not known a priori. We used this approach as a part of our Data Preparation pipeline for Speech Recognition in Indic Languages (https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). To understand and evaluate the accuracy of our proposed pipeline, we introduce two metrics: Cluster Purity, and Cluster Uniqueness. Cluster Purity quantifies how "pure" a cluster is. Cluster Uniqueness, on the other hand, quantifies what percentage of clusters belong only to a single dominant speaker. We discuss more on these metrics in the section on metrics. Since we develop this utility to aid us in identifying data based on speaker IDs before training an Automatic Speech Recognition (ASR) model, and since most of this data takes considerable effort to scrape, we also conclude that 98% of data gets mapped to the top 80% of clusters (computed by removing any clusters with less than a fixed number of utterances -- we do this to get rid of some very small clusters and use this threshold as 30), in the test set chosen.


Authors (7)

Summary

The paper presents an unsupervised pipeline using deep embeddings and HDBSCAN to cluster speakers from unlabeled audio recordings.
It employs Voice Activity Detection and partial segmentation to produce 256-dimensional embeddings, successfully applying an English-trained model to Indic language data.
On the chosen test set the pipeline reaches 96% Cluster Purity and 84.81% Cluster Uniqueness, which suggests the approach transfers beyond the language the encoder was trained on.

Speaker Recognition in the Wild: An Analytical Overview

The paper "Speaker Recognition in the Wild" introduces an unsupervised pipeline for speaker recognition when neither the number of speakers nor their labels are known in advance. The approach was built for processing audio data in Indic languages and uses speaker clustering to assign speaker labels. Because it does not require labeled datasets, it is useful in scenarios where metadata is sparse or absent.

Methodology

The methodology centers on speaker clustering, an unsupervised technique that identifies the unique speakers in a batch of audio recordings without predefined labels. The authors use Voice Activity Detection to split the audio into shorter chunks (utterances) so that each chunk is likely to belong to a single speaker. The primary components of the pipeline are:

Deep Embedding Generation: This is achieved using Resemblyzer's open-source pre-trained neural network, which generates 256-dimensional embeddings for each utterance. Despite being trained exclusively on English, this model successfully encodes speaker information for Hindi as demonstrated by the authors.
Clustering Algorithm: The authors use Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group the deep embeddings into speaker clusters. To cope with memory constraints and improve cluster accuracy, large datasets are split into partial sets that are clustered separately. The resulting clusters are then merged iteratively based on cosine similarity, which refines the groupings.
Cluster Evaluation Metrics: Two metrics, Cluster Purity and Cluster Uniqueness, are proposed for evaluating the effectiveness of clustering. Cluster Purity assesses how well a cluster consists of utterances from a singular dominant speaker, while Cluster Uniqueness measures the exclusivity of clusters to individual speakers.
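
The merging step described under the clustering algorithm can be illustrated with a minimal sketch. It is not the authors' implementation: representing each cluster by the mean of its embeddings, merging the single most similar pair per iteration, the 0.9 threshold, and the function names are all assumptions made for illustration. The toy vectors are 2-dimensional stand-ins for the 256-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def merge_clusters(clusters, threshold=0.9):
    """Merge clusters whose centroids are similar (hypothetical helper).

    clusters maps a cluster id to its list of embeddings. On each pass the
    most similar pair of centroids is merged if its cosine similarity is at
    least `threshold` (an assumed value); the loop stops when no pair
    qualifies.
    """
    merged = {cid: list(vecs) for cid, vecs in clusters.items()}
    while len(merged) > 1:
        centroids = {cid: centroid(vecs) for cid, vecs in merged.items()}
        ids = sorted(merged)
        best = None  # (similarity, id_to_keep, id_to_drop)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                sim = cosine_similarity(centroids[a], centroids[b])
                if sim >= threshold and (best is None or sim > best[0]):
                    best = (sim, a, b)
        if best is None:
            break
        _, keep, drop = best
        merged[keep].extend(merged.pop(drop))
    return merged


# Toy example: clusters found in two partial sets ("p0" and "p1").
partial_clusters = {
    "p0_c0": [[1.0, 0.0], [0.9, 0.1]],
    "p1_c0": [[1.0, 0.05]],   # points the same way as p0_c0
    "p1_c1": [[0.0, 1.0]],    # a different direction, i.e. another speaker
}
result = merge_clusters(partial_clusters, threshold=0.9)
print({cid: len(vecs) for cid, vecs in result.items()})
```

In this toy input the two clusters pointing in nearly the same direction are merged, while the third stays separate. Recomputing the centroids after every merge keeps the comparison consistent as clusters grow, at the cost of quadratic work per pass.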

Results and Analysis

The analysis conducted on a test set consisting of 80 speakers over 20 hours, equally divided between male and female speakers, yielded significant findings:

Cluster Purity was reported at 96%, indicating that individual clusters primarily comprised utterances from a single speaker.
Cluster Uniqueness was 84.81%, the share of clusters that belong only to a single dominant speaker.
Only 1.35% of utterances were categorized as noise, suggesting the algorithm’s effectiveness in minimizing data loss.
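
As a rough illustration of how these three figures could be computed from ground-truth speaker labels, here is a minimal sketch. The exact formulas are assumptions, since the summary does not give them: purity is taken as the unweighted mean, over clusters, of the fraction of utterances from each cluster's dominant speaker, and uniqueness as the number of distinct dominant speakers divided by the number of clusters. The label `-1` for noise follows the convention of the `hdbscan` library; the function name is hypothetical.

```python
from collections import Counter, defaultdict

NOISE = -1  # label used by the hdbscan library for unclustered points


def evaluate_clusters(cluster_labels, speaker_labels):
    """Return (purity, uniqueness, noise_fraction) under assumed definitions."""
    if len(cluster_labels) != len(speaker_labels) or not cluster_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    # Group the true speaker of every non-noise utterance by its cluster.
    speakers_by_cluster = defaultdict(list)
    for cluster, speaker in zip(cluster_labels, speaker_labels):
        if cluster != NOISE:
            speakers_by_cluster[cluster].append(speaker)
    if not speakers_by_cluster:
        raise ValueError("every utterance was labeled as noise")

    purities = []
    dominant_speakers = []
    for speakers in speakers_by_cluster.values():
        speaker, count = Counter(speakers).most_common(1)[0]
        purities.append(count / len(speakers))
        dominant_speakers.append(speaker)

    # Assumed definition: unweighted mean of per-cluster purity.
    purity = sum(purities) / len(purities)
    # Assumed definition: distinct dominant speakers per cluster.
    uniqueness = len(set(dominant_speakers)) / len(dominant_speakers)
    noise_fraction = sum(1 for c in cluster_labels if c == NOISE) / len(cluster_labels)
    return purity, uniqueness, noise_fraction


# Toy example: three clusters, two of them dominated by speaker "a".
clusters = [0, 0, 0, 0, 1, 1, 2, 2, -1, -1]
speakers = ["a", "a", "a", "b", "b", "b", "a", "a", "c", "a"]
purity, uniqueness, noise = evaluate_clusters(clusters, speakers)
print(f"purity={purity:.3f} uniqueness={uniqueness:.3f} noise={noise:.3f}")
```

In the toy data, speaker "a" dominates two clusters, so uniqueness drops below 1 even though each of those clusters is fairly pure. This is the kind of over-splitting that purity alone would not reveal.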

These results underscore the pipeline's proficiency in accurately identifying speaker clusters from unlabeled datasets, even when the dataset is diverse in gender composition.

Implications and Future Directions

The study matters for unsupervised speaker recognition because it describes a label-free clustering framework that works across linguistic boundaries, which is particularly useful for multilingual speaker recognition applications. Future directions could involve:

Training voice encoder models specifically on Indic languages to improve the embeddings, especially when the data exhibits substantial linguistic diversity.
Exploring deep clustering methods that could remove the need for the explicit hyperparameter tuning associated with traditional clustering algorithms.
Examining how single-gender datasets affect clustering and addressing speaker similarity issues, particularly among speakers of the same gender, to improve cluster purity.

In conclusion, "Speaker Recognition in the Wild" provides a practical framework for unsupervised speaker recognition in datasets lacking prior speaker labels. Its use of a voice encoder pre-trained on English to embed Indic-language speech shows that such embeddings can transfer across languages, opening avenues for further work on non-English speaker recognition systems.
