
In defence of metric learning for speaker recognition (2003.11982v2)

Published 26 Mar 2020 in eess.AS and cs.SD

Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and those trained with our proposed metric learning objective outperform state-of-the-art methods.

Citations (422)

Summary

  • The paper demonstrates that an angular prototypical loss variant achieves superior performance over classification-based approaches in open-set speaker recognition.
  • The study rigorously compares several loss functions using the VoxCeleb dataset, emphasizing enhanced inter-speaker separability and reduced intra-speaker variability.
  • Extensive experiments using 20,000 GPU-hours reveal that larger batch sizes and tuned hyperparameters significantly boost the effectiveness of metric learning frameworks.

An Evaluation of Metric Learning in Speaker Recognition

The paper "In Defence of Metric Learning for Speaker Recognition" explores the domain of speaker recognition, specifically focusing on open-set scenarios where the target is to generalize well to unseen speakers. The authors address a contentious belief prevalent in the field: that classification-based approaches outperform metric learning methods. Through rigorous experimentation, the paper compares traditional classification objectives against metric learning objectives, eventually proposing an angular variant of the prototypical loss function that demonstrates superior performance.

Overview of Research Focus

Speaker recognition divides into two principal settings: closed-set and open-set recognition. Closed-set recognition is essentially a classification problem in which every target speaker appears in the training set, whereas open-set recognition reflects the more practical scenario in which previously unseen speakers must be recognized. This paper focuses on the open-set case, which demands embeddings with low intra-speaker variability and high inter-speaker separability.
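
In practice, such embeddings are used for pairwise verification: two utterances are mapped to fixed-size vectors and accepted as the same speaker when their similarity clears a threshold. Below is a minimal PyTorch sketch of that decision; the embedding network embed and the threshold value are hypothetical stand-ins, not the paper's trained model.

```python
import torch.nn.functional as F

def verify(embed, utterance_a, utterance_b, threshold=0.5):
    """Open-set verification: accept if two utterance embeddings are close.

    `embed` is any network mapping an utterance tensor to a fixed-size
    embedding; the threshold would be tuned on a held-out set.
    """
    ea = F.normalize(embed(utterance_a), dim=-1)  # unit-length embedding
    eb = F.normalize(embed(utterance_b), dim=-1)
    score = (ea * eb).sum(-1)                     # cosine similarity
    return score > threshold                      # same speaker?
```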

Methodology and Loss Functions

The paper provides a systematic comparison of various loss functions used for training speaker recognition models, utilizing the VoxCeleb dataset as the evaluation benchmark. Among the loss functions investigated are:

  • Softmax Loss: Standard cross-entropy loss, widely used for classification, though it does not explicitly enforce an inter-class margin.
  • AM-Softmax and AAM-Softmax: Variants of softmax that add margin-based penalties to the target class, enhancing discriminative power.
  • Triplet Loss: A metric learning loss that directly optimizes for intra-class compactness and an inter-class margin.
  • Prototypical Networks: A metric learning approach that classifies each query by its distance to class prototypes, the means of the support embeddings (a sketch of the paper's angular variant follows this list).
  • Generalized End-to-End (GE2E) Loss: A loss previously proposed specifically for speaker recognition, which improves separability by contrasting each utterance against the centroids of all speakers in the batch.
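
The paper's proposed objective is an angular variant of the prototypical loss: queries are compared to prototypes by cosine similarity with a learned scale and bias, and the matched query–prototype pairs are trained to dominate the resulting similarity matrix via cross-entropy. The following is a minimal PyTorch sketch under that description; the variable names and the clamping of the scale are choices of this sketch, not taken verbatim from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularPrototypicalLoss(nn.Module):
    """Angular prototypical loss: prototypical loss with the cosine-based
    similarity S(x, c) = w * cos(x, c) + b, where w and b are learned."""

    def __init__(self, init_w=10.0, init_b=-5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))
        self.b = nn.Parameter(torch.tensor(init_b))

    def forward(self, embeddings):
        # embeddings: (N speakers, M utterances per speaker, D dims)
        query = embeddings[:, -1]                # last utterance is the query
        proto = embeddings[:, :-1].mean(dim=1)   # prototype = mean of the rest
        # Pairwise cosine similarity between every query and every prototype.
        cos = F.normalize(query, dim=-1) @ F.normalize(proto, dim=-1).T
        self.w.data.clamp_(min=1e-6)             # keep the scale positive
        logits = self.w * cos + self.b           # (N, N) similarity matrix
        labels = torch.arange(embeddings.size(0), device=embeddings.device)
        return F.cross_entropy(logits, labels)   # matched pairs on the diagonal
```

Because the scale and bias are learned rather than fixed, this objective has no margin or scale hyperparameters to tune by hand, in contrast to the margin-based softmax variants.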

Experimental Setup and Results

The authors conducted an exhaustive suite of experiments, totalling some 20,000 GPU-hours, maintaining consistent training conditions across loss functions and network architectures such as Thin ResNet and VGG-M-40. They also ran detailed hyperparameter sweeps, notably for AM-Softmax and AAM-Softmax, which proved highly sensitive to their margin and scale settings.
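
To make that sensitivity concrete, here is a minimal sketch of AM-Softmax showing exactly where the margin m and scale s enter; the default values below are commonly used settings, not the paper's tuned ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax: subtract a margin m from the target-class
    cosine similarity, then scale all logits by s before cross-entropy."""

    def __init__(self, embed_dim, num_classes, m=0.2, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.m, self.s = m, s

    def forward(self, embeddings, labels):
        # Cosine similarity between embeddings and class weight vectors.
        cos = F.normalize(embeddings, dim=-1) @ F.normalize(self.weight, dim=-1).T
        # Penalize only the target class by the additive margin m.
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)
```

Both m and s must be chosen by hand, and the paper finds performance varies considerably with those choices.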

Key findings from the experiments reveal that:

  • Vanilla triplet loss networks perform comparably to models trained with AM-Softmax across different architectures.
  • Metric learning objectives such as the prototypical and GE2E losses outperform classification approaches, with the proposed angular prototypical loss performing best across the board.
  • Larger batch sizes benefit metric learning approaches because a bigger batch yields a richer pool of challenging negative samples to mine (see the sketch below).
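
The batch-size effect can be illustrated with a toy hard-negative search: the larger the batch, the richer the pool of different-speaker candidates, so the most similar negative found for each anchor tends to be harder and more informative. A minimal sketch, assuming a batch of embeddings with integer speaker labels:

```python
import torch
import torch.nn.functional as F

def hardest_negative_similarity(embeddings, labels):
    """For each anchor, find its most similar different-speaker embedding.

    The candidate pool of negatives grows with the batch, so the hardest
    negative found here becomes harder as the batch size increases.
    """
    e = F.normalize(embeddings, dim=-1)
    sim = e @ e.T                                # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    sim = sim.masked_fill(same, float('-inf'))   # exclude same-speaker pairs
    return sim.max(dim=1).values                 # hardest negative per anchor
```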

Implications and Future Directions

The implications of this research are twofold, practical and theoretical. Practically, it demonstrates that metric learning is a robust alternative to classification objectives for open-set speaker recognition, especially in complex, large-scale settings. Theoretically, it challenges the community's prevailing assumption that classification-based approaches dominate, showing that carefully designed metric learning frameworks can outperform them.

The findings open avenues for future research on improving the robustness and efficiency of metric learning frameworks. Moreover, hybrid approaches that combine metric learning with classification objectives may further enhance speaker recognition systems. As deep learning models continue to advance, the methodologies proposed in this research may play a critical role in building systems that generalize well across the many applications involving speaker recognition.

Overall, this paper provides a comprehensive, well-substantiated analysis in favour of metric learning for speaker recognition, and it lays the groundwork for future exploration in this field.