Deep Speaker: An End-to-End Neural Speaker Embedding System
The paper presents Deep Speaker, an end-to-end neural speaker embedding system that maps utterances to a hypersphere on which speaker similarity is measured by cosine similarity. The resulting embeddings are applicable to multiple tasks, including speaker identification, verification, and clustering. Key contributions include a comparison of ResCNN and GRU architectures for extracting acoustic features, mean pooling to produce utterance-level embeddings, and a triplet loss based on cosine similarity.
Methodology
Deep Speaker departs from traditional i-vector and PLDA pipelines, whose stages are trained separately, by jointly optimizing the entire speaker-modeling process within a unified architecture. The paper also evaluates alternatives to conventional GMM-UBM and DNN front ends by incorporating modern deep learning structures.
Two architectural paradigms were evaluated:
- Residual Convolutional Neural Network (ResCNN): Built on ResNet principles, this structure stacks residual blocks to mitigate vanishing gradients and enable deeper training, capturing spectral variations and local correlations (a minimal block sketch follows this list).
- Gated Recurrent Unit (GRU) Network: A recurrent alternative that stacks GRU layers, which are effective at modeling the long-range temporal dependencies common in speech.
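To make the ResCNN idea concrete, here is a minimal sketch of one residual block in PyTorch. The channel count, kernel size, and clipped-ReLU ceiling are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a ResCNN-style residual block (assumed hyperparameters).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection, as in ResNet."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clipped ReLU (ceiling assumed to be 20) after each convolution
        out = torch.clamp(torch.relu(self.bn1(self.conv1(x))), max=20)
        out = self.bn2(self.conv2(out))
        return torch.clamp(torch.relu(out + x), max=20)  # skip connection
```

Stacking such blocks lets gradients flow through the identity path, which is what permits the deeper networks the paper relies on.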
The system architecture consists of frame-level feature extraction, temporal (mean) pooling, an affine transformation, and length normalization to generate utterance-level embeddings. The embeddings are optimized with a triplet loss that pulls same-speaker utterances together while pushing apart utterances from different speakers.
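A sketch of that utterance-level head and the triplet objective is shown below, assuming unit-normalized embeddings so that dot products equal cosine similarities. The loss takes the hinge form [s_an - s_ap + alpha]_+, where s denotes cosine similarity and alpha the margin; the embedding dimension and margin value here are illustrative assumptions.

```python
# A sketch of mean pooling, affine projection, length normalization,
# and a cosine triplet loss. Margin 0.1 is an assumed value.
import torch
import torch.nn.functional as F

def embed(frame_features: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """frame_features: (batch, time, feat) frame-level activations."""
    pooled = frame_features.mean(dim=1)        # temporal (mean) pooling
    projected = proj(pooled)                   # affine transformation
    return F.normalize(projected, p=2, dim=1)  # unit-length embedding

def cosine_triplet_loss(anchor, positive, negative, margin: float = 0.1):
    """Push s(anchor, positive) above s(anchor, negative) by `margin`."""
    s_ap = (anchor * positive).sum(dim=1)  # cosine sim: inputs are unit-norm
    s_an = (anchor * negative).sum(dim=1)
    return F.relu(s_an - s_ap + margin).mean()
```

Because the embeddings are length-normalized, cosine similarity reduces to a dot product, which keeps both training and scoring inexpensive.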
Experimental Evaluation
Deep Speaker was validated through experiments on Mandarin and English datasets covering both text-independent and text-dependent conditions. Equal Error Rate (EER) and identification accuracy were used to quantify improvements over baseline systems, in particular a DNN-based i-vector framework.
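For reference, EER is the operating point at which the false accept rate equals the false reject rate. A minimal sketch of computing it from verification trial scores with scikit-learn follows; `scores` and `labels` are hypothetical placeholders for trial cosine similarities and same-speaker indicators.

```python
# EER from verification scores; inputs are hypothetical placeholders.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the rate where false accepts equal false rejects."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # closest FAR == FRR point
    return float((fpr[idx] + fnr[idx]) / 2.0)
```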
Key findings include:
- A relative reduction of about 50% in verification EER and a relative improvement of about 60% in speaker identification accuracy on text-independent tasks, compared with the i-vector baseline.
- Softmax pre-training followed by triplet-loss fine-tuning yielded the best performance, underlining the value of initial categorical training before metric learning (a training skeleton follows this list).
- Both the ResCNN and GRU architectures scaled to large training sets, with performance improving consistently as training data grew.
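A hedged skeleton of the two-stage regimen from the second finding: cross-entropy pre-training over speaker labels, then triplet fine-tuning of the same trunk. The `network`, `classifier_head`, data loaders, and margin value are hypothetical placeholders, not the paper's exact setup.

```python
# Two-stage training skeleton; all names are hypothetical placeholders.
import torch
import torch.nn.functional as F

def pretrain_softmax(network, classifier_head, loader, optimizer):
    """Stage 1: speaker classification with cross-entropy."""
    for features, speaker_ids in loader:
        logits = classifier_head(network(features))
        loss = F.cross_entropy(logits, speaker_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def finetune_triplet(network, triplet_loader, optimizer, margin=0.1):
    """Stage 2: metric learning with the cosine triplet loss."""
    for anchor, positive, negative in triplet_loader:
        a = F.normalize(network(anchor), dim=1)
        p = F.normalize(network(positive), dim=1)
        n = F.normalize(network(negative), dim=1)
        s_ap = (a * p).sum(dim=1)
        s_an = (a * n).sum(dim=1)
        loss = F.relu(s_an - s_ap + margin).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The pre-training stage gives the trunk a discriminative starting point, so the triplet stage refines an already-structured embedding space rather than learning it from scratch.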
Implications and Future Work
These results are promising for speaker recognition: the architecture generalizes well to large datasets while avoiding many limitations of traditional pipelines. The experiments also suggest potential for cross-lingual transfer learning, as shown by strong performance on English data, indicating that the learned features abstract across linguistic domains.
Future work could focus on improving the model's computational and memory efficiency, which is essential for real-world deployment. Exploring additional architectures and loss functions could further improve embedding quality.
The paper opens avenues for applying end-to-end deep learning frameworks to diverse speaker recognition problems, promising more accurate and efficient solutions in various linguistic contexts.