- The paper introduces speaker-discriminative CNNs that robustly extract frame-level features even in noisy environments.
- The attention mechanism uses phonetic context to learn the weights for aggregating frame-level speaker features.
- An end-to-end joint optimization directly maps utterances to verification scores, reducing equal error rates.
End-to-End Attention Based Text-Dependent Speaker Verification
The paper presents an end-to-end attention-based approach to speaker verification, tailored for text-dependent scenarios. Unlike previous methods that employed DNNs as indiscriminate feature extractors, the authors train speaker-discriminative CNNs that remain robust under noise. These CNNs extract frame-level features, which an attention mechanism then aggregates into utterance-level speaker vectors. The attention weights draw on both phonetic and speaker-discriminative information, which improves the accuracy of the resulting verification system.
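As a rough illustration of the first stage, the sketch below runs a log-mel feature matrix through a single convolutional layer to produce frame-level features. All sizes (40 mel bins, 64 output channels, kernel width 5) and the random weights are hypothetical, not the paper's architecture; a trained system would learn the filters with a speaker-discriminative criterion.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv1d(x, w, b):
    """Valid 1-D convolution over the time (frame) axis.
    x: (in_ch, T), w: (out_ch, in_ch, k), b: (out_ch,)."""
    out_ch, in_ch, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((out_ch, t_out))
    for t in range(t_out):
        patch = x[:, t:t + k]  # (in_ch, k) window of frames
        y[:, t] = np.tensordot(w, patch, axes=([1, 2], [0, 1])) + b
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((40, 100))            # 40 mel bins, 100 frames
w1 = rng.standard_normal((64, 40, 5)) * 0.1   # hypothetical filter bank
b1 = np.zeros(64)
frame_feats = relu(conv1d(x, w1, b1))         # (64, 96): one 64-dim feature per frame
print(frame_feats.shape)  # prints (64, 96)
```

Each column of `frame_feats` is a frame-level speaker feature; the attention mechanism described below decides how much each frame contributes to the utterance-level vector.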
Core Contributions
- Speaker Discriminative CNNs: The paper introduces CNNs that extract frame-level speaker features robustly, particularly under noisy conditions. This departs from past systems that aggregated frame features indiscriminately (for example, by uniform averaging), which often yields suboptimal performance in noise-prone environments. CNNs are chosen over LSTMs because of their strong results in speech recognition tasks, particularly their ability to model local frequency patterns and spatial hierarchies in audio signals.
- Attention Mechanism: The proposed attention model integrates contextually relevant phonetic information while learning to weight frame-level features by their speaker-discriminative value. Because the combination weights are learned from data rather than fixed, the resulting speaker vectors are more robust and more specific to the speaker.
- End-to-End Joint Optimization: Diverging from traditional approaches that train models in isolated, sequential stages, the proposed system optimizes both CNN and attention mechanisms under a unified end-to-end criterion. This holistic training strategy ensures the system is finely tuned to map test and target utterances directly into verification scores, enhancing real-world applicability.
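The second and third contributions can be sketched together: attention-weighted pooling turns a sequence of frame-level features into one speaker vector, and a similarity score between an enrollment vector and a test vector serves as the verification output. The attention parameters `W` and `v` and the choice of cosine scoring are illustrative assumptions here (random weights, no training loop); in the actual system all parameters are optimized jointly under the end-to-end criterion.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_pool(frames, W, v):
    """Aggregate frame-level features into an utterance-level speaker
    vector. frames: (T, d); W: (d_att, d) and v: (d_att,) are learned
    in the real system, random here for illustration."""
    scores = v @ np.tanh(W @ frames.T)  # (T,) unnormalized attention
    alpha = softmax(scores)             # combination weights, sum to 1
    return alpha @ frames               # (d,) weighted average of frames

def verification_score(enroll_vec, test_vec):
    """Cosine similarity between speaker vectors; end-to-end training
    tunes all upstream parameters so this score separates target
    trials from impostor trials."""
    return enroll_vec @ test_vec / (
        np.linalg.norm(enroll_vec) * np.linalg.norm(test_vec))

rng = np.random.default_rng(1)
d, d_att, T = 64, 32, 96                       # hypothetical sizes
W = rng.standard_normal((d_att, d))
v = rng.standard_normal(d_att)
enroll = attention_pool(rng.standard_normal((T, d)), W, v)
test = attention_pool(rng.standard_normal((T, d)), W, v)
score = verification_score(enroll, test)       # a value in [-1, 1]
```

A training loop would backpropagate a verification loss on `score` through both the pooling and the CNN, which is precisely what distinguishes the joint end-to-end setup from stage-wise training.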
Experimental Validation
The methodology is evaluated on the Windows 10 "Hey Cortana" task, where it demonstrates tangible improvements over established systems such as GMM-UBM and i-vector/PLDA. The end-to-end system achieves a notable reduction in equal error rate, underscoring its practical applicability.
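The equal error rate used above is the operating point at which the false-accept rate equals the false-reject rate. A minimal sketch of how it is computed from lists of target and impostor scores (the score values below are made up for illustration):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the threshold where the false-accept rate (impostors
    accepted) meets the false-reject rate (targets rejected)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    frr = np.array([np.mean(target_scores < t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))   # closest crossing point
    return (far[i] + frr[i]) / 2.0

targets = np.array([0.9, 0.8, 0.75, 0.6, 0.55])    # genuine-trial scores
impostors = np.array([0.5, 0.4, 0.65, 0.3, 0.2])   # impostor-trial scores
eer = equal_error_rate(targets, impostors)
print(eer)  # prints 0.2
```

Lower EER means target and impostor score distributions overlap less, which is the improvement the end-to-end system delivers over the GMM-UBM and i-vector/PLDA baselines.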
Implications and Future Directions
The research strengthens the viability of deep learning models in precise speaker verification, particularly in environments where specific phrases or keywords are used for authentication. The integration of attention mechanisms presents a pathway for future systems, especially in text-independent contexts where phonetic variation poses greater challenges.
The implications of this research extend beyond immediate technological advancements. It fosters increased reliability in voice-activated systems and contributes to securing sensitive voice-driven applications. As AI continues to permeate daily life, methodologies such as the one proposed serve the dual goals of enhanced security and seamless user experience.
Future advancements may further explore optimization algorithms that dynamically select impostors based on nuanced speaker characteristics, paving the way for robust, scalable systems tailored for diverse linguistic contexts. There is also potential to adapt these mechanisms for other biometric validation scenarios, further enhancing security measures within AI-driven environments.