Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models (2401.12440v1)
Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.
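The core idea, a lightweight network that maps a runtime-model embedding into the enrollment model's embedding space so the two can be compared with cosine scoring, can be sketched as follows. This is a minimal illustration only: the dimensions, the two-layer ReLU MLP, and all weights are assumptions for demonstration, not the architecture or trained parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding dimensions (not from the paper).
D_ENROLL, D_RUNTIME, D_HIDDEN = 128, 64, 256

# Randomly initialized stand-in for the lightweight alignment network;
# in practice this would be trained (e.g., with a contrastive objective)
# on pairs of embeddings from the two independent models.
W1 = rng.standard_normal((D_RUNTIME, D_HIDDEN)) * 0.05
b1 = np.zeros(D_HIDDEN)
W2 = rng.standard_normal((D_HIDDEN, D_ENROLL)) * 0.05
b2 = np.zeros(D_ENROLL)

def align(runtime_emb: np.ndarray) -> np.ndarray:
    """Map a runtime-model embedding into the enrollment embedding space."""
    h = np.maximum(runtime_emb @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity used as the verification score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Offline voice profile from the enrollment model, and an online
# utterance embedding from the (independently updated) runtime model.
profile = rng.standard_normal(D_ENROLL)
runtime = rng.standard_normal(D_RUNTIME)

score = cosine(profile, align(runtime))  # scored in the shared space
```

Because only the small alignment network must be retrained when one of the two embedding models changes, either model can be updated without re-enrolling users.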