Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models (2401.12440v1)

Published 23 Jan 2024 in eess.AS, cs.LG, and cs.SD

Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.
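The abstract's core idea, a lightweight network that maps embeddings from one model into the other model's speaker embedding space so the two can be scored together, can be sketched as follows. This is a minimal illustration only: the layer sizes, dimensions, and names below are assumptions for the sketch, not the authors' NESSA architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed embedding dimensions (illustrative, not from the paper).
D_RUN, D_ENROLL, HIDDEN = 128, 256, 512

# Hypothetical alignment MLP mapping runtime embeddings into the
# enrollment model's embedding space (weights random for the sketch;
# in practice they would be trained, e.g. with a contrastive loss).
W1 = rng.standard_normal((D_RUN, HIDDEN)) * 0.05
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, D_ENROLL)) * 0.05
b2 = np.zeros(D_ENROLL)

def align(e_runtime):
    """Map a runtime embedding into the enrollment embedding space."""
    h = np.maximum(e_runtime @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Verification step: score an offline enrollment profile against an
# aligned runtime embedding from a different (updated) model.
profile = rng.standard_normal(D_ENROLL)  # offline enrollment embedding
runtime = rng.standard_normal(D_RUN)     # online runtime embedding
score = cosine(profile, align(runtime))
assert -1.0 <= score <= 1.0
```

Because only the small alignment network depends on both models, either the enrollment or the runtime model can be swapped out independently and the mapping retrained, which is the asymmetric-update scenario the abstract describes.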
