Version Control of Speaker Recognition Systems (2007.12069v8)

Published 23 Jul 2020 in eess.AS, cs.DC, cs.NI, and cs.SE

Abstract: This paper discusses one of the most challenging practical engineering problems in speaker recognition systems - the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio; and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, version mismatch will lead to meaningless recognition results. In this paper, we describe different version control strategies for speaker recognition systems that have been carefully studied at Google over years of engineering practice. These strategies are categorized into three groups according to how they are deployed in the production environment: device-side deployment, server-side deployment, and hybrid deployment. To compare different strategies with quantitative metrics under various network configurations, we present SpeakerVerSim, an easily extensible Python-based simulation framework for different server-side deployment strategies of speaker recognition systems.
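
As a concrete illustration of the version-mismatch failure the abstract describes, the sketch below shows one way a runtime scorer could refuse to compare a runtime embedding against a profile produced by a different model version. This is a minimal, hypothetical example, not the paper's or SpeakerVerSim's implementation; the names (SpeakerProfile, verify, CURRENT_MODEL_VERSION), the cosine-similarity scoring, and the threshold are all illustrative assumptions.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical version of the speaker embedding model currently deployed.
CURRENT_MODEL_VERSION = 8


@dataclass
class SpeakerProfile:
    """A stored user profile: the enrollment embedding plus the model version that produced it."""
    embedding: np.ndarray
    model_version: int


def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(profile: SpeakerProfile, runtime_embedding: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Score a runtime embedding against a stored profile.

    Embeddings from different model versions live in different spaces, so
    comparing them is meaningless; instead of scoring, signal that the
    profile must be regenerated (e.g. from retained enrollment audio).
    """
    if profile.model_version != CURRENT_MODEL_VERSION:
        raise ValueError(
            f"Version mismatch: profile v{profile.model_version} vs "
            f"model v{CURRENT_MODEL_VERSION}; re-enrollment or profile "
            "migration is required before scoring."
        )
    return cosine_score(profile.embedding, runtime_embedding) >= threshold
```

The deployment strategies studied in the paper differ mainly in where such a check runs and how mismatched profiles are regenerated: on the device, on the server, or in a hybrid of both.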
