
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix (2405.10786v1)

Published 17 May 2024 in eess.AS

Abstract: Speaker anonymization is an effective privacy protection solution that aims to conceal the speaker's identity while preserving the naturalness and distinctiveness of the original speech. Mainstream approaches use an utterance-level vector from a pre-trained automatic speaker verification (ASV) model to represent speaker identity, which is then averaged or modified for anonymization. However, these systems suffer from deterioration in the naturalness of anonymized speech, degradation in speaker distinctiveness, and severe privacy leakage against powerful attackers. To address these issues, and especially to generate more natural and distinctive anonymized speech, we propose a novel speaker anonymization approach that models a matrix related to speaker identity and transforms it into an anonymized singular value transformation-assisted matrix to conceal the original speaker identity. Our approach extracts frame-level speaker vectors from a pre-trained ASV model and employs an attention mechanism to create a speaker-score matrix and speaker-related tokens. Notably, the speaker-score matrix acts as the weights for the corresponding speaker-related tokens, representing the speaker's identity. The singular value transformation-assisted matrix is generated by decomposing the speaker-identity matrix through Singular Value Decomposition (SVD) and recomposing the orthonormal eigenvector matrices with non-linearly transformed singular values. Experiments on VoicePrivacy Challenge datasets demonstrate the effectiveness of our approach in protecting speaker privacy under all attack scenarios while maintaining speech naturalness and distinctiveness.
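At a high level, the recomposition step the abstract describes amounts to taking the SVD of the identity-related matrix, passing the singular values through a non-linear transform, and multiplying the factors back together. A minimal NumPy sketch follows; the matrix shapes, the power-law transform, the alpha parameter, and the function name are illustrative assumptions, since this excerpt does not specify the paper's exact non-linearity.

    import numpy as np

    def svd_transform_matrix(identity_matrix: np.ndarray,
                             alpha: float = 0.8) -> np.ndarray:
        """Recompose an identity-related matrix after transforming its
        singular values. The power-law transform is an assumption; the
        paper's exact non-linearity is not given in this excerpt."""
        # Decompose into orthonormal factors and singular values.
        U, s, Vt = np.linalg.svd(identity_matrix, full_matrices=False)
        # Apply an assumed non-linear transform (power-law compression)
        # to the singular values; this is where identity is concealed.
        s_anon = np.power(s, alpha)
        # Recompose from the original orthonormal matrices and the
        # transformed singular values.
        return U @ np.diag(s_anon) @ Vt

    # Toy usage with a random stand-in for the speaker-score-weighted
    # token matrix (frames x embedding dimension; shapes are assumed).
    rng = np.random.default_rng(0)
    M = rng.standard_normal((32, 192))
    M_anon = svd_transform_matrix(M)
    print(M.shape, M_anon.shape)  # (32, 192) (32, 192)

Because U and Vt are kept intact, the recomposed matrix preserves the subspace structure of the original, which plausibly supports naturalness and distinctiveness, while rescaling the singular values perturbs the identity-carrying energy distribution.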

Authors (5)
  1. Jixun Yao (35 papers)
  2. Qing Wang (341 papers)
  3. Pengcheng Guo (55 papers)
  4. Ziqian Ning (15 papers)
  5. Lei Xie (337 papers)
Citations (5)
