Personalized Neural Speech Codec (2404.00791v1)

Published 31 Mar 2024 in cs.SD and eess.AS

Abstract: In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce the model complexity or improve perceptual speech quality. Despite the common usage of speech codecs where only a single talker is involved on each side of the communication, personalizing a codec for the specific user has rarely been explored in the literature. First, we assume speakers can be grouped into smaller subsets based on their perceptual similarity. Then, we also postulate that a group-specific codec can focus on the group's speech characteristics to improve its perceptual quality and computational efficiency. To this end, we first develop a Siamese network that learns the speaker embeddings from the LibriSpeech dataset, which are then grouped into underlying speaker clusters. Finally, we retrain the LPCNet-based speech codec baselines on each of the speaker clusters. Subjective listening tests show that the proposed personalization scheme introduces model compression while maintaining speech quality. In other words, with the same model complexity, personalized codecs produce better speech quality.
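
The abstract outlines a three-stage pipeline: learn speaker embeddings with a Siamese network trained on LibriSpeech, group speakers into clusters by perceptual similarity, and retrain an LPCNet-based codec on each cluster. The sketch below illustrates that flow under stated assumptions; the encoder architecture, the contrastive loss, and the number of clusters are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of the personalization pipeline described in the abstract:
# (1) learn speaker embeddings with a Siamese network, (2) cluster speakers,
# (3) pick a per-cluster codec model to retrain on that group's utterances.
# The architecture, loss, and cluster count below are assumptions for
# illustration, not the authors' exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans


class SpeakerEncoder(nn.Module):
    """Maps a log-mel spectrogram (batch, frames, n_mels) to a unit-norm embedding."""

    def __init__(self, n_mels: int = 80, emb_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(x)                       # final hidden state summarizes the utterance
        return F.normalize(self.proj(h[-1]), dim=-1)


def contrastive_loss(e1, e2, same_speaker, margin: float = 1.0):
    """Classic Siamese contrastive loss: pull same-speaker pairs together,
    push different-speaker pairs at least `margin` apart."""
    d = (e1 - e2).norm(dim=-1)
    return (same_speaker * d.pow(2)
            + (1 - same_speaker) * F.relu(margin - d).pow(2)).mean()


def cluster_speakers(speaker_embeddings: torch.Tensor, n_groups: int = 4):
    """Group speakers by similarity of their (averaged) embeddings."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
    return km.fit_predict(speaker_embeddings.cpu().numpy())
```

In such a setup, a speaker's cluster index would select which copy of the codec (for example, a reduced-complexity LPCNet decoder) is fine-tuned and deployed for that group, realizing the trade-off the abstract describes between model complexity and perceptual quality.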

References (26)
  1. “WaveNet: A Generative Model for Raw Audio,” in Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, p. 125.
  2. “WaveNet based low rate speech coding,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 676–680.
  3. C. Garbacea, A. van den Oord, and Y. Li, “Low bit-rate speech coding with VQ-VAE and a WaveNet decoder,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
  4. J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
  5. “Efficient neural audio synthesis,” in Proc. of the International Conference on Machine Learning (ICML), 2018, vol. 80, pp. 2410–2419.
  6. J.-M. Valin and J. Skoglund, “A real-time wideband neural vocoder at 1.6 kb/s using LPCNet,” in Proc. Interspeech, 2019.
  7. “End-to-End LPCNet: A neural vocoder with fully-differentiable LPC estimation,” arXiv preprint arXiv:2202.11301, 2022.
  8. “Neural speech synthesis on a shoestring: Improving the efficiency of LPCNet,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022.
  9. “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, pp. 495–507, Jan. 2022.
  10. “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
  11. “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), 2017.
  12. “Latent-domain predictive neural speech coding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2111–2123, 2023.
  13. “High-fidelity audio compression with improved RVQGAN,” arXiv preprint arXiv:2306.06546, 2023.
  14. “PostGAN: A GAN-based post-processor to enhance the quality of coded speech,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022, pp. 831–835.
  15. A. Sivaraman and M. Kim, “Efficient Personalized Speech Enhancement Through Self-Supervised Learning,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1342–1356, 2022.
  16. A. Sivaraman and M. Kim, “Zero-shot personalized speech enhancement through speaker-informed model selection,” in Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
  17. S. Kim and M. Kim, “Test-time adaptation toward personalized speech enhancement: Zero-shot learning with knowledge distillation,” in Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
  18. “Fast real-time personalized speech enhancement: End-to-end enhancement network (E3Net) and knowledge distillation,” arXiv preprint arXiv:2204.00771, 2022.
  19. ISO/IEC DIS 23003-3, “Information technology – MPEG audio technologies – part 3: Unified speech and audio coding,” 2011.
  20. ETSI TS 126 445 V13.2.0, “Universal Mobile Telecommunications System (UMTS); LTE; Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (3GPP TS 26.445 version 13.2.0 Release 13),” 2016.
  21. “Librispeech: An ASR corpus based on public domain audio books,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 5206–5210.
  22. “Signature verification using a ‘Siamese’ time delay neural network,” in Advances in Neural Information Processing Systems (NIPS), 1994, pp. 737–744.
  23. D. Chicco, Siamese Neural Networks: An Overview, pp. 73–94, Springer US, New York, NY, 2021.
  24. “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  25. D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of the International Conference on Learning Representations (ICLR), 2015.
  26. ITU-R Recommendation BS.1534-3, “Method for the subjective assessment of intermediate quality levels of coding systems (MUSHRA),” 2015.