
HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks (2404.04645v1)

Published 6 Apr 2024 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have seen significant improvements, performance on out-of-domain speakers remains severely limited. Adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters provide a parameter-efficient alternative for such domain adaptation; although well established in NLP, they have so far seen little adoption in speech synthesis. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition the Adapters on speaker representations and make them dynamic. Extensive evaluations across two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in several studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
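To make the core idea concrete, below is a minimal sketch of a speaker-conditioned adapter whose weights come from a hypernetwork, in the spirit of the abstract: a frozen TTS backbone produces hidden states, and a small hypernetwork maps a speaker embedding to the adapter's projection weights, so the adapter becomes dynamic per speaker. This is not the authors' implementation (see the linked repository for that); the bottleneck-adapter layout, the two-layer MLP hypernetwork, and all names and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Bottleneck adapter whose weights are generated by a hypernetwork
    conditioned on a speaker embedding. A sketch of the HyperTTS idea,
    not the paper's exact architecture."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int, speaker_dim: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.bottleneck_dim = bottleneck_dim
        # Hypernetwork: maps a speaker embedding to the flattened
        # down- and up-projection weights of the adapter.
        n_params = 2 * hidden_dim * bottleneck_dim
        self.hypernet = nn.Sequential(
            nn.Linear(speaker_dim, 128),  # hidden width chosen arbitrarily
            nn.ReLU(),
            nn.Linear(128, n_params),
        )

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # Generate this speaker's adapter weights on the fly.
        params = self.hypernet(speaker_emb)  # shape: (n_params,)
        w_down, w_up = params.split(self.hidden_dim * self.bottleneck_dim)
        w_down = w_down.view(self.bottleneck_dim, self.hidden_dim)
        w_up = w_up.view(self.hidden_dim, self.bottleneck_dim)
        # Standard adapter computation: down-project, nonlinearity,
        # up-project, residual connection back into the backbone stream.
        h = torch.relu(x @ w_down.T)
        return x + h @ w_up.T

# Usage: adapt a frozen backbone's hidden states for one speaker.
adapter = HyperAdapter(hidden_dim=256, bottleneck_dim=32, speaker_dim=64)
x = torch.randn(8, 100, 256)   # (batch, frames, hidden) from the backbone
spk = torch.randn(64)          # a speaker representation
out = adapter(x, spk)
print(out.shape)               # torch.Size([8, 100, 256])
```

The parameter-efficiency argument follows from this structure: only the hypernetwork is trained, the backbone stays frozen, and a single hypernetwork serves all speakers because the speaker embedding, not a per-speaker weight set, carries the adaptation.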

Authors (5)
  1. Yingting Li (8 papers)
  2. Rishabh Bhardwaj (30 papers)
  3. Ambuj Mehrish (15 papers)
  4. Bo Cheng (51 papers)
  5. Soujanya Poria (138 papers)
Citations (1)

