Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement (2401.04511v1)

Published 9 Jan 2024 in eess.AS, cs.LG, and cs.SD

Abstract: The problem of audio-to-audio (A2A) style transfer involves replacing the style features of the source audio with those from the target audio while preserving the content-related attributes of the source audio. In this paper, we propose an efficient approach, termed Zero-shot Emotion Style Transfer (ZEST), that replaces the emotional content of the source audio with that of the target audio while retaining the speaker and speech content from the source. The proposed system builds on decomposing speech into semantic tokens, speaker representations, and emotion embeddings. Using these factors, we propose a framework to reconstruct the pitch contour of the given speech signal and train a decoder that reconstructs the speech signal. The model is trained with a self-supervised reconstruction loss. During conversion, the emotion embedding alone is derived from the target audio, while the remaining factors are derived from the source audio. In our experiments, even without parallel training data or labels for the source or target audio, we demonstrate the zero-shot emotion transfer capabilities of the proposed ZEST model using objective and subjective quality evaluations.
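
The abstract describes a factor-and-swap scheme: speech is decomposed into content, speaker, and emotion factors, a decoder is trained to reconstruct the source from its own factors, and at conversion time only the emotion factor is taken from the target. The sketch below illustrates that control flow under stated assumptions; the encoder and decoder modules here are hypothetical stand-ins, not the authors' implementation, which relies on pretrained components (semantic tokens, a speaker-verification embedding, an emotion encoder) and a separate pitch-contour predictor.

```python
# Minimal sketch of the factor-and-swap idea described in the abstract.
# TinyEncoder and Decoder are hypothetical placeholders for the pretrained
# content/speaker/emotion encoders and the vocoder-style decoder used in ZEST.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Maps a waveform to a fixed-size embedding (stand-in for a pretrained encoder)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, dim, kernel_size=400, stride=160), nn.ReLU())

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        h = self.net(wav.unsqueeze(1))   # (B, dim, T')
        return h.mean(dim=-1)            # temporal pooling -> (B, dim)


class Decoder(nn.Module):
    """Reconstructs a waveform from concatenated content, speaker, and emotion factors."""
    def __init__(self, dim: int, out_len: int):
        super().__init__()
        self.proj = nn.Linear(3 * dim, out_len)

    def forward(self, content, speaker, emotion):
        return self.proj(torch.cat([content, speaker, emotion], dim=-1))


dim, wav_len = 64, 16000
content_enc, speaker_enc, emotion_enc = TinyEncoder(dim), TinyEncoder(dim), TinyEncoder(dim)
decoder = Decoder(dim, wav_len)

# Self-supervised training step: reconstruct the source audio from its own factors.
src = torch.randn(2, wav_len)
recon = decoder(content_enc(src), speaker_enc(src), emotion_enc(src))
loss = nn.functional.l1_loss(recon, src)
loss.backward()

# Zero-shot conversion: content and speaker factors come from the source,
# and only the emotion embedding is taken from the target audio.
tgt = torch.randn(2, wav_len)
with torch.no_grad():
    converted = decoder(content_enc(src), speaker_enc(src), emotion_enc(tgt))
print(converted.shape)  # torch.Size([2, 16000])
```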
