Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement (2401.04511v1)
Abstract: The problem of audio-to-audio (A2A) style transfer involves replacing the style attributes of the source audio with those of the target audio while preserving the content-related attributes of the source. In this paper, we propose an efficient approach, termed Zero-shot Emotion Style Transfer (ZEST), that replaces the emotional content of a given source audio with that of a target audio while retaining the speaker identity and speech content of the source. The proposed system decomposes speech into semantic tokens, speaker representations and emotion embeddings. Using these factors, we propose a framework to reconstruct the pitch contour of the given speech signal and train a decoder that reconstructs the speech signal. The model is trained with a self-supervised reconstruction loss. During conversion, only the emotion embedding is derived from the target audio, while the remaining factors are derived from the source audio. In our experiments, even without parallel training data or emotion labels for the source or target audio, we demonstrate the zero-shot emotion transfer capabilities of the proposed ZEST model through objective and subjective quality evaluations.
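The abstract describes a factorized conversion pipeline: the source audio supplies the semantic (content) tokens and the speaker representation, the target audio supplies only the emotion embedding, and a pitch predictor plus a decoder resynthesize the converted speech. The PyTorch sketch below illustrates that conversion-time wiring only; the encoder, pitch-predictor and decoder modules and their interfaces are hypothetical stand-ins assumed for illustration, not the authors' implementation.

```python
# Minimal sketch of ZEST-style zero-shot emotion transfer at conversion time.
# All module classes, names and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn


class EmotionTransfer(nn.Module):
    def __init__(self, content_encoder, speaker_encoder, emotion_encoder,
                 pitch_predictor, decoder):
        super().__init__()
        self.content_encoder = content_encoder    # semantic/content tokens
        self.speaker_encoder = speaker_encoder    # fixed-dim speaker embedding
        self.emotion_encoder = emotion_encoder    # fixed-dim emotion embedding
        self.pitch_predictor = pitch_predictor    # F0 contour from the factors
        self.decoder = decoder                    # waveform reconstruction

    @torch.no_grad()
    def convert(self, source_wav, target_wav):
        """Transfer the target utterance's emotion onto the source utterance."""
        # Content and speaker factors come from the SOURCE audio.
        content = self.content_encoder(source_wav)            # (B, T, D_c)
        speaker = self.speaker_encoder(source_wav)             # (B, D_s)
        # Only the emotion embedding comes from the TARGET audio.
        emotion = self.emotion_encoder(target_wav)              # (B, D_e)
        # The pitch contour is re-predicted from the mixed factors, so the
        # prosody follows the target emotion while content and speaker
        # identity stay with the source.
        f0 = self.pitch_predictor(content, speaker, emotion)    # (B, T)
        return self.decoder(content, speaker, emotion, f0)      # (B, samples)
```

During training, all factors would be drawn from the same utterance and the output compared against the input with a self-supervised reconstruction loss; zero-shot conversion then amounts to feeding the emotion encoder a different (target) utterance, as in `convert` above.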