Unified speech and gesture synthesis using flow matching (2310.05181v2)
Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in a single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks. Please see https://shivammehta25.github.io/Match-TTSG/ for video examples and code.
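The abstract names OT-CFM as the training objective and describes modelling speech acoustics and gesture motion jointly in a single generative process. The sketch below illustrates how such a training loss might look; it is a minimal reconstruction, not the authors' released implementation. The decoder `vector_field`, the conditioning tensor `text_cond`, the feature dimensions, and `SIGMA_MIN` are assumed placeholders; the interpolation path and target velocity follow the standard OT-CFM formulation of Lipman et al. (2023).

```python
# Minimal OT-CFM training sketch for joint speech-and-gesture synthesis.
# Illustrative only: `vector_field`, `text_cond`, and the feature layout
# are hypothetical stand-ins, not the paper's actual architecture.

import torch
import torch.nn.functional as F

SIGMA_MIN = 1e-4  # assumed small variance floor for the OT-CFM path


def ot_cfm_loss(vector_field, text_cond, mel, motion):
    """One training step of optimal-transport conditional flow matching.

    mel:    (B, T, D_mel)   target mel-spectrogram frames
    motion: (B, T, D_pose)  target skeleton pose features, frame-aligned with mel
    """
    # Stack acoustics and pose features per frame so that a single flow
    # models their joint distribution, as the abstract describes.
    x1 = torch.cat([mel, motion], dim=-1)                 # (B, T, D_mel + D_pose)
    x0 = torch.randn_like(x1)                             # sample from the noise prior
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)    # per-example time in [0, 1]

    # OT-CFM linear interpolation path and its time-constant target velocity.
    xt = (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1
    ut = x1 - (1.0 - SIGMA_MIN) * x0

    # The network regresses the conditional vector field given text conditioning.
    vt = vector_field(xt, t.view(-1), text_cond)
    return F.mse_loss(vt, ut)
```

At synthesis time, the learned vector field would be integrated with an ODE solver from noise to data, which is consistent with the abstract's claim that good quality is reached in far fewer network evaluations than with earlier diffusion-based training.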