Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis (2404.19622v1)

Published 30 Apr 2024 in cs.HC, cs.CV, cs.GR, cs.SD, and eess.AS

Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, the joint and unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage: simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data. See https://shivammehta25.github.io/MAGI/ for example output.
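The data-generation idea in the abstract can be sketched as a simple pipeline: unimodal models, each trained on a large single-modality corpus, are chained to fabricate parallel (text, speech, gesture) triples, which then serve as pre-training material for a joint model. The sketch below is purely illustrative — the model classes, method names, and data shapes are hypothetical stand-ins, not the authors' actual implementation.

```python
# Illustrative sketch of synthetic multimodal data creation.
# `UnimodalTTS` and `UnimodalGestureModel` are hypothetical stand-ins for
# large pre-trained single-modality synthesisers; real systems would emit
# waveforms and 3D joint rotations rather than these toy lists.

class UnimodalTTS:
    """Stand-in for a text-to-speech model trained on a large speech corpus."""

    def synthesise(self, text: str) -> list:
        # Toy "waveform": one deterministic sample per character.
        return [ord(c) % 100 / 100.0 for c in text]


class UnimodalGestureModel:
    """Stand-in for a speech-driven 3D gesture generator."""

    def synthesise(self, audio: list) -> list:
        # Toy pose sequence: one 3-DoF "joint rotation" per audio frame.
        return [[a, a / 2.0, -a] for a in audio]


def make_synthetic_corpus(texts, tts, gesture_model):
    """Create multimodal (but synthetic) parallel training triples."""
    corpus = []
    for text in texts:
        audio = tts.synthesise(text)              # text -> synthetic speech
        motion = gesture_model.synthesise(audio)  # speech -> synthetic gesture
        corpus.append({"text": text, "audio": audio, "motion": motion})
    return corpus


texts = ["hello there", "nice to meet you"]
corpus = make_synthetic_corpus(texts, UnimodalTTS(), UnimodalGestureModel())
# The joint speech-and-gesture model would be pre-trained on `corpus`
# and then fine-tuned on the small amount of real parallel data.
```

The key property the sketch captures is that the synthetic triples are parallel by construction (the motion is generated from the very audio it accompanies), which is exactly what small real multimodal datasets lack at scale.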

