Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model (2404.01862v1)
Abstract: Co-speech gestures, when presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. Whereas most previous works generate structural human skeletons and thereby omit appearance information, this work focuses on the direct generation of audio-driven co-speech gesture videos. This poses two main challenges: 1) a suitable motion feature is needed to describe complex human movements while retaining crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned, even for sequences of arbitrary length. To address these problems, we present a novel motion-decoupled framework for generating co-speech gesture videos. Specifically, we first introduce a carefully designed nonlinear thin-plate spline (TPS) transformation to obtain latent motion features that preserve essential appearance information. We then propose a transformer-based diffusion model that learns the temporal correlation between gestures and speech and generates samples in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network that restores missing details in certain areas. Extensive experimental results show that the proposed framework significantly outperforms existing approaches in both motion- and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
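To make the generation step concrete, here is a minimal sketch of diffusion in a latent motion space with a transformer denoiser conditioned on speech features. Everything in it is an illustrative assumption rather than the paper's implementation: the class `LatentMotionDenoiser`, the feature dimensions, the additive audio-plus-timestep conditioning, and the DDPM-style noise-prediction loss are placeholders that show the general pattern the abstract describes.

```python
# Minimal sketch (assumed names/dims, not the authors' code): a transformer
# denoiser that runs DDPM-style diffusion in a latent motion space, conditioned
# on per-frame speech features (e.g., WavLM embeddings).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMotionDenoiser(nn.Module):
    def __init__(self, motion_dim=128, audio_dim=768, d_model=256, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.time_emb = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        # noisy_motion: (B, T, motion_dim); audio_feats: (B, T, audio_dim);
        # t: (B,) integer diffusion steps. Audio and timestep embeddings are
        # simply added to the motion tokens (one of many conditioning choices).
        h = self.motion_in(noisy_motion) + self.audio_in(audio_feats)
        h = h + self.time_emb(t.float().view(-1, 1)).unsqueeze(1)
        return self.motion_out(self.encoder(h))  # predicted noise

def training_step(model, motion, audio, alphas_cumprod):
    # Standard DDPM objective: corrupt the clean latent motion at a random
    # step t, then regress the injected noise.
    B = motion.size(0)
    t = torch.randint(0, len(alphas_cumprod), (B,))
    noise = torch.randn_like(motion)
    a = alphas_cumprod[t].view(B, 1, 1)
    noisy = a.sqrt() * motion + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, audio, t), noise)

# Toy usage with random tensors and a stand-in noise schedule:
model = LatentMotionDenoiser()
schedule = torch.linspace(0.9999, 0.01, 1000)
loss = training_step(model, torch.randn(2, 50, 128),
                     torch.randn(2, 50, 768), schedule)
```

In the full pipeline described above, the sampled latent motion would be decoded back to pixels through the TPS-based motion representation, with the optimal motion selection module and the refinement network applied downstream.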
Authors: Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu