
Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts (2410.23836v1)

Published 31 Oct 2024 in cs.CV

Abstract: This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The system follows a two-stage approach. In the first stage, it maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, LLM priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, improving the stability and accuracy of the masks and enabling mask guidance during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures with detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
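
The core architectural idea in the second stage is a Mixture-of-Experts layer whose routing is conditioned on an external prior (a viewpoint embedding or a region mask) rather than on token content alone. The following is a minimal, illustrative PyTorch sketch of how such prior-guided routing could look; the class and parameter names (PriorGuidedMoE, num_experts, prior_dim, top_k) are hypothetical and are not taken from the paper or its released code.

```python
# Hedged sketch of a prior-guided Mixture-of-Experts layer. This is an assumption
# about the general mechanism described in the abstract, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorGuidedMoE(nn.Module):
    """Routes per-token features to expert MLPs, with gating conditioned on an
    external prior (e.g. a viewpoint embedding or a region-mask embedding)."""

    def __init__(self, dim: int, prior_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        # The gate sees both the token features and the prior, so experts can
        # specialize by view / region rather than by content alone.
        self.gate = nn.Linear(dim + prior_dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token features; prior: (B, prior_dim) conditioning vector.
        B, N, _ = x.shape
        gate_in = torch.cat([x, prior.unsqueeze(1).expand(B, N, -1)], dim=-1)
        logits = self.gate(gate_in)                      # (B, N, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # sparse top-k routing
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[..., k] == e                   # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[..., k][sel].unsqueeze(-1) * expert(x[sel])
        return out
```

In this reading, the view-guided MoE would pass a camera/viewpoint embedding as the prior and the mask-guided MoE a per-region mask embedding, letting different experts handle view-specific appearance and different body regions within the same diffusion backbone.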

