Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts (2410.23836v1)
Abstract: This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, LLM priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, enhancing the stability and accuracy of the masks and enabling mask guidance during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
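The prior-guided MoE is the central architectural addition in the second stage. As a rough illustration of how such a mechanism could be wired into a diffusion backbone, the sketch below (a minimal assumption on our part, not the paper's released implementation; names such as `PriorGuidedMoE`, `feat_dim`, and `prior_dim` are hypothetical) gates several expert MLPs with a prior embedding, e.g. an encoded camera view for the view-guided variant or a human-region-mask embedding for the mask-guided variant.

```python
# Sketch of a prior-guided Mixture-of-Experts block (assumed design, not the
# authors' code): a gating network conditioned on a prior embedding produces
# soft weights that mix the outputs of several expert MLPs applied to the
# diffusion feature tokens.
import torch
import torch.nn as nn


class PriorGuidedMoE(nn.Module):
    def __init__(self, feat_dim: int, prior_dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert is a small feed-forward network over the feature channels.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim, feat_dim),
                nn.GELU(),
                nn.Linear(feat_dim, feat_dim),
            )
            for _ in range(num_experts)
        )
        # The gate sees only the prior (view / mask embedding), not the features,
        # so routing is driven entirely by the guidance signal.
        self.gate = nn.Linear(prior_dim, num_experts)

    def forward(self, feats: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) flattened spatial tokens; prior: (B, prior_dim)
        weights = torch.softmax(self.gate(prior), dim=-1)                   # (B, E)
        expert_out = torch.stack([e(feats) for e in self.experts], dim=1)   # (B, E, N, D)
        mixed = (weights[:, :, None, None] * expert_out).sum(dim=1)         # (B, N, D)
        return feats + mixed  # residual keeps the base model's features intact


# Usage sketch with hypothetical shapes.
if __name__ == "__main__":
    moe = PriorGuidedMoE(feat_dim=320, prior_dim=64, num_experts=4)
    feats = torch.randn(2, 1024, 320)   # tokens from one diffusion U-Net block
    view_prior = torch.randn(2, 64)     # encoded camera viewpoint (or mask embedding)
    print(moe(feats, view_prior).shape)  # torch.Size([2, 1024, 320])
```

The residual connection is one common way to attach such a module to a pretrained backbone without disturbing its behaviour at initialization, consistent with the abstract's framing of the MoE as an improvement to an existing video diffusion model rather than a replacement.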