
CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation (arXiv:2403.00274v2)

Published 1 Mar 2024 in cs.CV, cs.SD, and eess.AS

Abstract: Listening head generation aims to synthesize a non-verbal, responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation. Applications of listener agents in virtual interaction have motivated many works on diverse and fine-grained motion generation. However, these works can only manipulate motions through simple emotional labels and cannot freely control the listener's motions. Since listener agents should have human-like attributes (e.g., identity, personality) that users can freely customize, this limitation undermines their realism. In this paper, we propose a user-friendly framework called CustomListener to realize listener generation guided by free-form text priors. To achieve speaker-listener coordination, we design a Static to Dynamic Portrait module (SDP), which interacts with speaker information to transform a static text description into a dynamic portrait token carrying completion rhythm and amplitude information. To achieve coherence between segments, we design a Past Guided Generation module (PGG) to maintain the consistency of customized listener attributes through a motion prior, and we use a diffusion-based structure conditioned on the portrait token and the motion prior to realize controllable generation. To train and evaluate our model, we have constructed two text-annotated listening head datasets based on ViCo and RealTalk, which provide paired text-video labels. Extensive experiments verify the effectiveness of our model.
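The abstract describes a conditional diffusion pipeline: a portrait token (from the SDP module) and a motion prior (from the PGG module) jointly condition a denoiser that generates listener motion. Below is a minimal sketch of what such dual conditioning could look like; the `ListenerDenoiser` class, all tensor shapes, and the DDPM-style noise-prediction objective are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a denoiser conditioned on a portrait token and a
# motion prior, as described in the abstract. Module names and shapes are
# illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class ListenerDenoiser(nn.Module):
    """Predicts the noise added to a listener-motion segment, conditioned on
    a portrait token (SDP-like output) and a motion prior (PGG-like output)."""
    def __init__(self, motion_dim=64, cond_dim=128, hidden=256, timesteps=1000):
        super().__init__()
        self.cond_proj = nn.Linear(2 * cond_dim, hidden)  # fuse both conditions
        self.time_emb = nn.Embedding(timesteps, hidden)   # diffusion timestep
        self.net = nn.Sequential(
            nn.Linear(motion_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, t, portrait_token, motion_prior):
        cond = self.cond_proj(torch.cat([portrait_token, motion_prior], dim=-1))
        h = cond + self.time_emb(t)
        return self.net(torch.cat([noisy_motion, h], dim=-1))

# One DDPM-style training step on random stand-in tensors.
B, T_steps = 8, 1000
model = ListenerDenoiser(timesteps=T_steps)
betas = torch.linspace(1e-4, 0.02, T_steps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(B, 64)                       # clean listener-motion features
t = torch.randint(0, T_steps, (B,))
noise = torch.randn_like(x0)
a = alphas_bar[t].unsqueeze(-1)
xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward diffusion q(x_t | x_0)

portrait_token = torch.randn(B, 128)          # stand-in for SDP output
motion_prior = torch.randn(B, 128)            # stand-in for PGG output
loss = nn.functional.mse_loss(model(xt, t, portrait_token, motion_prior), noise)
loss.backward()
```

Concatenating the two condition vectors before a shared projection is one simple fusion choice made here for brevity; the paper's actual architecture may fuse them differently, for example via cross-attention over the portrait token.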

