Emotional Listener Portrait: Neural Listener Head Generation with Emotion (2310.00068v2)
Abstract: Listener head generation centers on generating the non-verbal behaviors (e.g., a smile) of a listener in response to the information delivered by a speaker. A significant challenge in generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which vary depending on the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotions in conversation. Benefiting from the "explicit" and "discrete" design, our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution, but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements compared to previous methods.
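The abstract describes sampling discrete motion-codewords from an emotion-conditioned distribution. As a rough illustration (not the authors' released code), the sketch below shows how such a sampler could look in PyTorch: logits over a learned VQ-VAE-style codebook are conditioned on speaker features and an emotion label, and a codeword is drawn with straight-through Gumbel-softmax so sampling stays differentiable during training. All class names, dimensions, and the emotion set are hypothetical assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionConditionedMotionSampler(nn.Module):
    """Illustrative sketch: predict a categorical distribution over
    discrete motion-codewords conditioned on speaker features and an
    emotion label, then sample with Gumbel-softmax so the discrete
    choice remains differentiable during training."""

    def __init__(self, feat_dim=256, num_emotions=8,
                 num_codewords=256, code_dim=128):
        super().__init__()
        self.emotion_embed = nn.Embedding(num_emotions, feat_dim)
        self.logits_head = nn.Linear(feat_dim, num_codewords)
        # Learned codebook: each row is one motion-codeword (VQ-VAE style).
        self.codebook = nn.Embedding(num_codewords, code_dim)

    def forward(self, speaker_feat, emotion_id, tau=1.0):
        # speaker_feat: (B, feat_dim); emotion_id: (B,) integer labels.
        h = speaker_feat + self.emotion_embed(emotion_id)
        logits = self.logits_head(h)                 # (B, num_codewords)
        # Straight-through Gumbel-softmax yields a one-hot sample.
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        # Compose the sampled codeword into a motion embedding.
        motion_code = onehot @ self.codebook.weight  # (B, code_dim)
        return motion_code, logits

# Usage: resampling for the same speaker input gives diverse listener
# motions; fixing emotion_id steers the response toward a chosen attitude.
sampler = EmotionConditionedMotionSampler()
speaker_feat = torch.randn(4, 256)
emotion_id = torch.tensor([0, 1, 2, 3])  # hypothetical emotion labels
motion_code, _ = sampler(speaker_feat, emotion_id)
print(motion_code.shape)  # torch.Size([4, 128])
```

Under these assumptions, the two behaviors claimed in the abstract map onto the same module: diversity comes from sampling the categorical distribution, and controllability comes from fixing the emotion condition.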