AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents
Abstract: In this study, our goal is to create interactive avatar agents that can autonomously plan and animate nuanced facial movements realistically, from both visual and behavioral perspectives. Given high-level inputs describing the environment and the agent's profile, our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agent's facial motions. These descriptions are processed by our task-agnostic driving engine into motion token sequences, which are then converted into continuous motion embeddings and consumed by our standalone neural renderer to generate the final photorealistic avatar animations. This streamlined pipeline allows our framework to adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, which includes experiments on both newly compiled and existing datasets featuring two types of agents -- one capable of monadic interaction with the environment, and the other designed for dyadic conversation -- validates the effectiveness and versatility of our approach. To our knowledge, this work takes a significant step forward in combining LLMs and neural rendering for generalized non-verbal behavior prediction and photorealistic rendering of avatar agents.
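The three-stage pipeline above (LLM planning → driving engine → neural renderer) can be sketched in code. This is a minimal illustrative mock, not the authors' implementation: every class and function name here (`AgentProfile`, `plan_facial_motions`, `drive`, `render`) is a hypothetical stand-in, the "tokenizer" is a toy word-to-id map rather than a learned motion codebook, and the renderer stub only counts clips instead of synthesizing frames.

```python
# Hypothetical sketch of the disentangled pipeline described in the abstract:
# 1) planning: an LLM expands high-level context into text descriptions,
# 2) driving: descriptions -> discrete motion tokens -> continuous embeddings,
# 3) rendering: embeddings -> photorealistic frames (stubbed here).
from dataclasses import dataclass
from typing import List


@dataclass
class AgentProfile:
    name: str
    persona: str


def plan_facial_motions(env: str, profile: AgentProfile) -> List[str]:
    """Planning stage: in the paper this is an LLM; here a canned stub."""
    return [
        f"{profile.name} raises eyebrows slightly in response to: {env}",
        f"{profile.name} smiles warmly and nods",
    ]


def drive(descriptions: List[str]) -> List[List[float]]:
    """Driving stage: map each description to motion tokens, then to
    continuous embeddings. A learned codebook (e.g. VQ-style) would do
    this in practice; a word-to-id map stands in for it here."""
    vocab: dict = {}
    embeddings = []
    for desc in descriptions:
        tokens = [vocab.setdefault(w, len(vocab)) for w in desc.lower().split()]
        denom = max(len(vocab) - 1, 1)
        # Toy "embedding": normalize token ids into [0, 1].
        embeddings.append([t / denom for t in tokens])
    return embeddings


def render(embeddings: List[List[float]]) -> int:
    """Rendering stage: a neural renderer would consume the motion
    embeddings and emit frames; this stub just counts motion clips."""
    return len(embeddings)


profile = AgentProfile(name="Ava", persona="friendly concierge")
descriptions = plan_facial_motions("guest asks for directions", profile)
clips = render(drive(descriptions))
```

The point of the disentangled design is visible even in this mock: the planner can be swapped (different LLM or prompt), the driving engine is task-agnostic, and the renderer is standalone, so each stage can be improved independently.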