Digital Life Project: Autonomous 3D Characters with Social Intelligence (2312.04547v1)
Abstract: In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters that can engage in social interactions and express themselves through articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars, incorporates a reflection process based on psychology principles, and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching, a proven industry technique that ensures motion quality, with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively, they enable virtual characters to initiate and sustain dialogues autonomously while evolving their socio-psychological states. Concurrently, these characters can perform contextually relevant bodily movements. Additionally, a motion captioning module allows the virtual character to recognize and appropriately respond to human players' actions. Homepage: https://digital-life-project.com/