MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations (2310.10198v3)
Abstract: In this work, we present MoConVQ, a novel unified framework for physics-based motion control leveraging scalable discrete representations. Building upon vector quantized variational autoencoders (VQ-VAE) and model-based reinforcement learning, our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples. The resultant motion representation not only captures diverse motion skills but also offers a robust and intuitive interface for various applications. We demonstrate the versatility of MoConVQ through several applications: universal tracking control from various motion sources, interactive character control with latent motion representations using supervised learning, physics-based motion generation from natural language descriptions using the GPT framework, and, most interestingly, seamless integration with large language models (LLMs) through in-context learning to tackle complex and abstract tasks.