InsActor: Instruction-driven Physics-based Characters (2312.17135v1)

Published 28 Dec 2023 in cs.CV, cs.GR, and cs.RO

Abstract: Generating animation of physics-based characters with intuitive control has long been a desirable task with numerous applications. However, generating physically simulated animations that reflect high-level human instructions remains a difficult problem due to the complexity of physical environments and the richness of human language. In this paper, we present InsActor, a principled generative framework that leverages recent advancements in diffusion-based human motion models to produce instruction-driven animations of physics-based characters. Our framework empowers InsActor to capture complex relationships between high-level human instructions and character motions by employing diffusion policies for flexibly conditioned motion planning. To overcome invalid states and infeasible state transitions in planned motions, InsActor discovers low-level skills and maps plans to latent skill sequences in a compact latent space. Extensive experiments demonstrate that InsActor achieves state-of-the-art results on various tasks, including instruction-driven motion generation and instruction-driven waypoint heading. Notably, the ability of InsActor to generate physically simulated animations using high-level human instructions makes it a valuable tool, particularly in executing long-horizon tasks with a rich set of instructions.

Authors (6)
  1. Jiawei Ren
  2. Mingyuan Zhang
  3. Cunjun Yu
  4. Xiao Ma
  5. Liang Pan
  6. Ziwei Liu
Citations (13)

Summary

Introduction to Instruction-driven Animation

In recent years, there has been significant interest in creating animations that are not only visually realistic but can also be controlled intuitively through human instructions. The goal is to bridge the gap between high-level human commands and fluid, physics-based character movement. Conventional approaches such as motion tracking struggle to map free-form commands onto physically valid motions, especially when the instructions are complex, while conditional generative models often lack the precision needed to control simulated characters. InsActor aims to mitigate these issues with a hierarchical framework that combines diffusion-based motion planning with skill discovery.

The InsActor Framework

The InsActor framework combines high-level motion planning with low-level skill execution to produce animations that can be directed by natural-language instructions. A language-conditioned diffusion model first plans a sequence of character states from the given command. While this planner captures the relationship between instructions and motion, it does not guarantee feasible transitions between planned states, which are critical for physically plausible animation.
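
As a rough illustration of this planning stage, the sketch below shows how a language-conditioned denoiser could be sampled, DDPM-style, to produce a state plan from an instruction embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the names (DenoiserMLP, plan_states), the network architecture, and the noise schedule are illustrative placeholders.

```python
# Minimal sketch of language-conditioned diffusion planning (illustrative only).
import torch
import torch.nn as nn

class DenoiserMLP(nn.Module):
    """Predicts the noise added to a flattened state sequence, conditioned on
    an instruction embedding and the diffusion timestep."""
    def __init__(self, horizon, state_dim, text_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * state_dim + text_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * state_dim),
        )

    def forward(self, x, t, text_emb):
        # x: (B, horizon*state_dim), t: (B, 1) normalized timestep, text_emb: (B, text_dim)
        return self.net(torch.cat([x, t, text_emb], dim=-1))

@torch.no_grad()
def plan_states(model, text_emb, horizon, state_dim, steps=50):
    """DDPM-style ancestral sampling of a state plan given an instruction embedding."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon * state_dim)  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((1, 1), i / (steps - 1))
        eps = model(x, t, text_emb)  # predicted noise at this step
        # reverse-process mean, then add noise except at the final step
        x = (x - betas[i] / torch.sqrt(1.0 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x.view(horizon, state_dim)  # one planned pose per frame
```

In InsActor, the conditioning signal is the human instruction and the plan lives in character state space; the sketch omits details such as guidance and any extra conditions (for example, waypoints).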

To address this, InsActor employs a skill discovery mechanism. State-transition pairs from the plan are encoded into a compact latent space of skill embeddings, and a low-level controller decodes each embedding into actions that drive the simulated character. By decomposing the problem into high-level planning and low-level skill execution, InsActor stays adaptable and scalable.
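
To make the two-level design concrete, the sketch below shows one plausible form of the skill encoder and the skill-conditioned policy: consecutive planned states are encoded into a latent skill embedding, which the low-level policy decodes, together with the current simulated state, into an action. All names (SkillEncoder, SkillPolicy) and architectural choices here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of a skill latent space (assumed architecture, not the paper's code).
import torch
import torch.nn as nn

class SkillEncoder(nn.Module):
    """Encodes a planned state transition (s_t, s_{t+1}) into a compact skill embedding."""
    def __init__(self, state_dim, latent_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )

    def forward(self, s_t, s_next):
        mu, logvar = self.net(torch.cat([s_t, s_next], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return z, mu, logvar

class SkillPolicy(nn.Module):
    """Decodes the current simulated state and a skill embedding into a low-level action."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # e.g. normalized joint targets
        )

    def forward(self, s_t, z):
        return self.net(torch.cat([s_t, z], dim=-1))
```

During execution, each consecutive pair of planned states is encoded to a skill embedding and the policy tracks the plan in simulation; because the embedding lives in a learned space of realizable transitions, invalid or infeasible planned states are effectively mapped onto motions the character can actually perform.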

Performance and Applications of InsActor

The test of InsActor lies not only in how faithfully it turns instructions into animations but also in its robustness under varied conditions and the physical plausibility of its output. In extensive experiments, it achieves state-of-the-art results on tasks including instruction-driven motion generation and instruction-driven waypoint heading. The model also withstands environmental perturbations, which supports its robustness in less controlled settings.

InsActor also adapts to additional conditions, such as multiple waypoints and constraints drawn from both past motion and future targets. This flexibility suggests applications in video game design, virtual reality, and robotics, where turning instructions into embodied motion is directly useful.
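
One common way diffusion planners accommodate such conditions is inpainting-style clamping, where constrained frames of the plan are overwritten at every denoising step. Whether InsActor uses exactly this mechanism is not detailed here, so the snippet below only sketches the general technique, reusing the hypothetical plan layout from the earlier example.

```python
# Sketch of inpainting-style conditioning for a diffusion planner (general technique,
# not necessarily InsActor's exact mechanism). `constraints` maps frame indices to
# target states, e.g. {0: current_pose, horizon - 1: waypoint_state}.
import torch

def apply_constraints(x, constraints, horizon, state_dim):
    """Overwrite constrained frames of a (1, horizon * state_dim) plan tensor."""
    plan = x.view(1, horizon, state_dim)
    for frame, target in constraints.items():
        plan[:, frame, :] = target  # clamp this frame to the desired state
    return plan.reshape(1, horizon * state_dim)
```

Called after each denoising update inside a loop like plan_states above, this keeps the constrained frames fixed while the rest of the plan adapts around them.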

Looking to the Future

Although InsActor demonstrates a strong capability for generating physics-based animations from intuitive instructions, several directions remain open. Improving the computational efficiency of the diffusion model could allow the system to scale to more complex environments and larger datasets, and broadening the framework to accommodate different human body shapes and morphologies is another avenue for development.

Finally, as technologies like InsActor mature, they raise ethical concerns about potential misuse, so users and creators alike must remain mindful of responsible applications of these systems.

InsActor marks a notable step in the evolution of physics-based character animation. It balances high-level user instructions against the generation of physically plausible motion, extending what instruction-driven animation systems can achieve.
