AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents (arXiv:2403.12835v1)

Published 19 Mar 2024 in cs.CV and cs.RO

Abstract: Traditional approaches in physics-based motion generation, centered around imitation learning and reward shaping, often struggle to adapt to new scenarios. To tackle this limitation, we propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach begins by developing a set of atomic actions via a low-level controller trained via imitation learning. Upon receiving an open-vocabulary textual instruction, AnySkill employs a high-level policy that selects and integrates these atomic actions to maximize the CLIP similarity between the agent's rendered images and the text. An important feature of our method is the use of image-based rewards for the high-level policy, which allows the agent to learn interactions with objects without manual reward engineering. We demonstrate AnySkill's capability to generate realistic and natural motion sequences in response to unseen instructions of varying lengths, marking it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.
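
The core mechanism the abstract describes, the image-based reward, can be illustrated with a short sketch: at each step, the agent's rendered frame and the open-vocabulary instruction are embedded with CLIP, and their cosine similarity in the joint embedding space serves as the reward for the high-level policy, with no hand-designed reward terms. The snippet below is a minimal sketch of that idea, not the paper's released implementation: it assumes OpenAI's `clip` package, and the rendering call in the usage comment is a hypothetical placeholder for whatever camera interface the physics simulator exposes.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def clip_reward(frame: Image.Image, instruction: str) -> float:
    """Score a rendered agent frame against a text instruction in CLIP space."""
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([instruction]).to(device)

    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)

    # Normalize so the dot product equals cosine similarity in CLIP's joint space.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()

# Hypothetical usage inside the high-level policy's RL loop
# (env.render_agent_camera is a placeholder, not a real API):
# frame = env.render_agent_camera()
# reward = clip_reward(frame, "wave both hands above the head")
```

Because the reward is computed purely from rendered images and text, the same training loop applies to unseen instructions, which is what lets the method avoid per-task reward engineering.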
