Humanoid Locomotion as Next Token Prediction (2402.19469v1)

Published 29 Feb 2024 in cs.RO, cs.CV, and cs.LG

Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.

Summary

  • The paper introduces generative modeling of sensorimotor trajectories as a novel approach to humanoid control by framing locomotion as next token prediction.
  • It trains a causal transformer to autoregressively predict sensorimotor tokens, learning from heterogeneous data that includes motion capture and online videos.
  • Experiments show zero-shot transfer to real-world walking and tracking accuracy competitive with state-of-the-art reinforcement learning methods.

Humanoid Locomotion Through Generative Modeling of Sensorimotor Trajectories

Introduction

This paper investigates a novel approach to real-world humanoid control by casting it as a next token prediction problem, analogous to predicting the next word in language modeling. Unlike existing methods that rely heavily on reinforcement learning (RL) or model-based control, it autoregressively predicts sensorimotor trajectories with a causal transformer. The primary innovation lies in treating humanoid locomotion like the generation of sequential language tokens, where each token carries a piece of sensory or motor information. This perspective broadens the potential for incorporating diverse and incomplete datasets into training, including sources without action labels, such as videos of human motion. A minimal sketch of the tokenized formulation follows.
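To make the token-prediction framing concrete, here is a minimal sketch of how continuous observations and actions might be embedded and interleaved into a single causal sequence with modality-aligned, shift-by-one targets. All shapes and names are illustrative assumptions, not the authors' released code.

```python
import torch

# Illustrative sizes; the paper's actual token dimensions are not assumed here.
T, obs_dim, act_dim, d_model = 8, 32, 16, 64

obs = torch.randn(T, obs_dim)   # observations o_1 ... o_T
act = torch.randn(T, act_dim)   # actions      a_1 ... a_T

# Embed each modality to a shared width and interleave into one causal
# sequence (o_1, a_1, o_2, a_2, ...) of length 2T for the transformer.
embed_obs = torch.nn.Linear(obs_dim, d_model)
embed_act = torch.nn.Linear(act_dim, d_model)
seq = torch.stack([embed_obs(obs), embed_act(act)], dim=1).reshape(2 * T, d_model)

# Modality-aligned targets: each observation token is trained to predict the
# next observation, and each action token the next action, so every input
# token's target comes from its own modality.
obs_targets = obs[1:]   # targets for o_1 ... o_{T-1}
act_targets = act[1:]   # targets for a_1 ... a_{T-1}
```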

Generative Modeling and Transformers in Robotics

The field of generative modeling and transformers has seen significant advances, particularly in language and vision. This paper extends these successes to robotics and humanoid locomotion, capitalizing on transformers' capacity for representing high-dimensional, multi-modal sensorimotor data. Prior works have demonstrated transformers' efficacy in various robotic tasks through behavior cloning and reinforcement learning; this paper charts a different path: autoregressive modeling of sensorimotor trajectories, which enables learning from both complete and incomplete data sources.

Approach and Dataset

The core methodology trains a transformer to predict future sensorimotor tokens from past sequences, with a key focus on accommodating trajectories with missing information, such as absent action data. This is achieved by substituting mask tokens for missing modalities, allowing the model to learn from a broad spectrum of data, including human videos (see the sketch below). The dataset encompasses trajectories generated by neural network policies, model-based controllers, motion capture, and YouTube videos, providing a heterogeneous mix of high-quality and noisy, imperfect data.
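The sketch below shows one plausible way to realize mask tokens for action-less trajectories (e.g., those reconstructed from video): a learned embedding stands in for the missing action, and those positions are dropped from the action loss. The names (`mask_token`, `has_action`) and the per-step squared-error loss are assumptions for illustration, not the paper's implementation.

```python
import torch

T, act_dim, d_model = 8, 16, 64
embed_act = torch.nn.Linear(act_dim, d_model)
mask_token = torch.nn.Parameter(torch.zeros(d_model))  # learned "no action" token

act = torch.randn(T, act_dim)                          # ground-truth actions
has_action = torch.tensor([1, 1, 0, 0, 1, 1, 0, 1], dtype=torch.bool)

# Use the real action embedding where an action exists, the mask token where
# it is missing, so video-only trajectories still fit the same sequence format.
act_emb = torch.where(has_action[:, None], embed_act(act), mask_token)

# Regression loss on predicted actions, zeroed at missing positions so that
# action-less data supervises only the observation stream.
pred = torch.randn(T, act_dim)                         # stand-in for model output
per_step = ((pred - act) ** 2).mean(dim=-1)
loss = (per_step * has_action).sum() / has_action.sum().clamp(min=1)
```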

Experiments and Results

Extensive experiments show that the model achieves zero-shot transfer to real-world walking, including navigating various terrains in San Francisco. Quantitative evaluations in both real and simulated environments demonstrate tracking accuracy competitive with state-of-the-art RL controllers. That training on data with missing actions still yields high performance underscores the approach's potential to scale to vast, uncurated datasets.
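As a concrete (and purely hypothetical) example of the kind of tracking metric such evaluations use, one might compare commanded and realized base velocities over an episode; the paper's exact protocol is not reproduced here.

```python
import numpy as np

# Commanded vs. realized forward velocity (m/s) over five evaluation windows;
# values are made up for illustration.
commanded = np.array([0.5, 0.5, 0.5, 0.0, -0.3])
realized = np.array([0.47, 0.52, 0.46, 0.05, -0.26])

tracking_error = np.abs(commanded - realized).mean()
print(f"mean velocity tracking error: {tracking_error:.3f} m/s")
```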

Ablations and Analysis

Several ablation studies underscore the significance of design choices in the model's architecture and training regimen. Key findings show the advantage of modeling both sensory observations and motor commands over an action-only prediction objective (contrasted in the toy sketch below). The analysis further verifies that modality-aligned token prediction and joint training on mixed datasets improve the model's learning efficiency and generalization.
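A toy contrast between the two objectives the ablation compares, under the assumption of simple squared-error regression on continuous tokens (shapes and names are illustrative):

```python
import torch

T, obs_dim, act_dim = 8, 32, 16
obs, act = torch.randn(T, obs_dim), torch.randn(T, act_dim)

# Stand-ins for the model's next-token predictions at positions 1..T-1.
pred_obs, pred_act = torch.randn(T - 1, obs_dim), torch.randn(T - 1, act_dim)

# Action-only objective: supervise just the action stream.
action_only_loss = ((pred_act - act[1:]) ** 2).mean()

# Joint objective: also regress the next observation, which the ablations
# find improves learning over action-only prediction.
joint_loss = action_only_loss + ((pred_obs - obs[1:]) ** 2).mean()
```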

Conclusion and Future Directions

The research outlined in this paper posits generative modeling of sensorimotor trajectories as a viable and innovative approach to mastering complex control tasks like humanoid locomotion. By drawing parallels with language modeling and exploiting transformers' capability to handle multimodal data, the paper lays a promising foundation for future work in robotics control. The success of training on mixed-quality datasets opens avenues for leveraging the vast amounts of video data available online, potentially catalyzing advances in autonomous robot movement.

Looking forward, the methodology could be strengthened by scaling the model and its context, further improving performance on diverse tasks beyond locomotion. Combining generative modeling with traditional robotics control strategies may also yield novel insights, driving the field toward more capable and flexible humanoid robots.
