Humanoid Locomotion as Next Token Prediction (2402.19469v1)
Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.
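The modality-aligned objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes trajectories stored as per-step observation and action vectors, a squared-error loss, and a boolean flag marking steps where an action is available. The function name `modality_aligned_loss` and the `has_action` argument are hypothetical. The sketch shows only how each modality is shifted against itself and how steps with missing actions, such as video-only trajectories, drop out of the loss.

```python
import numpy as np

def modality_aligned_loss(pred_obs, pred_act, obs, act, has_action):
    """Next-token loss where each modality is predicted from its own stream.

    pred_obs[t] and pred_act[t] are the model outputs at step t, read as
    predictions for step t + 1 of the same modality.
    obs: (T, Do), act: (T, Da), has_action: (T,) boolean.
    Squared error is an assumption made for this sketch.
    """
    # Observation tokens: prediction at step t is compared with obs[t + 1].
    obs_err = np.mean((pred_obs[:-1] - obs[1:]) ** 2)

    # Action tokens: same shift, but only where the target action exists.
    valid = has_action[1:]
    if valid.any():
        act_err = np.mean((pred_act[:-1][valid] - act[1:][valid]) ** 2)
    else:
        # Trajectory without actions (e.g. video): no action loss at all.
        act_err = 0.0
    return float(obs_err + act_err)

# Toy trajectory: 4 steps, 2-D observations, 1-D actions.
obs = np.arange(8, dtype=float).reshape(4, 2)
pred_obs = np.vstack([obs[1:], np.zeros((1, 2))])  # perfect observation predictions
pred_act = np.zeros((4, 1))

# Video-only trajectory: actions are missing (NaN) and flagged as absent.
video_loss = modality_aligned_loss(
    pred_obs, pred_act, obs, np.full((4, 1), np.nan), np.zeros(4, dtype=bool)
)

# Full trajectory: actions present, so action errors contribute.
full_loss = modality_aligned_loss(
    pred_obs, pred_act, obs, np.ones((4, 1)), np.ones(4, dtype=bool)
)
print(video_loss, full_loss)
```

Because missing targets are excluded by the mask rather than filled in, the same function handles complete and action-free trajectories, which is the property the abstract relies on for mixing data sources.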