Humanoid Locomotion as Next Token Prediction (2402.19469v1)

Published 29 Feb 2024 in cs.RO, cs.CV, and cs.LG

Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.

Summary

  • The paper introduces generative modeling of sensorimotor trajectories as a novel approach to humanoid control by framing locomotion as next token prediction.
  • It trains a causal transformer to autoregressively predict sensorimotor tokens, learning from heterogeneous data that includes motion capture and online videos.
  • Experiments show zero-shot transfer to real-world walking and tracking accuracy competitive with state-of-the-art reinforcement learning methods.

Humanoid Locomotion Through Generative Modeling of Sensorimotor Trajectories

Introduction

This paper investigates a novel approach to real-world humanoid control by casting it as a next token prediction problem, analogous to predicting the next word in language modeling. Unlike existing methods that rely heavily on reinforcement learning (RL) or model-based control, it autoregressively predicts sensorimotor trajectories with a causal transformer. The primary innovation lies in treating humanoid locomotion like the generation of sequential language tokens, where each token carries a piece of sensory or motor information. This perspective broadens the potential for incorporating diverse and incomplete datasets into training, including sources without action labels, such as videos of human motion. A minimal sketch of the tokenized formulation follows.
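To make the token-prediction framing concrete, here is a minimal sketch of how continuous observations and actions might be embedded and interleaved into a single causal sequence with modality-aligned, shift-by-one targets. All shapes and names are illustrative assumptions, not the authors' released code.

```python
import torch

# Illustrative sizes; the paper's actual token dimensions are not assumed here.
T, obs_dim, act_dim, d_model = 8, 32, 16, 64

obs = torch.randn(T, obs_dim)   # observations o_1 ... o_T
act = torch.randn(T, act_dim)   # actions      a_1 ... a_T

# Embed each modality to a shared width and interleave into one causal
# sequence (o_1, a_1, o_2, a_2, ...) of length 2T for the transformer.
embed_obs = torch.nn.Linear(obs_dim, d_model)
embed_act = torch.nn.Linear(act_dim, d_model)
seq = torch.stack([embed_obs(obs), embed_act(act)], dim=1).reshape(2 * T, d_model)

# Modality-aligned targets: each observation token is trained to predict the
# next observation, and each action token the next action, so every input
# token's target comes from its own modality.
obs_targets = obs[1:]   # targets for o_1 ... o_{T-1}
act_targets = act[1:]   # targets for a_1 ... a_{T-1}
```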

Generative Modeling and Transformers in Robotics

The field of generative modeling and transformers has seen significant advances, particularly in language and vision. This paper extends these successes to robotics and humanoid locomotion, capitalizing on transformers' capacity for representing high-dimensional, multi-modal sensorimotor data. Prior works have demonstrated transformers' efficacy in various robotic tasks through behavior cloning and reinforcement learning; this paper charts a different path: autoregressive modeling of sensorimotor trajectories, which enables learning from both complete and incomplete data sources.

Approach and Dataset

The core methodology trains a transformer to predict future sensorimotor tokens from past sequences, with a key focus on accommodating trajectories with missing information, such as absent action data. This is achieved by substituting mask tokens for missing modalities, allowing the model to learn from a broad spectrum of data, including human videos (see the sketch below). The dataset encompasses trajectories generated by neural network policies, model-based controllers, motion capture, and YouTube videos, providing a heterogeneous mix of high-quality and noisy, imperfect data.
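The sketch below shows one plausible way to realize mask tokens for action-less trajectories (e.g., those reconstructed from video): a learned embedding stands in for the missing action, and those positions are dropped from the action loss. The names (`mask_token`, `has_action`) and the per-step squared-error loss are assumptions for illustration, not the paper's implementation.

```python
import torch

T, act_dim, d_model = 8, 16, 64
embed_act = torch.nn.Linear(act_dim, d_model)
mask_token = torch.nn.Parameter(torch.zeros(d_model))  # learned "no action" token

act = torch.randn(T, act_dim)                          # ground-truth actions
has_action = torch.tensor([1, 1, 0, 0, 1, 1, 0, 1], dtype=torch.bool)

# Use the real action embedding where an action exists, the mask token where
# it is missing, so video-only trajectories still fit the same sequence format.
act_emb = torch.where(has_action[:, None], embed_act(act), mask_token)

# Regression loss on predicted actions, zeroed at missing positions so that
# action-less data supervises only the observation stream.
pred = torch.randn(T, act_dim)                         # stand-in for model output
per_step = ((pred - act) ** 2).mean(dim=-1)
loss = (per_step * has_action).sum() / has_action.sum().clamp(min=1)
```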

Experiments and Results

Extensive experiments show that the model achieves zero-shot transfer to real-world walking, including navigating various terrains in San Francisco. Quantitative evaluations in both real and simulated environments demonstrate tracking accuracy competitive with state-of-the-art RL controllers. That training on data with missing actions still yields high performance underscores the approach's potential to scale to vast, uncurated datasets.
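As a concrete (and purely hypothetical) example of the kind of tracking metric such evaluations use, one might compare commanded and realized base velocities over an episode; the paper's exact protocol is not reproduced here.

```python
import numpy as np

# Commanded vs. realized forward velocity (m/s) over five evaluation windows;
# values are made up for illustration.
commanded = np.array([0.5, 0.5, 0.5, 0.0, -0.3])
realized = np.array([0.47, 0.52, 0.46, 0.05, -0.26])

tracking_error = np.abs(commanded - realized).mean()
print(f"mean velocity tracking error: {tracking_error:.3f} m/s")
```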

Ablations and Analysis

Several ablation studies underscore the significance of design choices in the model's architecture and training regimen. Key findings show the advantage of modeling both sensory observations and motor commands over an action-only prediction objective (contrasted in the toy sketch below). The analysis further verifies that modality-aligned token prediction and joint training on mixed datasets improve the model's learning efficiency and generalization.
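A toy contrast between the two objectives the ablation compares, under the assumption of simple squared-error regression on continuous tokens (shapes and names are illustrative):

```python
import torch

T, obs_dim, act_dim = 8, 32, 16
obs, act = torch.randn(T, obs_dim), torch.randn(T, act_dim)

# Stand-ins for the model's next-token predictions at positions 1..T-1.
pred_obs, pred_act = torch.randn(T - 1, obs_dim), torch.randn(T - 1, act_dim)

# Action-only objective: supervise just the action stream.
action_only_loss = ((pred_act - act[1:]) ** 2).mean()

# Joint objective: also regress the next observation, which the ablations
# find improves learning over action-only prediction.
joint_loss = action_only_loss + ((pred_obs - obs[1:]) ** 2).mean()
```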

Conclusion and Future Directions

The research outlined in this paper posits generative modeling of sensorimotor trajectories as a viable and innovative approach to mastering complex control tasks like humanoid locomotion. By drawing parallels with language modeling and exploiting transformers' capability to handle multimodal data, the paper lays a promising foundation for future work in robotics control. The success of training on mixed-quality datasets opens avenues for leveraging the vast amounts of video data available online, potentially catalyzing advances in autonomous robot movement.

Looking forward, the methodology could be strengthened by scaling the model and its context, further improving performance on diverse tasks beyond locomotion. Combining generative modeling with traditional robotics control strategies may also yield novel insights, driving the field toward more capable and flexible humanoid robots.
