Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics (2403.19578v3)
Abstract: We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.
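To make the pipeline concrete, the sketch below illustrates the core idea in Python: serialise keypoint observations and action trajectories as plain text, place the demonstrations in a few-shot prompt, and let a text-pretrained Transformer complete the action sequence for a new observation. The serialisation format, prompt wording, and helper names (`tokenise_keypoints`, `tokenise_actions`, `build_prompt`, `predict_actions`) are illustrative assumptions rather than the paper's exact scheme; only the general few-shot, no-fine-tuning setup follows the abstract.

```python
# Minimal sketch of in-context imitation with tokenised keypoints and actions.
# Assumes the official OpenAI Python client (openai>=1.0) and an API key in the environment.
from openai import OpenAI


def tokenise_keypoints(keypoints):
    """Flatten a list of (x, y, z) visual keypoints into a compact text string."""
    return ";".join(f"{x:.0f},{y:.0f},{z:.0f}" for x, y, z in keypoints)


def tokenise_actions(trajectory):
    """Flatten a sequence of end-effector poses (e.g. x, y, z, roll, pitch, yaw) into text."""
    return ";".join(",".join(f"{v:.0f}" for v in pose) for pose in trajectory)


def build_prompt(demos, test_keypoints):
    """Turn (observation, action-trajectory) demonstrations into a few-shot text prompt."""
    lines = ["Map each observation to an action sequence, following the examples."]
    for keypoints, trajectory in demos:
        lines.append(f"Observation: {tokenise_keypoints(keypoints)}")
        lines.append(f"Actions: {tokenise_actions(trajectory)}")
    lines.append(f"Observation: {tokenise_keypoints(test_keypoints)}")
    lines.append("Actions:")
    return "\n".join(lines)


def predict_actions(demos, test_keypoints, model="gpt-4-turbo"):
    """Query a text-pretrained Transformer, with no additional training, for action tokens."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(demos, test_keypoints)}],
        temperature=0.0,
    )
    # The returned text is the tokenised action trajectory, to be parsed back into poses.
    return response.choices[0].message.content
```

In this sketch the demonstrations act purely as in-context examples: no gradient updates are made to the language model, which is the sense in which the abstract describes repurposing a text-only model for vision-to-action mapping.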