- The paper introduces LAPA, a novel method that learns latent actions from unlabeled videos to pretrain Vision-Language-Action models.
- It quantizes latent actions with a VQ-VAE-style objective and trains a Vision-Language Model to predict them from video observations and task descriptions.
- LAPA improves average success rate by more than 6.22% and is over 30x more efficient to pretrain than state-of-the-art models trained on ground-truth action labels, with gains on cross-environment and unseen-object tasks.
Overview of "Latent Action Pretraining From Videos"
The paper "Latent Action Pretraining From Videos" presents Latent Action Pretraining for general Action models (LAPA), a novel unsupervised method aimed at pretraining Vision-Language-Action (VLA) models without needing explicit robot action labels. This research addresses limitations encountered in existing VLA models, which typically require action labels collected by humans, thus constraining the data sources and scaling opportunities. LAPA is structured to learn from large-scale internet videos lacking robot action labels, enabling the utilization of diverse, real-world human behavior exemplified in these videos.
Methodology
LAPA consists of two pretraining stages followed by a fine-tuning stage that maps latent actions to actual robot actions:
- Latent Action Quantization: The authors train a VQ-VAE-based objective on consecutive video frames to learn a discrete vocabulary of latent actions, in effect tokenizing actions without predefined priors such as end-effector or joint positions (a minimal sketch follows this list).
- Latent Pretraining: A Vision-Language Model is pretrained to predict these latent actions from video observations and task descriptions, avoiding the need for labeled action data.
- Fine-tuning: Finally, the VLA model is fine-tuned on a small robot-action-labeled dataset to map from latent actions to real robot actions (see the second sketch below).
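To make the first stage concrete, below is a minimal PyTorch-style sketch of a VQ-VAE-style latent action quantizer. It is an illustrative assumption rather than the paper's implementation: the encoder infers a discrete code from a pair of consecutive frame features, and the decoder must reconstruct the next frame from the current frame plus that code, so the codebook is pushed to capture "what changed", i.e., a latent action. The frame features, network sizes, codebook size, and loss weights are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionQuantizer(nn.Module):
    """Sketch of a VQ-VAE-style latent action model (not the paper's exact architecture)."""

    def __init__(self, frame_dim=512, codebook_size=8, code_dim=128, beta=0.25):
        super().__init__()
        self.beta = beta
        # Encoder sees (current frame, next frame) features and outputs a latent action embedding.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, code_dim)
        )
        # Discrete codebook of latent actions (no end-effector or joint-position priors).
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Decoder predicts the next frame from (current frame, quantized latent action).
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + code_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim)
        )

    def quantize(self, z):
        # Nearest-neighbour codebook lookup with a straight-through gradient estimator.
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        indices = dists.argmin(dim=-1)                 # discrete latent action ids
        z_q = self.codebook(indices)
        return z + (z_q - z).detach(), indices

    def forward(self, frame_t, frame_t1):
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        z_q, indices = self.quantize(z)
        recon = self.decoder(torch.cat([frame_t, z_q], dim=-1))
        recon_loss = F.mse_loss(recon, frame_t1)                        # next-frame reconstruction
        codebook_loss = F.mse_loss(self.codebook(indices), z.detach())  # move codes toward encoder outputs
        commit_loss = F.mse_loss(z, self.codebook(indices).detach())    # keep encoder close to its code
        return recon_loss + codebook_loss + self.beta * commit_loss, indices


# Usage sketch: frame features could come from any frozen visual encoder (hypothetical here).
model = LatentActionQuantizer()
f_t, f_t1 = torch.randn(4, 512), torch.randn(4, 512)
loss, latent_action_ids = model(f_t, f_t1)
loss.backward()
```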
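The remaining two stages can be sketched in the same spirit: a vision-language backbone (a toy placeholder below) is first supervised to predict the quantizer's discrete latent action ids (latent pretraining), and is then fine-tuned with a small head that outputs real robot actions. The backbone, head shapes, and plain regression loss are simplifying assumptions for illustration, not the paper's training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionPolicy(nn.Module):
    """Illustrative two-stage policy head; `backbone` stands in for a vision-language model."""

    def __init__(self, backbone, hidden_dim=768, codebook_size=8, robot_action_dim=7):
        super().__init__()
        self.backbone = backbone                                    # (image, instruction) -> features
        self.latent_head = nn.Linear(hidden_dim, codebook_size)     # stage 2: predict latent action ids
        self.action_head = nn.Linear(hidden_dim, robot_action_dim)  # stage 3: map to real robot actions

    def pretrain_loss(self, image, instruction, latent_action_ids):
        feats = self.backbone(image, instruction)
        return F.cross_entropy(self.latent_head(feats), latent_action_ids)

    def finetune_loss(self, image, instruction, robot_actions):
        feats = self.backbone(image, instruction)
        return F.mse_loss(self.action_head(feats), robot_actions)


# Toy backbone purely for illustration; LAPA builds on a pretrained vision-language model.
class ToyBackbone(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(512 + 32, hidden_dim)

    def forward(self, image_feats, text_feats):
        return self.proj(torch.cat([image_feats, text_feats], dim=-1))


policy = LatentActionPolicy(ToyBackbone())
img, txt = torch.randn(4, 512), torch.randn(4, 32)
# Stage 2: supervise with latent action ids produced by the quantizer above (no robot labels needed).
loss_pretrain = policy.pretrain_loss(img, txt, torch.randint(0, 8, (4,)))
# Stage 3: fine-tune on a small labeled dataset of real robot actions (e.g. 7-DoF commands).
loss_finetune = policy.finetune_loss(img, txt, torch.randn(4, 7))
```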
Experimental Findings
The authors conducted experiments across various simulation and real-world environments to evaluate LAPA's efficacy:
- Performance: LAPA outperformed previous methods that learn from actionless videos, demonstrating significant gains on cross-environment and cross-embodiment tasks, unseen language instructions, and interactions with unseen objects.
- Benchmark Comparisons: On the Language Table benchmark, LAPA achieved superior results in both seen and unseen scenarios compared with alternative approaches such as UniPi and VPT. In real-world experiments, LAPA outperformed OpenVLA, particularly when pretrained on human video data, highlighting its robustness to embodiment gaps.
- Pretraining Efficiency: LAPA required substantially less pretraining than existing state-of-the-art models while achieving comparable or better performance.
Strong Numerical Results
In the reported experiments, LAPA improved average success rate by more than 6.22% over state-of-the-art models trained with ground-truth action labels, and it achieved more than 30 times greater pretraining efficiency.
Implications and Future Directions
LAPA opens new directions for leveraging large-scale, unlabeled video data in robotics, potentially reducing dependence on costly labeled datasets. This shift suggests that future work may extend to more complex action spaces while drawing on the diverse human behavior captured on online platforms, with improved generalization across environments and tasks supporting progress toward generalist robotic policies.
In conclusion, LAPA is a substantial step toward scalable, flexible robotic foundation models built on vast, unstructured video data. The approach could enable action models to learn from the wealth of unlabeled human motion and interaction data available online; realizing that potential will depend on further work on latent action representations and on scaling data diversity.