Real-World Robot Learning with Masked Visual Pre-training (2210.03109v1)

Published 6 Oct 2022 in cs.RO, cs.CV, and cs.LG

Abstract: In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.

Citations (210)

Summary

  • The paper shows that MAE-based pre-training improves real-world robotic control success rates by up to 75% over CLIP and up to 81% over supervised ImageNet pre-training and training from scratch.
  • The paper employs a 307-million parameter Vision Transformer pre-trained on 4.5M diverse images, reducing required demonstrations by up to 50%.
  • The paper reveals that scalable self-supervised learning enables robust adaptability across diverse robotic tasks and real-world environments.

Real-World Robot Learning with Masked Visual Pre-training

The paper explores the application of self-supervised visual pre-training to robotic control in real-world environments. Visual representations are learned with a masked autoencoder (MAE), frozen, and passed into a learnable control module; the work evaluates this recipe against other state-of-the-art visual encoders deployed in robotic systems.
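
To make the recipe concrete, the following is a minimal sketch assuming a PyTorch-style setup: the `encoder` argument stands in for the pre-trained MAE/ViT backbone, the MLP control head and the feature/proprioception dimensions are illustrative placeholders, and the behavior-cloning regression loss is one common choice for the downstream control objective rather than a detail confirmed by this summary.

```python
import torch
import torch.nn as nn


class FrozenEncoderPolicy(nn.Module):
    """Frozen pre-trained visual encoder + small learnable control head (illustrative sketch)."""

    def __init__(self, encoder: nn.Module, feat_dim: int, proprio_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # freeze the pre-trained representation
            p.requires_grad = False
        self.control = nn.Sequential(            # learnable control module (MLP head)
            nn.Linear(feat_dim + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                    # encoder stays fixed during control learning
            feat = self.encoder(image)           # (B, feat_dim) visual features
        return self.control(torch.cat([feat, proprio], dim=-1))


def bc_step(policy, optimizer, image, proprio, expert_action):
    """One behavior-cloning update: regress demonstrated actions from observations."""
    pred = policy(image, proprio)
    loss = nn.functional.mse_loss(pred, expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```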

Key Contributions and Findings

A significant contribution of this research lies in demonstrating the superiority of MAE-based visual encoders over existing alternatives, namely CLIP, supervised ImageNet-pretrained models, and models trained from scratch, for robotic tasks. The paper presents empirical evidence that the pre-trained visual encoders yield marked improvements in both task performance and sample efficiency: up to 75% over CLIP and up to 81% over supervised ImageNet pre-training and training from scratch.

  1. Data Pre-training and System Architecture:
    • The paper pre-trains on a massive dataset of 4.5 million images drawn from Internet and egocentric video sources. This data diversity is crucial for shaping an encoder capable of handling varied task environments and embodiments.
    • A 307-million parameter Vision Transformer (ViT-Large), significantly larger than models conventionally used in robotics, is employed to scale visual pre-training and increase model capacity (a sketch of the MAE-style masking used in pre-training follows this list).
  2. Performance Evaluation:
    • The paper reports extensive evaluations across 981 real-world experiments, ranging from fundamental motor tasks (e.g., reach, push, pick) to more complex tasks involving varied objects and environments.
    • The results indicate that MAE-based encoders not only achieve higher success rates than the competing encoders but also require up to 50% fewer demonstrations to match performance thresholds.
  3. Scalability Insights:
    • The research underscores the importance of jointly scaling model capacity and dataset size to realize the full potential of self-supervised learning in robotic applications. In particular, the benefit of the larger dataset becomes apparent when moving from ViT-Base to the larger ViT-Large.
  4. Adaptation to Various Contexts:
    • The paper provides insights into the encoder’s adaptability by applying it to a multi-finger Allegro hand and demonstrating its capability to perform diverse tasks effectively. This adaptability is essential for developing universal visual representations applicable across different robotic domains.
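
As referenced in item 1, the sketch below illustrates the MAE-style random masking behind the pre-training stage (following the masked autoencoder recipe of He et al., 2022): a large fraction of image patches is hidden and the model learns to reconstruct them. The function name, tensor shapes, and the 75% mask ratio are illustrative assumptions rather than details stated in this summary.

```python
import torch


def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking over patch embeddings (shapes are illustrative).

    patch_tokens: (B, N, D) patch embeddings of an image. Only the randomly kept
    patches are fed to the encoder; a lightweight decoder is then trained to
    reconstruct the pixels of the masked patches.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                           # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]      # indices of visible patches
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    return visible, keep_idx                           # encoder input + bookkeeping


# Example: a 224x224 image with 16x16 patches gives N = 196 tokens;
# at a 75% mask ratio the encoder processes only 49 visible patches.
```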

Practical and Theoretical Implications

From a practical perspective, the demonstrated improvements in task performance and data efficiency highlight the feasibility of using MAE-based representations in robotics. Because the encoder is pre-trained once and then frozen, the approach reduces dependence on task-specific visual engineering and allows broader deployment across robotic applications with varying contexts and constraints.

Theoretically, this paper supports the notion that self-supervised visual pre-training can distill complex visual observations into representations that make control policy learning more efficient. The systematic joint scaling of model size and data provides a methodology for future studies aiming to further explore the limits of self-supervised learning in robotics.

Conclusion

In summary, the paper provides substantial evidence that masked visual pre-training can significantly enhance robot learning in the real world, marking a pivotal step towards more generalizable and scalable robotic systems. Future research could explore expanding this approach to more complex robotic tasks, considering multi-step planning and dynamic environmental conditions, ultimately driving further advancements in autonomous systems.