
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (2206.11795v1)

Published 23 Jun 2022 in cs.LG and cs.AI

Abstract: Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.

Citations (245)

Summary

  • The paper introduces VPT, a method that uses semi-supervised imitation learning to harness unlabeled online videos for sequential decision-making tasks.
  • It outlines a pipeline where an inverse dynamics model pseudo-labels data, achieving 90.6% keypress accuracy on Minecraft with minimal labeled input.
  • Fine-tuning with reinforcement learning enables complex in-game actions like diamond crafting, showcasing VPT's potential in scalable decision domains.

An Overview of Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

The paper "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos" presents an innovative methodology for training models in sequential decision-making environments. The authors aim to extend the pretraining paradigm, which has been successful in text and image modalities, to domains such as robotics and video games through a process they call Video PreTraining (VPT). This paper focuses on the video game Minecraft, leveraging the extensive availability of online gameplay videos.

Methodological Contributions

Key to the VPT approach is semi-supervised imitation learning, where unlabeled videos serve as the primary dataset. The authors implement a structured pipeline as follows:

  1. Inverse Dynamics Model (IDM): Initially trained on a relatively small labeled dataset collected from human contractors, the IDM predicts the action taken between consecutive observed frames. The model's accuracy is noteworthy: 90.6% keypress accuracy from less than 2000 hours of labeled data.
  2. Pseudo-Labeling and Data Filtering: This IDM is then used to pseudo-label a vast collection of unlabeled Minecraft videos curated from the internet. Additionally, to ensure the relevance and quality of data, videos undergo a cleaning process to eliminate those with artifacts or from alternate game modes.
  3. Behavioral Cloning and Foundation Model Training: Using the pseudo-labeled data, the authors train a foundation model, which demonstrates nontrivial zero-shot capabilities—that is, the model exhibits complex in-game skills without task-specific fine-tuning.
  4. Fine-Tuning: The pretrained model is further refined through both behavioral cloning and reinforcement learning (RL), improving its performance on specific tasks. Notably, the fine-tuned agents can craft diamond tools, a feat that takes even proficient human players upwards of 20 minutes of gameplay.

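The pipeline above can be illustrated end to end on a toy problem. The sketch below is purely schematic, assuming a hypothetical 1-D gridworld in place of Minecraft, an exact inverse dynamics rule in place of the learned IDM network, and majority voting in place of neural behavioral cloning; only the control flow (label a little, pseudo-label a lot, clone a policy) mirrors VPT:

```python
import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(0)

# Toy 1-D gridworld stand-in for Minecraft: states 0..9, two actions.
LEFT, RIGHT = 0, 1

def step(state, action):
    return max(0, state - 1) if action == LEFT else min(9, state + 1)

# --- 1. Inverse dynamics model: infer the action from two consecutive states.
#        In VPT this is a network trained on a small labeled dataset; here the
#        toy dynamics are simple enough that the inverse is exact.
def idm(s_t, s_t1):
    return RIGHT if s_t1 >= s_t else LEFT

# --- 2. Pseudo-label an "unlabeled video corpus": state trajectories from a
#        scripted demonstrator whose actions are hidden from us.
def demo_trajectory(start, length=8):
    states = [start]
    for _ in range(length):
        states.append(step(states[-1], RIGHT))  # demonstrator heads right
    return states

corpus = [demo_trajectory(int(rng.integers(0, 10))) for _ in range(50)]
pseudo_labels = [(s, idm(s, s1))
                 for traj in corpus for s, s1 in zip(traj, traj[1:])]

# --- 3. Behavioral cloning on pseudo-labels: majority action per state.
votes = defaultdict(Counter)
for s, a in pseudo_labels:
    votes[s][a] += 1
policy = {s: c.most_common(1)[0][0] for s, c in votes.items()}

print(policy)  # the cloned prior heads RIGHT from every visited state
```

Step 4 (RL fine-tuning) would then treat `policy` as the initialization, which is exactly where the pretrained prior pays off on hard-exploration tasks.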
Results and Implications

The results underscore the potential of extending pretraining techniques to interactive environments. The VPT method not only enables models to exhibit human-level performance on intricate tasks but also addresses exploration bottlenecks traditionally encountered in RL settings. The choice of Minecraft, a complex, open-ended sandbox game, suggests that the method can transfer to real-world domains that pose similarly varied, open-ended challenges.

Insights on Data Efficiency

A salient outcome of the research is the demonstration of IDM’s data efficiency relative to direct behavioral cloning. The paper indicates that minimal labeled data, when used effectively within the VPT pipeline, can unlock significant volumes of online unlabeled data, drastically reducing the resources traditionally necessary for imitation learning projects.
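This efficiency gap has a structural explanation stated in the paper: the IDM is non-causal, conditioning on frames both before and after an action, whereas a behavioral-cloning policy must act causally from past frames alone, which makes the IDM's prediction problem much easier per labeled example. A minimal sketch of the two attention patterns (toy clip length, hypothetical mask convention):

```python
import numpy as np

T = 5  # frames in a toy clip

# Causal mask (policy / behavioral cloning): frame t may attend only to
# frames <= t, since the agent cannot see the future when acting.
causal = np.tril(np.ones((T, T), dtype=bool))

# Non-causal mask (inverse dynamics model): frame t may attend to every
# frame, past and future, so the intervening action is far easier to infer.
non_causal = np.ones((T, T), dtype=bool)

print(int(causal.sum()), int(non_causal.sum()))  # 15 25
```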

The exploration of scaling properties also sheds light on the importance of balancing model size and data volume, showing that larger models retain their advantage, and that this advantage is particularly pronounced when fine-tuning on specialized tasks.

Looking Ahead

While VPT was experimentally applied to Minecraft, its broader implications are profound. This framework can potentially be adapted to other domains with substantial unlabeled data, such as user behavior modeling in computer interfaces or robotics.

Furthermore, the preliminary explorations into text-conditioned task execution suggest new research avenues. By enhancing text-based conditioning, models could eventually gain the ability to perform a wide array of tasks specified via natural language, expanding the utility and adaptability of pretrained agents significantly.

Overall, the paper provides compelling groundwork for the future development of AI agents that acquire sophisticated behaviors through pretraining paradigms. The deliberate, structured VPT pipeline offers a robust template that future research in sequential decision domains is likely to build on.
