Emergent Mind

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

(arXiv:2310.08235)
Published Oct 12, 2023 in cs.AI and cs.LG

Abstract

We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to use reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. We derive a new learning framework that learns such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT as a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on our proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap and achieves a 70% win rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates interesting emergent properties, including goal composition and complex gameplay behavior synthesis. The project page is available at https://craftjarvis-groot.github.io.
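The encoder-decoder design described in the abstract can be sketched at a high level: a video encoder pools a reference clip into a fixed goal embedding, and a causally masked policy decoder maps that goal plus past observations to per-step action logits. The sketch below is illustrative only, not the paper's implementation — the class names, layer sizes, mean-pooling, and single-head attention are all assumptions; the key property it demonstrates is the causal mask, which prevents the policy at step t from attending to future observations.

```python
import numpy as np

rng = np.random.default_rng(0)


class VideoGoalEncoder:
    """Pools per-frame features of a reference video into one goal embedding."""

    def __init__(self, feat_dim, goal_dim):
        self.w = rng.normal(0, 0.02, (feat_dim, goal_dim))
        self.b = np.zeros(goal_dim)

    def __call__(self, frames):          # frames: (T_video, feat_dim)
        return (frames @ self.w + self.b).mean(axis=0)   # (goal_dim,)


class CausalPolicyDecoder:
    """Single-head causal self-attention over observations, goal-conditioned."""

    def __init__(self, obs_dim, goal_dim, n_actions):
        d = obs_dim + goal_dim
        self.wq = rng.normal(0, 0.02, (d, d))
        self.wk = rng.normal(0, 0.02, (d, d))
        self.wv = rng.normal(0, 0.02, (d, d))
        self.head = rng.normal(0, 0.02, (d, n_actions))

    def __call__(self, obs_seq, goal):   # obs_seq: (T, obs_dim)
        T = obs_seq.shape[0]
        # Concatenate the goal embedding onto every observation.
        x = np.concatenate([obs_seq, np.tile(goal, (T, 1))], axis=1)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        scores = q @ k.T / np.sqrt(x.shape[1])
        # Causal mask: step i may only attend to steps j <= i.
        scores[np.triu(np.ones((T, T)), k=1).astype(bool)] = -1e9
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        return (attn @ v) @ self.head    # (T, n_actions) action logits


# Smoke run with random "video" frames and observations (hypothetical sizes).
enc = VideoGoalEncoder(feat_dim=8, goal_dim=4)
goal = enc(rng.normal(size=(5, 8)))          # 5-frame reference video
dec = CausalPolicyDecoder(obs_dim=6, goal_dim=4, n_actions=3)
obs = rng.normal(size=(7, 6))                # 7 observed timesteps
logits = dec(obs, goal)                      # (7, 3)
```

Because of the mask, perturbing the final observation leaves the logits of all earlier steps unchanged — the property that makes the decoder usable for online, step-by-step control.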


