We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. We derive a new learning framework that learns such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent, GROOT, as a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on the proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap and exhibits a 70% win rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further reveals interesting emergent properties, including goal composition and complex gameplay behavior synthesis. The project page is available at https://craftjarvis-groot.github.io.
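The abstract describes a two-part design: a video instruction encoder that maps a reference video to a latent goal, and a causal policy decoder that conditions on that goal plus the observation history to emit actions. The following is a minimal, self-contained sketch of that control flow only; the function names, dimensions, and pooling/scoring logic are illustrative stand-ins, not GROOT's actual transformer-based implementation.

```python
# Hypothetical sketch of a video-as-instruction controller's control flow.
# Real GROOT components are causal transformers; here they are replaced by
# trivial stand-ins so the overall encoder-decoder structure is visible.
from dataclasses import dataclass
from typing import List


@dataclass
class GoalEmbedding:
    """Latent goal induced from a reference video."""
    vector: List[float]


def encode_video(frames: List[List[float]]) -> GoalEmbedding:
    """Stand-in video instruction encoder: mean-pools per-frame features.

    A real encoder would be a transformer over the frame sequence that
    produces a structured goal embedding.
    """
    dim = len(frames[0])
    pooled = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return GoalEmbedding(vector=pooled)


def decode_action(goal: GoalEmbedding, obs_history: List[List[float]]) -> int:
    """Stand-in causal policy decoder.

    Conditions only on the goal and past observations (causal), and returns
    a discrete action index from a dummy 4-action space.
    """
    score = sum(goal.vector) + sum(obs_history[-1])
    return int(score) % 4


# Usage: encode a (toy) reference video, then act given the current history.
frames = [[0.1, 0.2], [0.3, 0.4]]
goal = encode_video(frames)
action = decode_action(goal, obs_history=[[0.5, 0.5]])
```

The key point the sketch preserves is the separation of concerns: the encoder runs once per instruction video to produce a goal, while the decoder runs at every control step conditioned on that fixed goal.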