
GROOT: Learning to Follow Instructions by Watching Gameplay Videos (2310.08235v2)

Published 12 Oct 2023 in cs.AI and cs.LG

Abstract: We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis. The project page is available at https://craftjarvis-groot.github.io.
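The abstract describes an encoder-decoder design: a transformer encodes a reference gameplay video into a latent goal, and a causal transformer policy predicts actions conditioned on that goal, trained purely from gameplay videos (self-imitation, no text labels). The following PyTorch sketch illustrates that general idea under assumptions of ours, not the authors' implementation: frame and observation features are taken as precomputed vectors, dimensions and the VAE-style goal regularizer are placeholders, and the action space is simplified to a single discrete head.

```python
import torch
import torch.nn as nn

class VideoGoalEncoder(nn.Module):
    """Pools a reference video (per-frame features) into one latent goal vector.
    In practice the frame features would come from a visual backbone; here they
    are assumed to be precomputed."""
    def __init__(self, feat_dim=512, goal_dim=256, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned summary token whose output summarizes the whole video.
        self.summary = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.to_goal = nn.Linear(feat_dim, goal_dim * 2)  # mean and log-variance

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim)
        b = frame_feats.size(0)
        tokens = torch.cat([self.summary.expand(b, -1, -1), frame_feats], dim=1)
        pooled = self.encoder(tokens)[:, 0]              # summary-token output
        mu, logvar = self.to_goal(pooled).chunk(2, dim=-1)
        # Reparameterized sample, giving a structured (VAE-like) goal space.
        goal = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return goal, mu, logvar

class GoalConditionedPolicy(nn.Module):
    """Causal transformer: predicts an action at every timestep from the
    observation history, conditioned on the goal embedding."""
    def __init__(self, obs_dim=512, goal_dim=256, n_actions=121,
                 d_model=512, n_layers=4, n_heads=8, max_len=128):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.goal_proj = nn.Linear(goal_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_feats, goal):
        # obs_feats: (batch, T, obs_dim); goal: (batch, goal_dim)
        T = obs_feats.size(1)
        x = self.obs_proj(obs_feats) + self.pos[:, :T] \
            + self.goal_proj(goal)[:, None]
        # Causal mask so each step only attends to past observations.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.action_head(h)                       # (batch, T, n_actions)

# Self-supervised training sketch: the goal is encoded from the same
# trajectory the policy must imitate, so no text-gameplay annotations are
# needed. Loss is behavior cloning plus a KL regularizer on the goal latent.
enc, pi = VideoGoalEncoder(), GoalConditionedPolicy()
frames = torch.randn(2, 32, 512)                 # placeholder frame features
actions = torch.randint(0, 121, (2, 32))         # placeholder action labels
goal, mu, logvar = enc(frames)
logits = pi(frames, goal)
bc = nn.functional.cross_entropy(logits.flatten(0, 1), actions.flatten())
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
loss = bc + 0.01 * kl
```

At inference time the same encoder can embed a new reference video (e.g. someone chopping a tree in Minecraft) and the policy rolls out autoregressively conditioned on that goal; the weights and loss balance above are illustrative only.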
