Genie: Generative Interactive Environments (2402.15391v1)
Abstract: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening a path toward training the generalist agents of the future.
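Since the abstract names three components and a frame-by-frame interaction loop, the following minimal sketch illustrates how they might compose at inference time. It is a hypothetical illustration, not the paper's implementation: all module names, shapes, vocabulary sizes, and the placeholder transition rule are assumptions standing in for the actual learned transformer components.

```python
# Hypothetical sketch of Genie's three-component inference loop.
# Every name, shape, and constant here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

NUM_LATENT_ACTIONS = 8      # assumed size of the discrete latent action vocabulary
NUM_VIDEO_TOKENS = 1024     # assumed codebook size for the video tokenizer
TOKENS_PER_FRAME = 16 * 16  # assumed spatial token grid per frame

def tokenize(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the spatiotemporal video tokenizer:
    maps a raw frame to a grid of discrete token ids."""
    flat = frame.reshape(TOKENS_PER_FRAME, -1).mean(axis=1)
    return (flat * NUM_VIDEO_TOKENS).astype(int) % NUM_VIDEO_TOKENS

def dynamics_model(token_history: list, action: int) -> np.ndarray:
    """Stand-in for the autoregressive dynamics model: given past
    frame tokens and a discrete latent action, predict next-frame tokens.
    The real model is a learned transformer; this is a toy transition."""
    prev = token_history[-1]
    return (prev + action + 1) % NUM_VIDEO_TOKENS

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the tokenizer's decoder: token ids back to pixels."""
    return (tokens / NUM_VIDEO_TOKENS).reshape(16, 16, 1)

# Frame-by-frame interaction: prompt with a single image, then step the
# world with user-chosen latent actions. The latent action space is
# learned without ground-truth action labels, so at play time the user
# simply selects one of the discrete latent actions each frame.
prompt = rng.random((16, 16, 3))
history = [tokenize(prompt)]
for step in range(4):
    latent_action = int(rng.integers(NUM_LATENT_ACTIONS))  # user/agent input
    next_tokens = dynamics_model(history, latent_action)
    history.append(next_tokens)
    frame = detokenize(next_tokens)
    print(f"step {step}: action={latent_action}, frame mean={frame.mean():.3f}")
```

The key design point the sketch tries to convey is the interface: because actions live in a small learned discrete space rather than a domain-specific control scheme, the same loop applies to any prompt image, and an agent's behavior in unseen videos can be described in the same latent action vocabulary.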