iVideoGPT: Interactive VideoGPTs are Scalable World Models (2405.15223v3)

Published 24 May 2024 in cs.CV, cs.LG, and cs.RO

Abstract: World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.

Overview of Interactive VideoGPT: Bridging Generative Video Models and Interactive World Models

The paper introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework designed to address the challenges in utilizing generative video models for interactive world models. The primary contributions of the work are situated at the intersection of video generation, model-based reinforcement learning (MBRL), and multimodal integration of agents' sensory inputs. This framework facilitates an interactive experience for agents through next-token prediction, allowing them to imagine, reason, and plan within high-dimensional environments.

Key Contributions

  1. Compressive Tokenization Technique: The proposed iVideoGPT employs a novel compressive tokenization mechanism, which discretizes complex visual observations into a manageable sequence of tokens. By conditionally encoding visual frames based on temporal context, this approach achieves a significant reduction in token sequence length, leading to more efficient training and generation processes.
  2. Scalable Autoregressive Transformer: iVideoGPT leverages an autoregressive transformer architecture akin to that of LLMs, which allows flexible handling of multimodal signals, including visual frames, actions, and rewards, within a single token sequence (a minimal sketch of this interleaving appears after this list). The architecture's scalability enables pre-training on millions of trajectories, creating a broad foundation for interactive world models that can be adapted to a wide array of downstream tasks.
  3. Comprehensive Pre-Training: The iVideoGPT framework is pre-trained on a diverse dataset of human and robotic manipulation trajectories, totaling over one million sequences. This extensive pre-training equips the model with generalizable knowledge about physical interactions, which can be fine-tuned for specific tasks such as action-conditioned video prediction, visual planning, and visual model-based reinforcement learning.
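
To make the sequence design concrete, here is a minimal sketch of how visual observations, actions, and rewards can be flattened into a single token stream for next-token prediction. The tokenizer outputs, special tokens, and helper names are hypothetical placeholders for illustration, not iVideoGPT's actual token layout.

```python
# Minimal sketch: flattening a trajectory of (observation, action, reward)
# steps into one token sequence for next-token prediction.
# The special tokens and per-step token counts are illustrative assumptions,
# not the paper's actual vocabulary or layout.
from typing import List, Sequence

BOS, SEP = 0, 1  # assumed special tokens: sequence start and frame boundary

def flatten_trajectory(
    obs_tokens_per_frame: Sequence[Sequence[int]],  # visual tokens for each frame
    action_tokens: Sequence[int],                   # one discrete token per step
    reward_tokens: Sequence[int],                   # one discrete token per step
) -> List[int]:
    """Interleave per-frame observation tokens with action and reward tokens."""
    seq: List[int] = [BOS]
    for t, frame_tokens in enumerate(obs_tokens_per_frame):
        seq.extend(frame_tokens)          # visual tokens for frame t
        if t < len(action_tokens):
            seq.append(action_tokens[t])  # action taken after observing frame t
        if t < len(reward_tokens):
            seq.append(reward_tokens[t])  # reward received at step t
        seq.append(SEP)                   # frame boundary marker
    return seq

# Example: 3 frames with 4 visual tokens each, and 2 action/reward steps.
tokens = flatten_trajectory(
    [[10, 11, 12, 13], [14, 15, 16, 17], [18, 19, 20, 21]],
    action_tokens=[100, 101],
    reward_tokens=[200, 201],
)
print(tokens)  # a causal transformer is then trained to predict each next token
```

The role of the compressive tokenizer is to keep the number of visual tokens per predicted frame small by conditioning on context frames, so that sequences like the one above remain tractable even for long rollouts.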

Numerical Results and Experimental Evaluation

The experimental results presented in the paper showcase the effectiveness of iVideoGPT across various metrics and datasets:

  • Video Prediction:

iVideoGPT delivers competitive results on established benchmarks such as the BAIR robot pushing and RoboNet datasets, as measured by FVD, PSNR, SSIM, and LPIPS. The model's ability to incorporate actions into the video prediction workflow enhances both its interactivity and its performance, which is particularly notable in action-conditioned scenarios.
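
For context on these metrics: PSNR and SSIM compare predicted frames with ground truth directly, while LPIPS and FVD rely on pretrained networks. The snippet below is an illustrative way to compute frame-level PSNR and SSIM with scikit-image; it is not the paper's evaluation code, and LPIPS/FVD are omitted because they require pretrained feature extractors.

```python
# Illustrative only: frame-level PSNR and SSIM for a predicted video versus
# ground truth, using scikit-image. Not the paper's evaluation pipeline.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: uint8 arrays of shape (T, H, W, C) holding video frames."""
    psnr_vals, ssim_vals = [], []
    for p, g in zip(pred, gt):
        psnr_vals.append(peak_signal_noise_ratio(g, p, data_range=255))
        # channel_axis requires scikit-image >= 0.19; older versions use multichannel=True
        ssim_vals.append(structural_similarity(g, p, channel_axis=-1, data_range=255))
    return {"psnr": float(np.mean(psnr_vals)), "ssim": float(np.mean(ssim_vals))}

# Example with synthetic frames; real use would load model predictions and references.
rng = np.random.default_rng(0)
video_gt = rng.integers(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
noise = rng.integers(-5, 6, size=video_gt.shape)
video_pred = np.clip(video_gt.astype(np.int32) + noise, 0, 255).astype(np.uint8)
print(frame_metrics(video_pred, video_gt))
```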

  • Visual Planning:

The model's performance on the VP² benchmark, which evaluates video prediction models for visual model-predictive control, underscores its robustness. iVideoGPT outperforms many baselines in specific Robosuite and RoboDesk tasks, demonstrating its applicability in control tasks where accurate and realistic predictions are crucial.
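
Conceptually, sampling-based visual MPC of the kind this benchmark targets scores candidate action sequences with the video model and executes the first action of the best-scoring sequence. The sketch below is a minimal random-shooting planner under assumed interfaces; `world_model.rollout` and the pixel-space `goal_cost` are illustrative stand-ins, not iVideoGPT's API.

```python
# Minimal random-shooting visual MPC sketch. The world model's rollout
# interface and the pixel-space cost are assumptions for illustration.
import numpy as np

def goal_cost(frame: np.ndarray, goal: np.ndarray) -> float:
    """Simple pixel-space cost: mean squared error to a goal image."""
    return float(np.mean((frame.astype(np.float32) - goal.astype(np.float32)) ** 2))

def plan_action(world_model, obs, goal, horizon=10, n_candidates=64, action_dim=4):
    """Score sampled action sequences with the video model; return the best first action."""
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    costs = []
    for actions in candidates:
        frames = world_model.rollout(obs, actions)  # assumed: returns (horizon, H, W, C) frames
        costs.append(goal_cost(frames[-1], goal))   # cost of the final predicted frame
    return candidates[int(np.argmin(costs))][0]

class DummyWorldModel:
    """Placeholder standing in for a learned action-conditioned video predictor."""
    def rollout(self, obs, actions):
        return np.repeat(obs[None], len(actions), axis=0)  # trivially repeats the current frame

obs = np.zeros((64, 64, 3), dtype=np.uint8)
goal = np.full((64, 64, 3), 255, dtype=np.uint8)
print(plan_action(DummyWorldModel(), obs, goal))  # a single action vector of length 4
```

Practical planners typically refine the sampling distribution over several iterations (e.g., with the cross-entropy method) rather than sampling once, but the structure is the same.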

  • Visual Model-Based Reinforcement Learning:

The visual MBRL experiments on Meta-World tasks highlight the remarkable sample efficiency of iVideoGPT-enabled algorithms. The model-based approach, which uses iVideoGPT to generate synthetic rollouts, outperforms model-free alternatives and achieves results comparable to state-of-the-art latent-imagination methods such as DreamerV3.
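
The loop behind these gains can be sketched in a Dyna/MBPO-style fashion: short imagined rollouts branched from real states augment the agent's replay buffer. The interfaces below (`world_model.step`, `agent.act`) are hypothetical stand-ins, not the paper's training code.

```python
# Sketch of replay-buffer augmentation with short imagined rollouts from a
# learned world model. All interfaces here are illustrative assumptions.
import random

def augment_with_imagined_rollouts(world_model, agent, real_buffer, model_buffer,
                                   n_rollouts=32, rollout_length=5):
    """Branch short model rollouts from states sampled out of the real replay buffer."""
    for _ in range(n_rollouts):
        obs = random.choice(real_buffer)["obs"]  # start from a real observation
        for _ in range(rollout_length):
            action = agent.act(obs)                                 # assumed policy interface
            next_obs, reward, done = world_model.step(obs, action)  # assumed model interface
            model_buffer.append({"obs": obs, "action": action, "reward": reward,
                                 "next_obs": next_obs, "done": done})
            if done:
                break
            obs = next_obs

# The agent is then updated on a mixture of real and imagined transitions, which is
# where the sample-efficiency gains over purely model-free training come from.
```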

Practical and Theoretical Implications

The practical implications of this work are substantial. By enabling a more efficient and scalable approach to building interactive world models, iVideoGPT represents a significant step forward in the application of generative video models to real-world decision-making tasks:

  • Model-Based Learning Efficiency:

The ability to pre-train on large, diverse datasets and fine-tune efficiently for specific tasks can drastically reduce the requirement for extensive data collection in new environments. This is particularly advantageous in robotics and autonomous systems where real-world trials can be costly and time-consuming.

  • Generalization and Adaptation:

iVideoGPT's ability to generalize from human manipulation datasets to diverse robotic contexts highlights the model's potential for robust transfer learning. This capability is crucial for developing versatile agents that perform consistently across various environments and tasks.

Future Directions

The promising results from iVideoGPT pave the way for several future research directions:

  • Scaling and Diverse Applications:

Further scaling of the architecture and pre-training on more diverse, Internet-scale datasets could enhance the model's generalizability and performance. This would be especially relevant for applications in complex, real-world scenarios such as autonomous driving and general-purpose robotics.

  • Enhancements in Tokenization:

Investigating alternative tokenization strategies that maintain high fidelity while further reducing computational overhead could lead to even more efficient training and inference processes. Improvements in this area might also enhance the model's ability to handle higher-resolution inputs and more complex scenarios.

  • Integration with Other Modalities:

Extending the multimodal capabilities of iVideoGPT to include additional sensory inputs such as audio and haptic feedback could broaden the range of applications and improve the model's performance in environments where multisensory integration is critical.

Overall, iVideoGPT represents a significant advancement in combining the strengths of autoregressive transformers, generative video models, and interactive world modeling, providing a robust foundation for future research and practical applications in MBRL and beyond.

Authors (7)
  1. Jialong Wu
  2. Shaofeng Yin
  3. Ningya Feng
  4. Xu He
  5. Dong Li
  6. Jianye Hao
  7. Mingsheng Long