Video as the New Language for Real-World Decision Making (2402.17139v1)
Abstract: Both text and video data are abundant on the internet and support large-scale self-supervised learning through next-token or next-frame prediction. However, they have not been equally leveraged: LLMs have had significant real-world impact, whereas video generation has remained largely limited to media entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. Moreover, we demonstrate how, like LLMs, video generation models can serve as planners, agents, compute engines, and environment simulators through techniques such as in-context learning, planning, and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work demonstrating that such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that hinder progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside LLMs in a wider array of AI applications.