Towards Generalist Robot Learning from Internet Video: A Survey (2404.19664v4)
Abstract: Scaling deep learning to massive, diverse internet data has yielded remarkably general capabilities in visual and natural language understanding and generation. In robotics, however, data has remained scarce and challenging to collect, and robot learning has consequently struggled to obtain similarly general capabilities. Promising Learning from Videos (LfV) methods aim to address this robotics data bottleneck by augmenting traditional robot data with large-scale internet video data. This video data offers broad foundational information regarding physical behaviour and the underlying physics of the world, and thus can be highly informative for a generalist robot. In this survey, we present a thorough overview of the emerging field of LfV. We outline fundamental concepts, including the benefits and challenges of LfV. We then comprehensively review current methods for extracting knowledge from large-scale internet video, for addressing key challenges in LfV, and for boosting downstream robot and reinforcement learning with video data. The survey concludes with a critical discussion of challenges and opportunities in LfV. Here, we advocate for scalable foundation model approaches that can leverage the full range of available internet video to improve the learning of robot policies and dynamics models. We hope this survey can inform and catalyse further LfV research, driving progress towards the development of general-purpose robots.
- Moonvalley - animate your ideas. https://moonvalley.ai/. Accessed: 2024-04-04.
- Pika - empowering creativity. https://pika.art/home. Accessed: 2024-04-04.
- A definition of continual reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Language reward modulation for pretraining reinforcement learning. arXiv preprint arXiv:2308.12270, 2023.
- Robel: Robotics benchmarks for learning with low-cost robots. In Conference on robot learning, pages 1300–1313. PMLR, 2020.
- Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, 2022.
- Autort: Embodied foundation models for large scale orchestration of robotic agents. arXiv preprint arXiv:2401.12963, 2024.
- Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023.
- Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
- Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
- Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Affordances in robotic tasks–a survey. arXiv preprint arXiv:2004.07400, 2020.
- A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
- Vivit: A video vision transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021.
- A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500, 2021.
- Playing hard exploration games by watching youtube. In Neural Information Processing Systems, 2018.
- Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
- Test of time: Instilling video-language models with a sense of time. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2503–2516, 2023.
- Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022.
- Affordances from human videos as a versatile representation for robotics. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 01–13, 2023.
- Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
- Video pretraining (vpt): Learning to act by watching unlabeled online videos. arXiv preprint arXiv:2206.11795, 2022.
- Memory consolidation enables long-context video understanding. arXiv preprint arXiv:2402.05861, 2024.
- Videocon: Robust video-language alignment via contrast captions. arXiv preprint arXiv:2311.10111, 2023.
- Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
- V-jepa: Latent video prediction for visual representation learning. 2023.
- Vision-language models as a source of rewards. arXiv preprint arXiv:2312.09187, 2023.
- The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Causal deep learning. arXiv preprint arXiv:2303.02186, 2023.
- Is space-time attention all you need for video understanding? In International Conference on Machine Learning, 2021.
- Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
- Towards generalizable zero-shot manipulation via translating human interaction plans. arXiv preprint arXiv:2312.00775, 2023.
- Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
- Robotic offline rl from internet videos via value-function pre-training. arXiv preprint arXiv:2309.13041, 2023.
- Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023a.
- Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023b.
- Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
- Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023b.
- Robocat: A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023.
- Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Genie: Generative interactive environments. 2024. URL https://api.semanticscholar.org/CorpusID:267897982.
- Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
- What makes pre-trained visual representations successful for robust manipulation? arXiv preprint arXiv:2312.12444, 2023.
- nuscenes: A multimodal dataset for autonomous driving. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2020.
- Groot: Learning to follow instructions by watching gameplay videos. arXiv preprint arXiv:2310.08235, 2023.
- Beyond fine-tuning: Transferring behavior in reinforcement learning. arXiv preprint arXiv:2102.13515, 2021.
- Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
- Recent advances in robot learning from demonstration. Annual Review of Control, Robotics, and Autonomous Systems, 3:297–330, 2020.
- Learning video-conditioned policies for unseen manipulation tasks. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 909–916, 2023.
- Maskgit: Masked generative image transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11305–11315, 2022a.
- Semantic visual navigation by watching youtube videos. arXiv preprint arXiv:2006.10034, 2020.
- Learning value functions from undirected state-only experience. arXiv preprint arXiv:2204.12458, 2022b.
- Look ma, no hands! Agent-environment factorization of egocentric videos. arXiv preprint arXiv:2305.16301, 2023.
- Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions. In Conference on Robot Learning, pages 3909–3928. PMLR, 2023.
- Learning generalizable robotic reward functions from “in-the-wild” human videos. arXiv preprint arXiv:2103.16817, 2021a.
- Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023a.
- Video chatcaptioner: Towards the enriched spatiotemporal descriptions. arXiv preprint arXiv:2304.04227, 2023b.
- Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023c.
- An empirical study of training self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9620–9629, 2021b.
- Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. Advances in Neural Information Processing Systems, 36, 2024.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
- Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
- Holistic analysis of hallucination in gpt-4v(ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023.
- From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022.
- Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
- Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018.
- Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Epic-kitchens visor benchmark: Video segmentations and object relations. Advances in Neural Information Processing Systems, 35:13745–13758, 2022.
- An unbiased look at datasets for visuo-motor pre-training. arXiv preprint arXiv:2310.09289, 2023.
- Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- Stochastic video generation with a learned prior. In International conference on machine learning, pages 1174–1183. PMLR, 2018.
- Mose: A new dataset for video object segmentation in complex scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20224–20234, 2023a.
- Clip4mc: An rl-friendly vision-language model for minecraft. arXiv preprint arXiv:2303.10571, 2023b.
- Compositional generative modeling: A single model is not all you need. arXiv preprint arXiv:2402.01103, 2024.
- Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111, 2023a.
- Video language planning. arXiv preprint arXiv:2310.10625, 2023b.
- Vision-language models as success detectors. arXiv preprint arXiv:2303.07280, 2023c.
- Perceptual values from observation. arXiv preprint arXiv:1905.07861, 2019.
- Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.
- Estimating q(s, s’) with deep deterministic dynamics gradients. In International Conference on Machine Learning, 2020.
- Video prediction models as rewards for reinforcement learning. arXiv preprint arXiv:2305.14343, 2023.
- Taming transformers for high-resolution image synthesis. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12868–12878, 2021.
- Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
- Learning by watching: A review of video-based learning approaches for robot manipulation. arXiv preprint arXiv:2402.07127, 2024.
- Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
- D4rl: Datasets for deep data-driven reinforcement learning, 2020.
- Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3170–3180, 2022.
- Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.
- Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pages 102–118. Springer, 2022.
- Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023.
- Reinforcement learning from passive data via latent intentions. In International Conference on Machine Learning, pages 11321–11339. PMLR, 2023.
- Actionvlad: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 971–980, 2017.
- Omnimae: Single model masked pretraining on images and videos. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10406–10417, 2023.
- The “something something” video database for learning and evaluating visual common sense. 2017 IEEE International Conference on Computer Vision (ICCV), pages 5843–5851, 2017.
- Ego4d: Around the world in 3,000 hours of egocentric video. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18973–18990, 2022.
- Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. arXiv preprint arXiv:2311.18259, 2023.
- Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018.
- Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.
- “Task success” is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors. arXiv preprint arXiv:2402.04210, 2024.
- Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894, 2022.
- Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023.
- Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
- Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
- Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- General-purpose, long-context autoregressive modeling with perceiver ar. In International Conference on Machine Learning, pages 8535–8558. PMLR, 2022.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
- Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
- Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
- Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
- Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
- Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022.
- Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023a.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023b.
- For pre-trained vision models in motor control, not all policy learning methods are created equal. In International Conference on Machine Learning, 2023c.
- Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023a.
- Diffusion reward: Learning rewards via conditional video diffusion. arXiv preprint arXiv:2312.14134, 2023b.
- Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning, 2022.
- How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 40(4-5):698–721, 2021.
- Eric Jang. All neural networks, all autonomous, all 1x speed. https://www.1x.tech/discover/all-neural-networks-all-autonomous-all-1x-speed, Feb 2024. Accessed: 2024-04-10.
- When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253, 2019.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Object-centric slot diffusion. arXiv preprint arXiv:2303.10834, 2023.
- Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023.
- Exploring visual pre-training for robot manipulation: Datasets, models and methods. arXiv preprint arXiv:2308.03620, 2023.
- Gradient-based planning with world models. arXiv preprint arXiv:2312.17227, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Cotracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.
- Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023.
- Voila: Visual-observation-only imitation learning for autonomous navigation. 2022 International Conference on Robotics and Automation (ICRA), pages 2497–2503, 2022.
- Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982–987, 2023a.
- A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925, 2023b.
- The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Droid: A large-scale in-the-wild robot manipulation dataset. 2024.
- Giving robots a hand: Learning generalizable manipulation with eye-in-hand human video demonstrations. arXiv preprint arXiv:2307.05959, 2023.
- Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- A survey of zero-shot generalisation in deep reinforcement learning. Journal of Artificial Intelligence Research, 76:201–264, 2023.
- Motif: Intrinsic motivation from artificial intelligence feedback. arXiv preprint arXiv:2310.00166, 2023.
- Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023.
- Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
- Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.
- Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- A review of robot learning for manipulation: Challenges, representations, and algorithms. Journal of machine learning research, 22(30):1–82, 2021.
- The darpa robotics challenge finals: Results and perspectives. The DARPA robotics challenge finals: Humanoid robots to the rescue, pages 1–26, 2018.
- Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- Graph inverse reinforcement learning from diverse videos. In Conference on Robot Learning, 2022.
- Robohive: A unified framework for robot learning. Advances in Neural Information Processing Systems, 36, 2024.
- The nethack learning environment. Advances in Neural Information Processing Systems, 33:7671–7684, 2020.
- Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
- In-context reinforcement learning with algorithm distillation. arXiv preprint arXiv:2210.14215, 2022.
- Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1), 2022.
- Supervised pretraining can learn in-context reinforcement learning. arXiv preprint arXiv:2306.14892, 2023.
- Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022a.
- Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022b.
- Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
- Unmasked teacher: Towards training-efficient video foundation models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 19891–19903, 2023c.
- Videomamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
- Composing ensembles of pre-trained models via iterative consensus. arXiv preprint arXiv:2210.11522, 2022c.
- Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
- Steve-1: A generative model for text-to-behavior in minecraft. arXiv preprint arXiv:2306.00937, 2023.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
- Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 2024a.
- Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023a.
- World model on million-length video and language with blockwise ringattention. 2024b. URL https://api.semanticscholar.org/CorpusID:268385142.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
- Visual instruction tuning. Advances in neural information processing systems, 36, 2024c.
- Learning to identify critical states for reinforcement learning from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1955–1965, 2023c.
- Plan your target and learn your skills: Transferable state-only imitation learning via decoupled policy optimization. In International Conference on Machine Learning, 2022.
- Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.
- Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
- Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
- Learning latent plans from play. In Conference on robot learning, pages 1113–1132. PMLR, 2020.
- Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
- Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022.
- Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning, 2023.
- Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Chris A Mack. Fifty years of moore’s law. IEEE Transactions on Semiconductor Manufacturing, 24(2):202–207, 2011.
- Where are we in the search for an artificial visual cortex for embodied intelligence? arXiv preprint arXiv:2303.18240, 2023.
- Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
- Dexvip: Learning dexterous grasping with human hand pose priors from video. arXiv preprint arXiv:2202.00164, 2022.
- Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
- Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023.
- Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2630–2640, 2019.
- A survey on video prediction: From deterministic to generative approaches. arXiv preprint arXiv:2401.14718, 2024.
- Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14871–14881, 2021.
- Gordon E Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998.
- Embodiedgpt: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023.
- Shaping embodied agent behavior with activity-context priors from egocentric video. arXiv preprint arXiv:2110.07692, 2021.
- Learning audio-video modalities from image captions. In European Conference on Computer Vision, pages 407–426. Springer, 2022.
- R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, 2022.
- Pathways language model (palm): Scaling to 540 billion parameters for breakthrough performance. Google AI Blog, 2022.
- Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
- A simple recipe for contrastively pre-training video-first encoders beyond 16 frames. arXiv preprint arXiv:2312.07395, 2023.
- The unsurprising effectiveness of pre-trained vision models for control. In International Conference on Machine Learning, 2022.
- Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- Hiql: Offline goal-conditioned rl with latent states as actions. Advances in Neural Information Processing Systems, 36, 2024.
- Curiosity-driven exploration by self-supervised prediction. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 488–489, 2017.
- Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019.
- Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- Sfv: Reinforcement learning of physical skills from videos. ACM Transactions on Graphics (TOG), 37:178, 2018.
- Keyframing the future: Keyframe discovery for visual prediction and planning. In Conference on Learning for Dynamics & Control, 2019.
- Cross-domain transfer via semantic skill imitation. In Conference on Robot Learning, 2022.
- Robot learning. Springer Handbook of Robotics, pages 357–398, 2016.
- Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.
- A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023.
- Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, 2021.
- From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation. IEEE Robotics and Automation Letters, 7:10873–10881, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, 2022.
- A generalist agent. Transactions on Machine Learning Research, 2022.
- Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022.
- Frankmocap: Fast monocular 3d hand and body motion capture by regression and integration. arXiv preprint arXiv:2008.08324, 2020.
- U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
- Latent plans for task-agnostic offline reinforcement learning. In Conference on Robot Learning, 2022.
- Learning what you can do before doing anything. In International Conference on Learning Representations, 2018.
- Clockwork variational autoencoders. Advances in Neural Information Processing Systems, 34:29246–29257, 2021.
- Self-supervised learning for videos: A survey. ACM Computing Surveys, 55(13s):1–37, 2023.
- Learning predictive models from observation and interaction. In European Conference on Computer Vision, 2019.
- Reinforcement learning with videos: Combining offline observations with interaction. arXiv preprint arXiv:2011.06507, 2020.
- Learning to act without actions. arXiv preprint arXiv:2312.10812, 2023.
- Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
- Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588:604–609, 2020.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Reinforcement learning with action-free pre-training from videos. arXiv preprint arXiv:2203.13880, 2022a.
- Harp: Autoregressive latent video prediction with high-fidelity image generator. 2022 IEEE International Conference on Image Processing (ICIP), pages 3943–3947, 2022b.
- Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE international conference on robotics and automation (ICRA), pages 1134–1141. IEEE, 2018.
- Robovqa: Multimodal long-horizon reasoning for robotics. arXiv preprint arXiv:2311.00899, 2023.
- On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.
- Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on robot learning, pages 492–504. PMLR, 2023.
- Understanding human hands in contact at internet scale. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9866–9875, 2020.
- Self-supervised disentangled representation learning for third-person imitation learning. 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 214–221, 2021.
- Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40:1419–1434, 2020.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, 2022.
- Graph-structured visual imitation. In Conference on Robot Learning, pages 979–989. PMLR, 2020.
- What do we learn from a large-scale study of pre-trained visual representations in sim and real environments? arXiv preprint arXiv:2310.02219, 2023.
- Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014.
- Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. arXiv preprint arXiv:2202.10448, 2022.
- Alice Sjöberg. How many videos are there on youtube? https://www.dexerto.com/entertainment/how-many-videos-are-there-on-youtube-2197264/, December 2023.
- Avid: Learning multi-stage tasks via pixel-level translation of human videos. arXiv preprint arXiv:1912.04443, 2019.
- Introducing rfm-1: Giving robots human-like reasoning capabilities. https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/, March 2024. Accessed: 2024-03-29.
- Roboclip: One demonstration is enough to learn robot policies. arXiv preprint arXiv:2310.07899, 2023.
- Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852. PMLR, 2015.
- Third-person imitation learning. arXiv preprint arXiv:1703.01703, 2017.
- Learning video representations from textual web supervision. arXiv preprint arXiv:2007.14937, 2020.
- Preventing mode collapse when imitating latent policies from observations. 2022.
- Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990.
- Reinforcement learning: An introduction. MIT press, 2018.
- Semantic exploration from language abstractions and pretrained representations. arXiv preprint arXiv:2204.05080, 2022.
- Video understanding with large language models: A survey. arXiv preprint arXiv:2312.17432, 2023.
- Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023a.
- Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Octo: An open-source generalist robot policy, 2023b.
- Plex: Making the most of the available data for robotic manipulation pretraining. arXiv preprint arXiv:2303.08789, 2023.
- Behavior priors for efficient reinforcement learning. Journal of Machine Learning Research, 23(221):1–68, 2022.
- Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
- Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
- Video-guided skill discovery. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023.
- Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022.
- Behavioral cloning from observation. In International Joint Conference on Artificial Intelligence, 2018a.
- Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018b.
- Recent advances in imitation learning from observation. In International Joint Conference on Artificial Intelligence, 2019.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018.
- Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- High fidelity video prediction with large stochastic recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
- Generating videos with scene dynamics. Advances in neural information processing systems, 29, 2016.
- Vrl3: A data-driven framework for visual deep reinforcement learning. arXiv preprint arXiv:2202.10324, 2022a.
- Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023a.
- Manipulate by seeing: Creating manipulation controllers from pre-trained representations. arXiv preprint arXiv:2303.08135, 2023b.
- Videomae v2: Scaling video masked autoencoders with dual masking. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023c.
- Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6312–6322, 2022b.
- Optimal goal-reaching reinforcement learning via quasimetric learning. In International Conference on Machine Learning, pages 36411–36430. PMLR, 2023d.
- Magicvideo-v2: Multi-stage high-aesthetic video generation. arXiv preprint arXiv:2401.04468, 2024a.
- Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024b.
- Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023e.
- Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023f.
- Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022c.
- Wayve. Lingo-1: Exploring natural language for autonomous driving. https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/, 2024. Accessed: 2024-04-04.
- Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019.
- Any-point trajectory modeling for policy learning. arXiv preprint arXiv:2401.00025, 2023.
- Learning 3d particle-based simulators from rgb-d videos. arXiv preprint arXiv:2312.05359, 2023.
- Greedy hierarchical variational autoencoders for large-scale video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2318–2328, 2021.
- Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023a.
- Pre-training contextualized world models with in-the-wild videos for reinforcement learning. arXiv preprint arXiv:2305.18499, 2023b.
- Slotformer: Unsupervised visual dynamics simulation with object-centric models. arXiv preprint arXiv:2210.05861, 2022.
- Compositional transfer in hierarchical reinforcement learning. arXiv preprint, 2019.
- Foundations for transfer in reinforcement learning: A taxonomy of knowledge modalities. arXiv preprint arXiv:2312.01939, 2023.
- Towards generalist robots: A promising paradigm via generative simulation. 2023. URL https://api.semanticscholar.org/CorpusID:259202431.
- Decomposing the generalization gap in imitation learning for visual robotic manipulation. arXiv preprint arXiv:2307.03659, 2023.
- Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 2024.
- Learning by watching: Physical imitation of manipulation skills from human videos. 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834, 2021.
- Robotube: Learning household manipulation from human videos with simulated twin environments. In Conference on Robot Learning, 2022.
- Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8121–8130, 2022.
- Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Conference on Empirical Methods in Natural Language Processing, 2021.
- Xskill: Cross embodiment skill discovery. arXiv preprint arXiv:2307.09955, 2023.
- Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
- Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022.
- Videococa: Video-text modeling with zero-shot transfer from contrastive captioners. 2022. URL https://api.semanticscholar.org/CorpusID:254535696.
- Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
- Temporally consistent transformers for video generation. In International Conference on Machine Learning, pages 39062–39098. PMLR, 2023.
- Robot fine-tuning made easy: Pre-training rewards and policies for autonomous real-world reinforcement learning. arXiv preprint arXiv:2310.15145, 2023a.
- Dichotomy of control: Separating what you can control from what you cannot. arXiv preprint arXiv:2210.13435, 2022.
- Probabilistic adaptation of text-to-video models. arXiv preprint arXiv:2306.01872, 2023b.
- Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023c.
- Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129, 2023d.
- Video as the new language for real-world decision making. arXiv preprint arXiv:2402.17139, 2024.
- Aim: Adapting image models for efficient video action recognition. arXiv preprint arXiv:2302.03024, 2023e.
- React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- Improving sample efficiency in model-free reinforcement learning from images. In AAAI Conference on Artificial Intelligence, 2019.
- Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.
- Don’t change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022.
- Foundation reinforcement learning: towards embodied generalist agents with foundation prior assistance. arXiv preprint arXiv:2310.02635, 2023.
- Visual imitation made easy. In Conference on Robot Learning, 2020.
- Magvit: Masked generative video transformer. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10459–10469, 2023.
- Language model beats diffusion – tokenizer is key to visual generation. 2023a. URL https://api.semanticscholar.org/CorpusID:263830733.
- Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
- Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023b.
- Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023c.
- General flow as foundation affordance for scalable robot learning. arXiv preprint arXiv:2401.11439, 2024.
- Dmotion: Robotic visuomotor control with unsupervised forward model learned from videos. 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7135–7142, 2021.
- Xirl: Cross-embodiment inverse reinforcement learning. In Conference on Robot Learning, 2021.
- Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
- Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
- Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024.
- Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023b.
- Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. In European Conference on Computer Vision, pages 127–145. Springer, 2022.
- Videoprism: A foundational visual encoder for video understanding. 2024. URL https://api.semanticscholar.org/CorpusID:267760035.
- What makes representation learning from videos hard for control? 2022. URL https://api.semanticscholar.org/CorpusID:252635608.
- Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE symposium series on computational intelligence (SSCI), pages 737–744. IEEE, 2020.
- 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024.
- Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
- Stereo magnification. ACM Transactions on Graphics (TOG), 37:1–12, 2018.
- Plas: Latent action space for offline reinforcement learning. In Conference on Robot Learning, pages 1719–1735. PMLR, 2021.
- Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023a.
- Guiding online reinforcement learning with action-free offline pretraining. arXiv preprint arXiv:2301.12876, 2023b.
- Robot parkour learning. arXiv preprint arXiv:2309.05665, 2023.
- Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.