Behavior Generation with Latent Actions (2403.03181v2)
Abstract: Generative modeling of complex behaviors from labeled datasets has been a longstanding problem in decision making. Unlike language or image generation, decision making requires modeling actions: continuous-valued vectors that are multimodal in their distribution, potentially drawn from uncurated sources, where generation errors can compound in sequential prediction. A recent class of models called Behavior Transformers (BeT) addresses this by discretizing actions using k-means clustering to capture different modes. However, k-means struggles to scale to high-dimensional action spaces or long sequences and lacks gradient information, so BeT suffers in modeling long-range actions. In this work, we present Vector-Quantized Behavior Transformer (VQ-BeT), a versatile model for behavior generation that handles multimodal action prediction, conditional generation, and partial observations. VQ-BeT augments BeT by tokenizing continuous actions with a hierarchical vector quantization module. Across seven environments including simulated manipulation, autonomous driving, and robotics, VQ-BeT improves on state-of-the-art models such as BeT and Diffusion Policies. Importantly, we demonstrate VQ-BeT's improved ability to capture behavior modes while speeding up inference 5x over Diffusion Policies. Videos and code can be found at https://sjlee.cc/vq-bet
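The abstract contrasts discretizing actions with a single k-means codebook against tokenizing them hierarchically. As a rough illustration of the coarse-to-fine idea only, the sketch below fits one k-means codebook on toy 2-D actions and a second one on the leftover residuals, so each action becomes a pair of tokens. This is not the authors' implementation: it uses plain k-means at both levels (and so also lacks gradient information), and every name in it (`kmeans`, `fit_two_level`, `encode`, `decode`, the toy data) is a hypothetical choice made for this example.

```python
import random

def nearest(codebook, x):
    """Index of the codebook vector closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length vectors."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(centers, p)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous center
                centers[i] = [sum(c) / len(bucket) for c in zip(*bucket)]
    return centers

def sub(x, y):
    return [a - b for a, b in zip(x, y)]

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def fit_two_level(actions, k_coarse, k_fine):
    """Fit a coarse codebook on actions, then a fine codebook on the residuals."""
    coarse = kmeans(actions, k_coarse, seed=0)
    residuals = [sub(a, coarse[nearest(coarse, a)]) for a in actions]
    fine = kmeans(residuals, k_fine, seed=1)
    return coarse, fine

def encode(action, coarse, fine):
    """Map a continuous action to a (coarse token, fine token) pair."""
    i = nearest(coarse, action)
    j = nearest(fine, sub(action, coarse[i]))
    return i, j

def decode(tokens, coarse, fine):
    """Reconstruct an approximate action from its two tokens."""
    i, j = tokens
    return add(coarse[i], fine[j])

def mse(xs, ys):
    """Mean squared reconstruction error over a dataset."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, y))
               for x, y in zip(xs, ys)) / len(xs)

# Toy bimodal "actions": two modes near (-1, 0) and (+1, 0) (assumed data).
rng = random.Random(42)
actions = [[rng.choice([-1.0, 1.0]) + rng.gauss(0, 0.2), rng.gauss(0, 0.2)]
           for _ in range(200)]

coarse, fine = fit_two_level(actions, k_coarse=2, k_fine=4)

# Compare reconstruction from the coarse token alone vs. both tokens.
coarse_only = [coarse[nearest(coarse, a)] for a in actions]
two_level = [decode(encode(a, coarse, fine), coarse, fine) for a in actions]
err_coarse = mse(actions, coarse_only)
err_two_level = mse(actions, two_level)
print(f"coarse-only MSE: {err_coarse:.4f}, two-level MSE: {err_two_level:.4f}")
```

In this toy setup the coarse token picks out the behavior mode and the fine token refines the action within it, so a sequence model could predict two small categorical variables instead of regressing a continuous vector directly.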
Authors:
- Seungjae Lee
- Yibin Wang
- Haritheja Etukuru
- H. Jin Kim
- Nur Muhammad Mahi Shafiullah
- Lerrel Pinto