ROSO: Improving Robotic Policy Inference via Synthetic Observations (2311.16680v2)
Abstract: In this paper, we propose the use of generative AI to improve the zero-shot performance of a pre-trained policy by altering observations during inference. Modern robotic systems, powered by advanced neural networks, have demonstrated remarkable capabilities on pre-trained tasks. However, generalizing and adapting to new objects and environments is challenging, and fine-tuning visuomotor policies is time-consuming. To overcome these issues we propose Robotic Policy Inference via Synthetic Observations (ROSO). ROSO uses Stable Diffusion to pre-process a robot's observation of novel objects at inference time so that it fits within the observation distribution of the pre-trained policy. This paradigm allows us to transfer learned knowledge from known tasks to previously unseen scenarios, enhancing the robot's adaptability without requiring lengthy fine-tuning. Our experiments show that incorporating generative AI into robotic inference significantly improves successful outcomes, completing up to 57% of tasks that would otherwise fail under the pre-trained policy alone.
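The abstract describes an inference-time loop: detect whether the target object lies outside the policy's training distribution, and if so, use a generative editor (Stable Diffusion in the paper) to synthesize an observation containing a known substitute object before running the pre-trained policy. A minimal sketch of that control flow is below; all names (`roso_infer`, `edit`, `policy`, the substitution lookup) are illustrative stand-ins, not the authors' actual API, and the real `edit` would be a diffusion-based image-editing call rather than a stub.

```python
def roso_infer(observation, target_object, known_objects, edit, policy):
    """Sketch of the ROSO idea: if the target object was never seen
    during training, synthesize an edited observation in which a
    semantically similar known object replaces it, then run the
    unmodified pre-trained policy on the edited observation."""
    if target_object not in known_objects:
        # Stand-in for a similarity lookup (e.g. CLIP text/image
        # similarity) that maps the unseen object to a known one.
        substitute = known_objects[0]
        # Stand-in for the Stable Diffusion editing step that rewrites
        # the pixels of the novel object into the substitute.
        observation = edit(observation, target_object, substitute)
        target_object = substitute
    return policy(observation, target_object)


# Toy usage with stub components standing in for the diffusion editor
# and the visuomotor policy:
def fake_edit(obs, novel, known):
    return obs.replace(novel, known)

def fake_policy(obs, target):
    return f"pick({target})" if target in obs else "fail"

action = roso_infer("scene with mug-v2", "mug-v2", ["mug"],
                    fake_edit, fake_policy)
```

The key design point this sketch captures is that the policy itself is never fine-tuned; only its input is rewritten to look in-distribution.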