RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation (2402.15487v2)
Abstract: We introduce the novel task of interactive scene exploration, wherein robots autonomously explore environments and produce an action-conditioned scene graph (ACSG) that captures the structure of the underlying environment. The ACSG accounts for both low-level information (geometry and semantics) and high-level information (action-conditioned relationships between different entities) in the scene. To this end, we present the Robotic Exploration (RoboEXP) system, which incorporates a Large Multimodal Model (LMM) and an explicit memory design to enhance our system's capabilities. The robot reasons about what to explore and how to explore it, accumulating new information through the interaction process and incrementally constructing the ACSG. Leveraging the constructed ACSG, we illustrate the effectiveness and efficiency of our RoboEXP system in facilitating a wide range of real-world manipulation tasks involving rigid and articulated objects, nested objects, and deformable objects.
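To make the ACSG concrete, below is a minimal Python sketch of one plausible graph structure, assuming nodes that store low-level geometry and semantics and edges that encode either directly observable or action-conditioned relations. The class and method names (`Node`, `Edge`, `ACSG`, `actions_to_reach`) are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of an action-conditioned scene graph (ACSG), in the spirit
# of the abstract: nodes hold low-level geometry and semantics, while edges
# encode relations that may hold only after an action (e.g., a bowl is
# "inside" a cabinet, observable only once the cabinet is opened).
# All names here are illustrative, not the authors' API.
from dataclasses import dataclass


@dataclass
class Node:
    """One scene entity: a semantic label plus a geometric summary."""
    name: str                # semantic label, e.g. "cabinet"
    geometry: object = None  # placeholder for geometry (point cloud, AABB, ...)


@dataclass
class Edge:
    """A relation between two entities, optionally gated by an action."""
    src: str
    dst: str
    relation: str        # e.g. "on", "inside", "belongs_to"
    action: str = None   # e.g. "open"; None means directly observable


class ACSG:
    """Action-conditioned scene graph, built incrementally during exploration."""

    def __init__(self):
        self.nodes = {}  # name -> Node
        self.edges = []  # list of Edge

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, edge):
        self.edges.append(edge)

    def actions_to_reach(self, name):
        """Actions that must be executed to expose an entity (one hop only)."""
        return [e.action for e in self.edges
                if e.dst == name and e.action is not None]


# Incremental construction: the robot opens a cabinet, discovers a bowl inside,
# and records that the bowl is reachable only through the "open" action.
graph = ACSG()
graph.add_node(Node("cabinet"))
graph.add_node(Node("bowl"))
graph.add_edge(Edge("cabinet", "bowl", relation="inside", action="open"))
print(graph.actions_to_reach("bowl"))  # -> ['open']
```

The toy example at the bottom mirrors the incremental loop the abstract describes: an interaction (opening the cabinet) reveals a new entity, and both the entity and the action gating access to it are folded into the graph for later manipulation planning.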