Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models (2309.15278v3)
Abstract: Robots need a memory of previously observed, but currently occluded, objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks, including reasoning about occluded objects, novel object appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well across varying numbers of objects and distractor actions. Furthermore, we show our approaches outperform an implicit memory baseline.
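To make the notion of "object-oriented memory" concrete, the sketch below shows one minimal way a robot could retain occluded objects and flag novel or reappearing ones. This is an illustrative toy only, not the paper's DOOM/LOOM method: the class and method names (`ObjectMemory`, `update`, `query`) are hypothetical, and the real approaches use transformer relational dynamics over partial-view point clouds rather than a plain dictionary of last-known states.

```python
class ObjectMemory:
    """Toy object-oriented memory: keeps last-known state for every object
    ever observed. Objects missing from the current frame are retained and
    marked 'occluded'; previously occluded objects seen again are marked
    'reappeared'; never-seen ids are 'novel'."""

    def __init__(self):
        # object_id -> (last_known_position, currently_visible)
        self._store = {}

    def update(self, detections):
        """detections: dict mapping object_id -> position for this frame.
        Returns a dict of per-object events triggered by the frame."""
        events = {}
        for oid, pos in detections.items():
            if oid not in self._store:
                events[oid] = "novel"
            elif not self._store[oid][1]:
                events[oid] = "reappeared"
            self._store[oid] = (pos, True)
        # Mark objects that just disappeared as occluded, keeping their state.
        for oid, (pos, visible) in self._store.items():
            if oid not in detections and visible:
                self._store[oid] = (pos, False)
                events[oid] = "occluded"
        return events

    def query(self, oid):
        """Return (last_known_position, currently_visible) for an object."""
        return self._store[oid]


mem = ObjectMemory()
mem.update({"mug": (0.1, 0.2), "box": (0.5, 0.5)})  # both objects are novel
mem.update({"box": (0.6, 0.5)})                      # mug becomes occluded
mem.update({"mug": (0.2, 0.2), "box": (0.6, 0.5)})   # mug reappears
print(mem.query("mug"))
```

The key design point, shared with the paper's setting, is that disappearance does not mean deletion: the memory distinguishes "not currently visible" from "does not exist", which is what enables planning over unobserved objects.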