Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation (2312.03327v1)

Published 6 Dec 2023 in cs.CV

Abstract: Given an object of interest, visual navigation aims to reach the object's location based on a sequence of partial observations. To this end, an agent needs to 1) learn knowledge about the relations among object categories in the world during training and 2) look for the target object based on the pre-learned object category relations and its own moving trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term spatial-temporal dependencies of objects that help navigation. We learn prior knowledge of object layout by establishing a category relation graph, which is used to deduce the positions of specific objects. Subsequently, we introduce TSR to capture the relationships of objects across time, space, and regions within the observation trajectories. Specifically, we propose a Temporal attention module (T) to model the temporal structure of the observation sequence, which implicitly encodes the historical movement or trajectory information. Then, a Spatial attention module (S) is used to uncover the spatial context of the objects in the current observation based on the category relation graph and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Based on the visual representation extracted by our method, the agent can better perceive the environment and more easily learn a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The code has been included in the supplementary material and will be made publicly available.
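
The abstract describes a Temporal attention module that weighs past observations when encoding the current one. As a rough illustration of that general idea only, the sketch below applies plain scaled dot-product attention from a current feature vector over a short history of feature vectors. It is not the paper's CRG-TSR implementation: the function names `softmax` and `temporal_attention`, the two-dimensional toy features, and the use of the current observation as the query are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, history):
    """Scaled dot-product attention of one query vector over past feature vectors.

    query:   feature vector of the current observation (hypothetical).
    history: list of feature vectors from earlier steps, same length as query.
    Returns the attention weights and the weighted sum of the history.
    """
    d = len(query)
    # Similarity of the current observation to each past observation.
    scores = [
        sum(q * k for q, k in zip(query, obs)) / math.sqrt(d)
        for obs in history
    ]
    weights = softmax(scores)
    # Context vector: history features averaged by their attention weights.
    context = [
        sum(w * obs[i] for w, obs in zip(weights, history))
        for i in range(d)
    ]
    return weights, context


# Toy trajectory: three past observations with 2-D features (made-up values).
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
current = [1.0, 0.0]
weights, context = temporal_attention(current, history)
```

In this toy setup, the past observations that align with the current feature vector receive larger weights than the one that is orthogonal to it. A learned module would additionally project the query, keys, and values with trained matrices, which this sketch omits.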
