Lift, Splat, Map: Lifting Foundation Masks for Label-Free Semantic Scene Completion (2407.03425v1)

Published 3 Jul 2024 in cs.CV and cs.RO

Abstract: Autonomous mobile robots deployed in urban environments must be context-aware, i.e., able to distinguish between different semantic entities, and robust to occlusions. Current approaches like semantic scene completion (SSC) require pre-enumerating the set of classes and costly human annotations, while representation learning methods relax these assumptions but are not robust to occlusions and learn representations tailored towards auxiliary tasks. To address these limitations, we propose LSMap, a method that lifts masks from visual foundation models to predict a continuous, open-set semantic and elevation-aware representation in bird's eye view (BEV) for the entire scene, including regions underneath dynamic entities and in occluded areas. Our model only requires a single RGBD image, does not require human labels, and operates in real time. We quantitatively demonstrate that, with finetuning, our approach outperforms existing models trained from scratch on semantic and elevation scene completion tasks. Furthermore, we show that our pre-trained representation outperforms existing visual foundation models at unsupervised semantic scene completion. We evaluate our approach using CODa, a large-scale, real-world urban robot dataset. Supplementary visualizations, code, data, and pre-trained models will be publicly available soon.
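The "lift and splat" step the abstract describes (unprojecting per-pixel features from a single RGBD frame and pooling them into a ground-plane BEV grid) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `splat_to_bev`, the pinhole unprojection, the grid parameters, and the mean-pooling aggregation are all illustrative assumptions.

```python
import numpy as np

def splat_to_bev(depth, feats, K, grid_res=0.5, grid_size=8):
    """Unproject per-pixel features using depth and intrinsics K (pinhole
    model), then average-pool ("splat") them into a ground-plane BEV grid.

    depth: (H, W) metric depth; feats: (H, W, C) per-pixel features.
    Returns a (grid_size, grid_size, C) BEV feature map.
    """
    H, W = depth.shape
    # Pixel grid -> camera-frame 3D coordinates (x lateral, z forward).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    z = depth
    # Ground-plane cell indices; lateral axis is centered on the camera.
    ix = np.floor(x / grid_res).astype(int) + grid_size // 2
    iz = np.floor(z / grid_res).astype(int)
    valid = (depth > 0) & (ix >= 0) & (ix < grid_size) \
                        & (iz >= 0) & (iz < grid_size)
    C = feats.shape[-1]
    bev = np.zeros((grid_size, grid_size, C))
    count = np.zeros((grid_size, grid_size, 1))
    # Scatter-add features and hit counts, then normalize to a mean.
    np.add.at(bev, (iz[valid], ix[valid]), feats[valid])
    np.add.at(count, (iz[valid], ix[valid]), 1.0)
    return bev / np.maximum(count, 1.0)

# Toy usage: a flat scene at 1 m depth with constant 2-channel features.
depth = np.ones((4, 4))
feats = np.ones((4, 4, 2))
K = np.array([[2.0, 0.0, 2.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])
bev = splat_to_bev(depth, feats, K)
```

In the actual method, the splatted features would come from a visual foundation model (e.g., per-pixel mask embeddings) rather than raw constants, and the BEV decoder would then complete occluded regions; this sketch only shows the geometric projection.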
