Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery (2403.11812v1)
Abstract: We present a neural radiance field method for urban-scale semantic and building-level instance segmentation from aerial images by lifting noisy 2D labels to 3D. The problem is challenging for two main reasons. First, objects in urban aerial images, such as buildings, cars, and roads, vary substantially in size, which makes accurate 2D segmentation difficult. Second, the 2D labels generated by existing segmentation methods are inconsistent across views, especially for aerial images, where each image captures only a small portion of the entire scene. To overcome these limitations, we first introduce a scale-adaptive semantic label fusion strategy that improves the segmentation of objects of varying sizes by combining labels predicted from different altitudes, using the novel-view synthesis capability of NeRF. We then introduce a cross-view instance label grouping strategy based on the 3D scene representation to mitigate the multi-view inconsistency of the 2D instance labels. Furthermore, we exploit depth priors reconstructed from multiple views to improve the geometric quality of the radiance field, which in turn improves the segmentation results. Experiments on multiple real-world urban-scale datasets show that our approach outperforms existing methods.
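As a rough illustration of what lifting noisy 2D labels to 3D can look like, the sketch below composites per-sample class logits along one camera ray with standard NeRF volume-rendering weights and scores the result against a noisy 2D label. It assumes the common semantic-field formulation, not the paper's exact losses or fusion strategy; the function names, the toy ray, and the three-class layout are hypothetical.

```python
import numpy as np

def render_weights(sigma, delta):
    """Standard NeRF compositing weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance T_i: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def render_semantics(sigma, delta, logits):
    """Composite per-sample class logits (N, C) into per-pixel class probabilities (C,)."""
    w = render_weights(sigma, delta)
    pixel_logits = (w[:, None] * logits).sum(axis=0)
    e = np.exp(pixel_logits - pixel_logits.max())  # numerically stable softmax
    return e / e.sum()

def label_loss(probs, noisy_label):
    """Cross-entropy between the rendered distribution and one noisy 2D label."""
    return -np.log(probs[noisy_label] + 1e-12)

# Toy ray (hypothetical values): 4 samples, 3 classes.
sigma = np.array([0.0, 0.1, 5.0, 5.0])   # densities; the surface is near sample 2
delta = np.full(4, 0.5)                  # spacing between consecutive samples
logits = np.array([[0, 0, 0],
                   [0, 0, 0],
                   [4, 0, 0],            # sample on the surface favours class 0
                   [0, 4, 0]], dtype=float)

weights = render_weights(sigma, delta)
probs = render_semantics(sigma, delta, logits)
loss = label_loss(probs, noisy_label=0)
print(probs.argmax(), float(loss))
```

Because every pixel that observes the same surface point supervises the same 3D logits, labels that disagree across views are averaged out during optimization, which is the general intuition behind lifting-style methods.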
Authors: Yuqi Zhang, Guanying Chen, Jiaxing Chen, Shuguang Cui