Multi-task 3D building understanding with multi-modal pretraining (2306.10146v1)
Abstract: This paper explores various learning strategies for 3D building type classification and part segmentation on the BuildingNet dataset. ULIP with PointNeXt and PointNeXt segmentation are extended for the classification and segmentation task on BuildingNet dataset. The best multi-task PointNeXt-s model with multi-modal pretraining achieves 59.36 overall accuracy for 3D building type classification, and 31.68 PartIoU for 3D building part segmentation on validation split. The final PointNeXt XL model achieves 31.33 PartIoU and 22.78 ShapeIoU on test split for BuildingNet-Points segmentation, which significantly improved over PointNet++ model reported from BuildingNet paper, and it won the 1st place in the BuildingNet challenge at CVPR23 StruCo3D workshop.
- 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019.
- Synthetic 3d data generation pipeline for geometric deep learning in architecture. arXiv preprint arXiv:2104.12564, 2021.
- Benjamin Graham and Laurens Van der Maaten. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307, 2017.
- Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
- Scenenet: understanding real world indoor scenes with synthetic data. arxiv preprint (2015). arXiv preprint arXiv:1511.07041, 3, 2015.
- What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.
- Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022.
- Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019.
- Slip: Self-supervision meets language-image pre-training. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, pages 529–544. Springer, 2022.
- Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
- Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Buildingnet: Learning to label 3d buildings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10397–10407, 2021.
- Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
- Vitruvio: 3d building meshes via single perspective sketches. arXiv preprint arXiv:2210.13634, 2022.
- Ulip: Learning unified representation of language, image and point cloud for 3d understanding. arXiv preprint arXiv:2212.05171, 2022.
- Ulip-2: Towards scalable multimodal pre-training for 3d understanding. arXiv preprint arXiv:2305.08275, 2023.
- Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.
- Shicheng Xu (36 papers)