Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation (2403.10001v1)
Abstract: Unsupervised domain adaptation (UDA) is vital for reducing the workload of labeling 3D point cloud data and for coping with the absence of labels in a newly defined domain. Various methods that use images to improve cross-domain 3D segmentation have recently emerged. However, the pseudo labels, which are generated by models trained on the source domain and provide additional supervision for the unseen domain, are inherently noisy, which makes them inadequate for 3D segmentation and limits the accuracy of the trained networks. With the advent of 2D visual foundation models (VFMs) and their abundant prior knowledge, we propose a novel pipeline, VFMSeg, that leverages these models to further enhance the cross-modal unsupervised domain adaptation framework. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first use a multi-modal VFM, pre-trained on large-scale image-text pairs, to provide supervised labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM, trained on fine-grained 2D masks, is adopted to guide the generation of semantically augmented images and point clouds that mix data from the source and target domains in the manner of view frustums (FrustumMixing), which enhances the performance of the neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets, and the results demonstrate a significant improvement on the 3D segmentation task.
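The final step of the abstract, merging class-wise predictions across modalities, can be illustrated with a minimal sketch. The abstract does not specify the merging rule, so the version below is an assumption rather than the VFMSeg procedure: it averages per-point class probabilities from a 2D branch and a 3D branch and keeps the most likely class only when the fused confidence clears a threshold. The names `merge_predictions`, `IGNORE`, and `threshold` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence

# Hypothetical marker for points that receive no pseudo label.
IGNORE = -100


def merge_predictions(
    probs_2d: Sequence[Sequence[float]],
    probs_3d: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[int]:
    """Fuse per-point class probabilities from two modalities.

    Assumed rule (not taken from the paper): average the two probability
    vectors, take the argmax, and drop points whose fused confidence is
    below `threshold`.
    """
    labels = []
    for p2, p3 in zip(probs_2d, probs_3d):
        # Element-wise mean of the 2D and 3D class probabilities.
        fused = [(a + b) / 2.0 for a, b in zip(p2, p3)]
        # Index of the most likely class after fusion.
        best = max(range(len(fused)), key=fused.__getitem__)
        labels.append(best if fused[best] >= threshold else IGNORE)
    return labels


# Toy example: three points, two classes.
probs_2d = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
probs_3d = [[0.7, 0.3], [0.1, 0.9], [0.45, 0.55]]
pseudo = merge_predictions(probs_2d, probs_3d, threshold=0.6)
```

In this toy input the two branches agree on the first point, disagree on the second (where the more confident 3D branch decides the fused label), and are both uncertain on the third, which is therefore left unlabeled.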
- Jingyi Xu
- Weidong Yang
- Lingdong Kong
- Youquan Liu
- Rui Zhang
- Qingyuan Zhou
- Ben Fei