Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation (2307.04231v1)
Abstract: Existing methods for cross-modal domain adaptation in 3D semantic segmentation predict results only through the 2D-3D complementarity obtained by cross-modal feature matching. However, because supervision is lacking in the target domain, this complementarity is not always reliable, and results degrade when the domain gap is large. To address the lack of supervision, we introduce masked modeling into this task and propose Mx2M, a method that uses masked cross-modality modeling to reduce the large domain gap. Mx2M has two components. The first is the core solution, cross-modal removal and prediction (xMRP), which lets Mx2M adapt to various scenarios and provides cross-modal self-supervision. The second is a new form of cross-modal feature matching, the dynamic cross-modal filter (DxMF), which ensures that the whole method dynamically uses more suitable 2D-3D complementarity. Evaluated on three domain adaptation (DA) scenarios (Day/Night, USA/Singapore, and A2D2/SemanticKITTI), Mx2M shows large improvements over previous methods on many metrics.
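The abstract does not spell out how xMRP removes content, so the following is only a generic illustration of the "removal" step in masked modeling, not the authors' implementation. It assumes a 2D image stored as a NumPy array and a 3D point cloud stored as an (N, 3) array; the function names `mask_image_patches` and `mask_points`, the patch size, and the masking ratios are all hypothetical. The "prediction" step, in which a network recovers the removed content, is not shown.

```python
import numpy as np


def mask_image_patches(image, patch=4, ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    Hypothetical helper: returns the masked image and a boolean (H, W)
    mask that is True where pixels were removed.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch  # number of whole patches per axis
    n_masked = int(round(ratio * gh * gw))
    chosen = rng.choice(gh * gw, size=n_masked, replace=False)

    patch_mask = np.zeros(gh * gw, dtype=bool)
    patch_mask[chosen] = True
    # Expand the per-patch mask to pixel resolution.
    grid = patch_mask.reshape(gh, gw).repeat(patch, axis=0).repeat(patch, axis=1)

    pixel_mask = np.zeros((h, w), dtype=bool)
    pixel_mask[: gh * patch, : gw * patch] = grid

    masked = image.copy()
    masked[pixel_mask] = 0  # removes all channels at masked pixels
    return masked, pixel_mask


def mask_points(points, ratio=0.3, rng=None):
    """Drop a random subset of points from an (N, 3) point cloud.

    Hypothetical helper: returns the kept points and a boolean (N,)
    mask that is True for removed points.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    n_removed = int(round(ratio * n))
    removed = np.zeros(n, dtype=bool)
    removed[rng.choice(n, size=n_removed, replace=False)] = True
    return points[~removed], removed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8, 3))
    cloud = rng.random((10, 3))
    masked_image, pixel_mask = mask_image_patches(image, rng=rng)
    kept_points, removed = mask_points(cloud, rng=rng)
    print(pixel_mask.sum(), "pixels removed;", removed.sum(), "points removed")
```

In a masked-modeling setup, a model would then be trained to predict the removed pixels or points from what remains, which gives a supervision signal that requires no labels in the target domain.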
Authors:
- Boxiang Zhang
- Zunran Wang
- Yonggen Ling
- Yuanyuan Guan
- Shenghao Zhang
- Wenhui Li