FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models (2312.17051v2)
Abstract: Few-shot class-incremental learning (FSCIL) aims to mitigate catastrophic forgetting when a model is incrementally trained on limited data. However, many existing methods do not exploit prior knowledge effectively, leaving them unable to handle the domain gap in 3D FSCIL and thus prone to catastrophic forgetting. The Contrastive Vision-Language Pre-Training (CLIP) model is a highly suitable backbone for 3D FSCIL because of its abundant shape-related prior knowledge. Unfortunately, applying it directly to 3D FSCIL still runs into an incompatibility between 3D data representations and 2D features, which manifests mainly as feature-space misalignment and significant noise. To address these challenges, we introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature-space misalignment and the Spatial Noise Compensator (SNC) for significant noise. RFE aligns the feature spaces of input point clouds and their embeddings by performing a unique dimensionality reduction on the feature space of pre-trained models (PTMs), eliminating redundant information without compromising semantic integrity. SNC, in turn, is a graph-based 3D model designed to capture robust geometric information within point clouds, compensating for the knowledge lost during projection, particularly when processing real-world scanned data. Moreover, traditional accuracy metrics are shown to be biased by the class imbalance in existing 3D datasets. We therefore propose the 3D FSCIL benchmark FSCIL3D-XL together with novel evaluation metrics that offer a more nuanced assessment of a 3D FSCIL model. Experimental results on both established and our proposed benchmarks demonstrate that our approach significantly outperforms existing state-of-the-art methods.
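The abstract only names the two components, so the following is a minimal sketch of the underlying ideas, not the authors' implementation: it assumes RFE can be approximated by a PCA-style reduction that drops low-variance (redundant) dimensions of frozen pre-trained embeddings, and SNC by a simple kNN-graph aggregation over point neighborhoods (in the spirit of EdgeConv-style graph models). All function names, dimensions, and the use of plain NumPy are illustrative assumptions.

```python
import numpy as np

def rfe_reduce(features, var_keep=0.95):
    """Assumed stand-in for RFE: project frozen PTM embeddings onto the
    top principal components retaining `var_keep` of the variance,
    discarding redundant dimensions."""
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered feature matrix gives the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(var_ratio), var_keep) + 1)
    return centered @ vt[:k].T  # (N, k) reduced embeddings

def snc_graph_features(points, k=16):
    """Assumed stand-in for SNC: for each point, average the offsets to its
    k nearest neighbors to capture local geometric structure that 2D
    projections of the point cloud may lose."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, 1:k + 1]      # skip the point itself
    neighbors = points[idx]                           # (N, k, 3)
    edge_feats = neighbors - points[:, None, :]       # relative offsets
    return edge_feats.mean(axis=1)                    # (N, 3) local geometry

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_feats = rng.normal(size=(100, 512))   # hypothetical frozen CLIP embeddings
    cloud = rng.normal(size=(1024, 3))         # hypothetical scanned point cloud
    print(rfe_reduce(clip_feats).shape, snc_graph_features(cloud).shape)
```

The intuition the sketch tries to convey matches the abstract: the reduction operates only on the pre-trained feature space (the backbone stays frozen), while the graph branch supplies geometric cues that depth or image projection would otherwise discard.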