Point-In-Context: Understanding Point Cloud via In-Context Learning (2404.12352v1)
Abstract: With the emergence of large-scale models trained on diverse datasets, in-context learning has become a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application to 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module, and we propose a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with both inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is handled by assigning each category a label point with XYZ coordinates; the final prediction for each output point is then the category whose label point lies closest to it. To overcome the limitation of this fixed label-coordinate assignment, which generalizes poorly to novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), which targets improved dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization both within and across part segmentation datasets. PIC is a general framework, so other tasks or datasets can be seamlessly incorporated through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multiple datasets. PIC-S is capable of generalizing to unseen datasets and performing novel part segmentation by customizing prompts.
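The label-point mechanism described in the abstract can be made concrete with a minimal sketch. Assuming a simple NumPy setting (the `label_points` array, the `predictions_to_labels` helper, and the specific coordinates below are illustrative assumptions, not the paper's implementation), each category is tied to a fixed XYZ label point and every predicted coordinate is decoded to the category of its nearest label point:

```python
import numpy as np

# Hypothetical label points: one XYZ coordinate per part category.
# The actual label-coordinate assignment in PIC-G is defined by the
# authors; the values here are placeholders for illustration only.
label_points = np.array([
    [0.0, 0.0, 0.0],  # category 0
    [1.0, 0.0, 0.0],  # category 1
    [0.0, 1.0, 0.0],  # category 2
    [0.0, 0.0, 1.0],  # category 3
])

def predictions_to_labels(pred_points: np.ndarray) -> np.ndarray:
    """Map each predicted XYZ point of shape (N, 3) to the index of its nearest label point."""
    # Pairwise squared distances between predictions (N, 3) and label points (C, 3).
    dists = np.sum((pred_points[:, None, :] - label_points[None, :, :]) ** 2, axis=-1)
    return np.argmin(dists, axis=1)  # shape (N,): one category index per output point

# Example: four predicted coordinates, each landing near one label point.
preds = np.array([
    [0.10, -0.05, 0.00],
    [0.90,  0.10, 0.00],
    [0.05,  0.95, 0.10],
    [0.00,  0.10, 0.80],
])
print(predictions_to_labels(preds))  # [0 1 2 3]
```

This sketch covers only the decoding step that turns coordinate predictions back into segmentation labels; it is this nearest-label-point rule that lets segmentation be expressed in the same coordinate-in, coordinate-out format as the other tasks.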