
Towards Unified Representation of Multi-Modal Pre-training for 3D Understanding via Differentiable Rendering (2404.13619v1)

Published 21 Apr 2024 in cs.MM

Abstract: State-of-the-art 3D models, which excel in recognition tasks, typically depend on large-scale datasets and well-defined category sets. Recent advances in multi-modal pre-training have demonstrated potential in learning 3D representations by aligning features from 3D shapes with their 2D RGB or depth counterparts. However, these existing frameworks often rely solely on either RGB or depth images, limiting their effectiveness in harnessing a comprehensive range of multi-modal data for 3D applications. To tackle this challenge, we present DR-Point, a tri-modal pre-training framework that learns a unified representation of RGB images, depth images, and 3D point clouds by pre-training with object triplets garnered from each modality. To address the scarcity of such triplets, DR-Point employs differentiable rendering to obtain various depth images. This approach not only augments the supply of depth images but also enhances the accuracy of reconstructed point clouds, thereby promoting the representation learning of the Transformer backbone. Subsequently, using a limited number of synthetically generated triplets, DR-Point effectively learns a 3D representation space that aligns seamlessly with the RGB-Depth image space. Our extensive experiments demonstrate that DR-Point outperforms existing self-supervised learning methods in a wide range of downstream tasks, including 3D object classification, part segmentation, point cloud completion, semantic segmentation, and detection. Additionally, our ablation studies validate the effectiveness of DR-Point in enhancing point cloud understanding.
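The tri-modal alignment the abstract describes — pulling a point cloud's embedding toward the embeddings of its paired RGB and depth images — is commonly realized as a symmetric contrastive (InfoNCE-style) objective over batches of triplets. The sketch below illustrates that idea in plain Python on toy embedding vectors; the function names, the equal weighting of the two modality pairs, and the temperature value are illustrative assumptions, not DR-Point's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: anchors[i] should match positives[i] more
    strongly than any other positive in the batch."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        # log-sum-exp over the batch, then subtract the matched pair's logit
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def tri_modal_loss(pc_emb, rgb_emb, depth_emb):
    """Align point-cloud embeddings with both image modalities.
    Equal weighting is a hypothetical choice for illustration."""
    return 0.5 * (info_nce(pc_emb, rgb_emb) + info_nce(pc_emb, depth_emb))
```

With correctly paired triplets the loss is near zero, while mismatched pairings drive it up — which is what pushes the three modalities into a shared embedding space during pre-training.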

Authors (5)
  1. Ben Fei (35 papers)
  2. Yixuan Li (183 papers)
  3. Weidong Yang (33 papers)
  4. Lipeng Ma (7 papers)
  5. Ying He (102 papers)
