COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval (2405.04103v1)

Published 7 May 2024 in cs.CV

Abstract: In this paper, we investigate the open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent features of 3D shapes, including depth, spatial hierarchy, and geometric continuity. To address this issue, we propose COM3D, the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance. Notably, we augment the 3D features through a scene representation transformer to generate cross-view correspondence features of 3D shapes, which enrich the inherent features and enhance their compatibility with text matching. Furthermore, we optimize the cross-modal matching process with semi-hard negative example mining to improve learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of COM3D, which achieves state-of-the-art results on the Text2Shape dataset.
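
The abstract describes semi-hard negative example mining only at a high level. Below is a minimal illustrative sketch of the standard semi-hard triplet mining strategy (in the sense of FaceNet-style metric learning) applied to a batch of matched shape/text embeddings. The function name, tensor layout, and `margin` default are assumptions for illustration, not the paper's actual implementation.

```python
import torch


def semi_hard_triplet_loss(shape_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           margin: float = 0.2) -> torch.Tensor:
    """Triplet loss with semi-hard negative mining (illustrative sketch).

    Assumes row i of `shape_emb` and `text_emb` is a matched (positive)
    shape/text pair; every other text in the batch is a candidate negative
    for shape i.
    """
    # Pairwise Euclidean distances between every shape and every text.
    dists = torch.cdist(shape_emb, text_emb)              # (B, B)
    pos = dists.diagonal().unsqueeze(1)                   # d(anchor, positive), (B, 1)

    eye = torch.eye(len(dists), dtype=torch.bool, device=dists.device)

    # Semi-hard negatives: farther from the anchor than the positive, but
    # still inside the margin, so the triplet term is non-zero yet stable.
    semi_hard = (dists > pos) & (dists < pos + margin) & ~eye

    # Hardest semi-hard negative per anchor; +inf marks anchors that have
    # no semi-hard candidate in this batch.
    neg = torch.where(semi_hard, dists, torch.full_like(dists, float("inf")))
    neg = neg.min(dim=1).values

    has_neg = torch.isfinite(neg)
    loss = torch.relu(pos.squeeze(1) - neg + margin)[has_neg]
    return loss.mean() if loss.numel() else dists.new_zeros(())
```

Selecting negatives that are farther than the positive but still within the margin avoids the degenerate solutions that mining only the hardest negatives can cause early in training, which is the learning-efficiency benefit the abstract alludes to.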

