
iSeg: Interactive 3D Segmentation via Interactive Attention (2404.03219v2)

Published 4 Apr 2024 in cs.CV and cs.GR

Abstract: We present iSeg, a new interactive technique for segmenting 3D shapes. Previous works have focused mainly on leveraging pre-trained 2D foundation models for 3D segmentation based on text. However, text may be insufficient for accurately describing fine-grained spatial segmentations. Moreover, achieving a consistent 3D segmentation using a 2D model is highly challenging, since occluded areas of the same semantic region may not be visible together from any 2D view. Thus, we design a segmentation method conditioned on fine user clicks, which operates entirely in 3D. Our system accepts user clicks directly on the shape's surface, indicating the inclusion or exclusion of regions from the desired shape partition. To accommodate various click settings, we propose a novel interactive attention module capable of processing different numbers and types of clicks, enabling the training of a single unified interactive segmentation model. We apply iSeg to a myriad of shapes from different domains, demonstrating its versatility and faithfulness to the user's specifications. Our project page is at https://threedle.github.io/iSeg/.

Authors (5)
  1. Itai Lang
  2. Fei Xu
  3. Dale Decatur
  4. Sudarshan Babu
  5. Rana Hanocka

Summary

Interactive 3D Segmentation via Interactive Attention

The paper introduces iSeg, a novel approach to interactive 3D shape segmentation driven by user clicks directly on the shape's surface. The method circumvents the limitations of 2D foundation models when applied to 3D segmentation by operating natively in 3D space. Traditional 3D segmentation methods depend heavily on datasets with pre-defined semantic parts, which constrains their applicability. iSeg addresses these challenges by operating on the mesh itself, enabling user-directed segmentations of diverse shapes without requiring exhaustive pre-determined labeling.

Methodological Contributions

iSeg is built from two core components: an encoder that distills features from a 2D segmentation model into a mesh-specific feature field (MFF), and a decoder that combines this feature field with user clicks to predict the desired segmentation. The key advancement is the interactive attention module, which processes a variable number of clicks, both positive and negative, to steer the segmentation. This allows a single unified model to adapt to diverse user interaction patterns.
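To make the decoder's mechanism concrete, the sketch below shows one plausible way an interactive attention block could condition per-vertex features on a variable-length set of positive and negative clicks: each click embedding is tagged with a learned click-type embedding, and every vertex attends over the click set. All dimensions, layer choices, and names here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Illustrative attention block conditioning per-vertex mesh features
    on a variable number of user clicks (hypothetical architecture)."""

    def __init__(self, feat_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Learned embeddings distinguish click types (0 = negative, 1 = positive).
        self.click_type_emb = nn.Embedding(2, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, vertex_feats, click_feats, click_types):
        # vertex_feats: (B, V, D) per-vertex features from the encoder
        # click_feats:  (B, K, D) features sampled at the clicked vertices
        # click_types:  (B, K)    0/1 labels for exclusion/inclusion clicks
        clicks = click_feats + self.click_type_emb(click_types)
        # Each vertex attends over the (variable-length) set of clicks,
        # so the same module handles any number of clicks K.
        attended, _ = self.attn(query=vertex_feats, key=clicks, value=clicks)
        return self.norm(vertex_feats + attended)

B, V, K, D = 2, 500, 3, 128
module = InteractiveAttention(feat_dim=D)
out = module(torch.randn(B, V, D), torch.randn(B, K, D),
             torch.randint(0, 2, (B, K)))
print(out.shape)  # torch.Size([2, 500, 128])
```

Because attention is permutation-invariant over the key set, this design naturally accommodates different click counts without retraining, which matches the paper's stated goal of a single unified interactive model.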

Training iSeg involves distilling semantic features from a pre-trained 2D model while ensuring those features remain coherent and consistent across views, since the model operates entirely in the 3D domain. During training, user-specified regions are projected into 2D views, and supervision is obtained from a powerful pre-trained 2D backbone. This strategic reuse of pre-trained resources allows iSeg to segment regions that are difficult or impossible to delineate through text descriptions alone.
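The supervision scheme can be sketched as a projection-then-compare loss: per-vertex segmentation probabilities are rendered into a 2D view and scored against a mask from the 2D backbone. The simple index-gather "projection" below is an assumption for illustration; a real pipeline would use a differentiable mesh renderer.

```python
import torch
import torch.nn.functional as F

def distillation_loss(vertex_probs, vertex_to_pixel, target_mask_2d):
    """Illustrative distillation loss (hypothetical helper, not the
    paper's exact formulation).

    vertex_probs:    (V,) predicted inclusion probability per vertex
    vertex_to_pixel: (P,) index of the visible vertex at each of P pixels
    target_mask_2d:  (P,) 0/1 mask produced by the 2D backbone
    """
    # "Render" the 3D prediction into the view by gathering the
    # probability of the vertex visible at each pixel.
    projected = vertex_probs[vertex_to_pixel]
    # Supervise the rendered probabilities with the 2D teacher mask.
    return F.binary_cross_entropy(projected, target_mask_2d)

V, P = 100, 64
probs = torch.rand(V)                       # toy per-vertex predictions
idx = torch.randint(0, V, (P,))             # toy visibility map
target = (torch.rand(P) > 0.5).float()      # toy 2D teacher mask
loss = distillation_loss(probs, idx, target)
```

Because the loss is computed in 2D but the predictions live on the mesh, gradients flow back to a single 3D representation, which is what enforces consistency across the many views used during training.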

Empirical Evaluation

The empirical evaluation confirms iSeg's versatility and fidelity in segmenting 3D models across diverse domains, from humanoids to complex animals and manufactured objects. iSeg shows notable improvements in stability and consistency over 2D-centric methods, attributable primarily to its direct operation on 3D data, which inherently sidesteps occlusion problems and the need to enforce coherence across multiple viewpoints.

Regarding practical implications, iSeg's interactive capability makes it a tool poised to enhance workflows in 3D modeling environments where user-driven modifications to mesh segments are common. The potential applications in CAD modeling, animation, and virtual reality environments underscore its significance.

Theoretical Implications and Future Directions

The paper also carries theoretical implications for the fusion of 2D and 3D data. By developing a method to distill 2D features into a consistent 3D representation, this work lays groundwork for future research in 3D segmentation. It opens pathways to explore how interactive attention models can be further enhanced, perhaps through more sophisticated modeling of user intent, or by extending beyond simple clicks to gestures and other interactive modalities.

Future developments may include enhancing the robustness of iSeg to operate seamlessly across a broader range of mesh complexities and vertex densities. Further investigations could also delve into optimizing the computational efficiency of the system, especially in scenarios with extremely large 3D models, or adapting the system for concurrent multi-user interactive environments.

In summary, the paper presents a technically proficient method that extends the capabilities of interactive 3D segmentation by integrating flexible user interaction and reusing pre-trained 2D foundation models, providing a substantive advance in both practical application and theoretical exploration in the field.
