GARField: Group Anything with Radiance Fields (2401.09419v1)

Published 17 Jan 2024 in cs.CV and cs.GR

Abstract: Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/


Summary

  • The paper presents a method to decompose 3D scenes into semantically meaningful, hierarchical groups using scale-conditioned radiance fields.
  • It refines 2D segmentation masks into a volumetric affinity field, effectively resolving overlapping and conflicting labels.
  • The approach demonstrates strong multi-view consistency and holds promise for applications in 3D asset extraction and dynamic scene understanding.

The Concept of Hierarchical Grouping in 3D Scenes

In 3D scene understanding, the ability to separate and group the components of a scene is critical, yet grouping is inherently ambiguous: a single entity may be regarded as a standalone object or as part of a larger ensemble, depending on the granularity desired. The core challenge is to build a cohesive 3D representation that accommodates these different levels of grouping.

Introducing GARField

Addressing this challenge, the researchers introduce GARField (Group Anything with Radiance Fields), a method for decomposing a 3D scene into semantically meaningful hierarchical groups. GARField optimizes a scale-conditioned 3D affinity feature field, which allows a single point in the scene to belong to different groups at different scales. Consequently, GARField can represent an excavator as a complete unit while also identifying its individual components, such as the wheels and cab. The method distills a set of 2D segmentation masks from Segment Anything (SAM) into this volumetric scale-conditioned affinity field, resolving overlapping and conflicting labels along the way.
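To make scale conditioning concrete, here is a minimal, self-contained sketch in NumPy. It is not the paper's implementation (GARField trains a neural feature field from posed images); instead, a hypothetical toy field interpolates each point's feature between a fine-scale value and a shared coarse-scale value, so that a "wheel" and a "cab" form separate groups at small scale but one group at large scale. All names and numbers are invented for illustration:

```python
import numpy as np

def toy_field(point_id, s):
    """Toy scale-conditioned feature field (invented values).

    At scale s = 0 the wheel and cab get distinct features (separate
    groups); at s = 1 they share a single "whole excavator" feature.
    """
    fine = {"wheel": np.array([0.0, 1.0]), "cab": np.array([1.0, 0.0])}
    coarse = np.array([0.5, 0.5])          # shared coarse-scale feature
    w = min(max(s, 0.0), 1.0)              # clamp blend weight to [0, 1]
    return (1 - w) * fine[point_id] + w * coarse

def same_group(a, b, s, margin=0.5):
    """Two points group together at scale s if their features are close."""
    return np.linalg.norm(toy_field(a, s) - toy_field(b, s)) < margin
```

Here affinity is simply a thresholded feature distance; in the actual method the field is optimized from SAM masks and can be queried at arbitrary 3D points and scales.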

Overcoming the Challenges

One notable hurdle is that segmentation masks from different viewpoints can disagree: the same 3D point may fall inside conflicting masks. GARField resolves this through scale conditioning. It optimizes a dense 3D feature field so that the distance between two points' features mirrors their affinity, and it uses the physical scale of each mask to fuse conflicting masks consistently. As a result, GARField delivers more detailed, higher-fidelity groupings than the input masks, and does so consistently across viewpoints.
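The optimization just described can be sketched as a margin-based pull/push objective on rendered features. This is a simplification of the paper's loss, with an assumed Euclidean margin; `same_mask` is a hypothetical flag indicating whether two pixels fall inside the same mask at the mask's scale:

```python
import numpy as np

def pull_push_loss(f_i, f_j, same_mask, margin=1.0):
    """Contrastive loss sketch for a pair of rendered feature vectors.

    Points in the same mask are pulled together (squared distance);
    points in different masks are pushed apart, but only up to the
    margin, after which the hinge term vanishes.
    """
    d = np.linalg.norm(f_i - f_j)
    if same_mask:
        return d ** 2                        # pull: minimize distance
    return max(margin - d, 0.0) ** 2         # push: hinge up to margin
```

Because masks are conditioned on scale, the same pair of points can be pulled together at one scale and pushed apart at a finer one without the objectives conflicting.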

Hierarchical Scene Decomposition and Practical Applications

After optimizing the affinity field, GARField extracts a hierarchical decomposition of the 3D scene by clustering at descending scales: a top-down tree construction recursively subdivides each group into finer subparts. In evaluations on diverse in-the-wild scenes, the method captures object hierarchies effectively and presents them consistently from multiple viewpoints, from clusters of objects down to individual subparts. This capability could advance downstream applications such as 3D asset extraction and dynamic scene understanding.
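The top-down tree construction can be sketched as recursive clustering at decreasing scales. The sketch below substitutes a greedy single-linkage grouping for the density-based clustering the paper uses, and runs on an invented two-point toy field; it recurses only when a group actually splits at the finer scale:

```python
import numpy as np

def field(p, s):
    """Toy scale-conditioned field (invented values): 'wheel' and 'cab'
    share a feature at coarse scale and diverge at fine scale."""
    fine = {"wheel": np.array([0.0, 1.0]), "cab": np.array([1.0, 0.0])}
    return (1 - s) * fine[p] + s * np.array([0.5, 0.5])

def cluster(points, s, margin=0.5):
    """Greedy grouping: a point joins the first group whose
    representative feature at scale s lies within `margin`
    (a stand-in for the paper's density-based clustering)."""
    groups = []
    for p in points:
        for g in groups:
            if np.linalg.norm(field(p, s) - field(g[0], s)) < margin:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def build_tree(points, s, step=0.5):
    """Top-down hierarchy: re-cluster at a smaller scale and recurse
    into each subgroup whenever the group actually splits."""
    node = {"scale": s, "points": points, "children": []}
    if s - step >= 0:
        for sub in cluster(points, s - step):
            if len(sub) < len(points):
                node["children"].append(build_tree(sub, s - step, step))
    return node
```

On this toy field, `build_tree(["wheel", "cab"], 1.0)` produces a root node (the whole excavator at coarse scale) that splits into two leaf groups at the finer scale, mirroring the coarse-to-fine hierarchy described above.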

The researchers conclude by suggesting future directions: alternative cues for resolving group ambiguity, such as object affordances, could further refine the model's groupings. Such improvements would move the field closer to 3D scene understanding tools that handle the complexity of real-world environments.
