GARField: Group Anything with Radiance Fields (2401.09419v1)

Published 17 Jan 2024 in cs.CV and cs.GR

Abstract: Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/


Summary

  • The paper presents a method to decompose 3D scenes into semantically meaningful, hierarchical groups using scale-conditioned radiance fields.
  • It refines 2D segmentation masks into a volumetric affinity field, effectively resolving overlapping and conflicting labels.
  • The approach demonstrates strong multi-view consistency and holds promise for applications in 3D asset extraction and dynamic scene understanding.

The Concept of Hierarchical Grouping in 3D Scenes

In 3D scene understanding, the ability to separate and group the components of a scene is critical, yet grouping is inherently ambiguous: a single entity may be regarded as a standalone object or as part of a larger ensemble, depending on the granularity desired. The core challenge is to build a cohesive 3D representation that accommodates these different levels of grouping.

Introducing GARField

Addressing this challenge, the researchers introduce GARField (Group Anything with Radiance Fields), a method for decomposing a 3D scene into semantically meaningful hierarchical groups. GARField optimizes a scale-conditioned 3D affinity feature field, which allows a single point in the scene to belong to different groups at different scales. Consequently, GARField can represent an excavator as a complete unit while also identifying its individual components, such as the wheels and cab. The method distills a set of 2D segmentation masks from Segment Anything (SAM) into this volumetric scale-conditioned affinity field, resolving overlapping and conflicting labels along the way.
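To make scale conditioning concrete, here is a minimal, self-contained sketch in NumPy. It is not the paper's implementation (GARField trains a neural feature field from posed images); instead, a hypothetical toy field interpolates each point's feature between a fine-scale value and a shared coarse-scale value, so that a "wheel" and a "cab" form separate groups at small scale but one group at large scale. All names and numbers are invented for illustration:

```python
import numpy as np

def toy_field(point_id, s):
    """Toy scale-conditioned feature field (invented values).

    At scale s = 0 the wheel and cab get distinct features (separate
    groups); at s = 1 they share a single "whole excavator" feature.
    """
    fine = {"wheel": np.array([0.0, 1.0]), "cab": np.array([1.0, 0.0])}
    coarse = np.array([0.5, 0.5])          # shared coarse-scale feature
    w = min(max(s, 0.0), 1.0)              # clamp blend weight to [0, 1]
    return (1 - w) * fine[point_id] + w * coarse

def same_group(a, b, s, margin=0.5):
    """Two points group together at scale s if their features are close."""
    return np.linalg.norm(toy_field(a, s) - toy_field(b, s)) < margin
```

Here affinity is simply a thresholded feature distance; in the actual method the field is optimized from SAM masks and can be queried at arbitrary 3D points and scales.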

Overcoming the Challenges

One notable hurdle is that segmentation masks from different viewpoints can disagree: the same 3D point may fall inside conflicting masks. GARField resolves this through scale conditioning. It optimizes a dense 3D feature field so that the distance between two points' features mirrors their affinity, and it uses the physical scale of each mask to fuse conflicting masks consistently. As a result, GARField delivers more detailed, higher-fidelity groupings than the input masks, and does so consistently across viewpoints.
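The optimization just described can be sketched as a margin-based pull/push objective on rendered features. This is a simplification of the paper's loss, with an assumed Euclidean margin; `same_mask` is a hypothetical flag indicating whether two pixels fall inside the same mask at the mask's scale:

```python
import numpy as np

def pull_push_loss(f_i, f_j, same_mask, margin=1.0):
    """Contrastive loss sketch for a pair of rendered feature vectors.

    Points in the same mask are pulled together (squared distance);
    points in different masks are pushed apart, but only up to the
    margin, after which the hinge term vanishes.
    """
    d = np.linalg.norm(f_i - f_j)
    if same_mask:
        return d ** 2                        # pull: minimize distance
    return max(margin - d, 0.0) ** 2         # push: hinge up to margin
```

Because masks are conditioned on scale, the same pair of points can be pulled together at one scale and pushed apart at a finer one without the objectives conflicting.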

Hierarchical Scene Decomposition and Practical Applications

After optimizing the affinity field, GARField extracts a hierarchical decomposition of the 3D scene by clustering at descending scales: a top-down tree construction recursively subdivides each group into finer subparts. In evaluations on diverse in-the-wild scenes, the method captures object hierarchies effectively and presents them consistently from multiple viewpoints, from clusters of objects down to individual subparts. This capability could advance downstream applications such as 3D asset extraction and dynamic scene understanding.
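The top-down tree construction can be sketched as recursive clustering at decreasing scales. The sketch below substitutes a greedy single-linkage grouping for the density-based clustering the paper uses, and runs on an invented two-point toy field; it recurses only when a group actually splits at the finer scale:

```python
import numpy as np

def field(p, s):
    """Toy scale-conditioned field (invented values): 'wheel' and 'cab'
    share a feature at coarse scale and diverge at fine scale."""
    fine = {"wheel": np.array([0.0, 1.0]), "cab": np.array([1.0, 0.0])}
    return (1 - s) * fine[p] + s * np.array([0.5, 0.5])

def cluster(points, s, margin=0.5):
    """Greedy grouping: a point joins the first group whose
    representative feature at scale s lies within `margin`
    (a stand-in for the paper's density-based clustering)."""
    groups = []
    for p in points:
        for g in groups:
            if np.linalg.norm(field(p, s) - field(g[0], s)) < margin:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def build_tree(points, s, step=0.5):
    """Top-down hierarchy: re-cluster at a smaller scale and recurse
    into each subgroup whenever the group actually splits."""
    node = {"scale": s, "points": points, "children": []}
    if s - step >= 0:
        for sub in cluster(points, s - step):
            if len(sub) < len(points):
                node["children"].append(build_tree(sub, s - step, step))
    return node
```

On this toy field, `build_tree(["wheel", "cab"], 1.0)` produces a root node (the whole excavator at coarse scale) that splits into two leaf groups at the finer scale, mirroring the coarse-to-fine hierarchy described above.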

The researchers conclude by suggesting future directions: alternative cues for resolving group ambiguity, such as object affordances, could further refine the model's groupings. Such improvements would move the field closer to 3D scene understanding tools that handle the complexity of real-world environments.
