Benefit of geometry-grounding in distilled radiance fields

Determine whether geometry-grounded semantic features, such as those produced by vision backbones trained with 3D reconstruction objectives (e.g., the Visual Geometry Grounded Transformer, VGGT), provide measurable advantages over visual-only semantic features (e.g., DINO and CLIP) when semantics are distilled into radiance fields (Gaussian Splatting and neural radiance fields).

Background

Radiance field representations such as Gaussian Splatting and neural radiance fields have been successfully combined with pretrained visual-only semantic features (e.g., CLIP and DINO) to enable open-vocabulary robotics applications, including manipulation and navigation. More recent models such as VGGT are trained with 3D reconstruction objectives, so their features are geometry-grounded and may offer better spatial fidelity for tasks such as pose estimation.

Despite these developments, prior work has not established whether geometry-grounding yields tangible benefits once such features are distilled into radiance fields. This gap motivates a systematic comparison of geometry-grounded and visual-only semantic features within distilled radiance fields to assess performance gains and trade-offs.
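
To make "distilling semantics into a radiance field" concrete, the following is a minimal sketch of one common formulation, not the method of any specific paper. It assumes per-sample feature vectors along a camera ray are alpha-composited with the standard volume-rendering weights and then compared to a teacher feature for that pixel using a cosine loss. The teacher could be either a visual-only feature (e.g., from DINO or CLIP) or a geometry-grounded one (e.g., from VGGT). The function names `composite_features` and `cosine_distillation_loss` are hypothetical, chosen for illustration.

```python
import math


def composite_features(densities, deltas, features):
    """Alpha-composite per-sample feature vectors along one ray.

    densities: volume density at each sample along the ray
    deltas:    distance between consecutive samples
    features:  one feature vector (list of floats) per sample
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, delta, feat in zip(densities, deltas, features):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return rendered


def cosine_distillation_loss(rendered, teacher):
    """Return 1 - cosine similarity between rendered and teacher features."""
    dot = sum(r * t for r, t in zip(rendered, teacher))
    norm_r = math.sqrt(sum(r * r for r in rendered))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    if norm_r == 0.0 or norm_t == 0.0:
        return 1.0  # assumed convention: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_r * norm_t)


# Toy ray: an empty sample followed by a nearly opaque one.
rendered = composite_features(
    densities=[0.0, 50.0],
    deltas=[0.1, 0.1],
    features=[[1.0, 0.0], [0.0, 1.0]],
)
loss = cosine_distillation_loss(rendered, teacher=[0.0, 2.0])
```

In this toy case the rendered feature is dominated by the second, nearly opaque sample, so the loss against a teacher feature pointing in the same direction is close to zero. Under this formulation, comparing geometry-grounded and visual-only features amounts to swapping the teacher while keeping the field and the loss fixed.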

References

While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question.

— Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields (2510.03104 - Mei et al., 3 Oct 2025) in Abstract
