- The paper’s primary contribution is the formulation of the ChangingGrounding task, redefining 3D visual grounding by integrating memory with active exploration in dynamic scenes.
- It presents the Mem-ChangingGrounder framework, which combines large vision-language models with multi-view projection to reach 36.8% grounding accuracy at a reduced exploration cost.
- The approach sets a new benchmark by offering a robust trade-off between localization accuracy and exploration efficiency, with implications for real-world robotic perception.
Memory-Driven 3D Visual Grounding in Dynamic Scenes: The ChangingGrounding Benchmark and Mem-ChangingGrounder
The paper introduces a new paradigm for 3D visual grounding (3DVG) in dynamic environments, where the scene may change between observations. Traditional 3DVG methods assume access to a complete, up-to-date point cloud of the environment, which is impractical for real-world robotics due to the high cost of repeated full-scene scanning. The authors argue for a memory-driven approach, where an agent leverages past observations (memory) and selectively explores the current scene to efficiently and accurately localize objects described by natural language queries.
The ChangingGrounding task is formally defined as follows: given memory M_p (RGB-D images and poses) from a previous scene S_p, an unexplored current scene S_c, and a query D_c describing a target object, predict the 3D bounding box B of the target in S_c. The evaluation jointly considers localization accuracy and exploration cost, the latter measured by both the number of exploratory actions and the physical motion required.
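To make the accuracy side of this evaluation concrete, the minimal sketch below scores predictions with a 3D IoU, assuming axis-aligned boxes in the (xmin, ymin, zmin, xmax, ymax, zmax) convention and a generic Acc@IoU threshold; the benchmark's exact box convention and thresholds are not reproduced here.

```python
import numpy as np

def aabb_iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = float(np.prod(np.clip(hi - lo, 0.0, None)))   # overlap volume (0 if disjoint)
    vol_a = float(np.prod(box_a[3:] - box_a[:3]))
    vol_b = float(np.prod(box_b[3:] - box_b[:3]))
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of queries whose predicted box overlaps the ground truth above the IoU threshold."""
    hits = [aabb_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```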
Figure 1: Comparison between the previous setting of 3DVG and the ChangingGrounding task.
ChangingGrounding Dataset and Benchmark
To support this new task, the authors construct the ChangingGrounding dataset, based on the 3RScan dataset, which provides temporally separated scans of the same indoor environments with object correspondences and transformations. The dataset generation pipeline involves:
- Generating spatial-relation-based natural language queries using a template: ⟨Target Category⟩ ⟨Spatial Relation⟩ ⟨Anchor Category⟩ (e.g., "the chair farthest from the cabinet"); a sketch of this template follows the list.
- Filtering object categories and spatial relations to ensure robust, diverse, and unambiguous queries.
- Aligning all scans to a global coordinate system and rendering standardized RGB-D images for both memory and exploration.
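The snippet below instantiates descriptions of the form ⟨Target Category⟩ ⟨Spatial Relation⟩ ⟨Anchor Category⟩. The relation phrasings and vocabulary are illustrative placeholders, not the dataset's actual category lists, relation set, or ambiguity filters.

```python
# Minimal sketch of the <Target Category> <Spatial Relation> <Anchor Category> template.
# The phrasings below are illustrative; the real generation pipeline applies the
# authors' category filtering and ambiguity checks.
RELATION_PHRASES = {
    "farthest from": "the {target} farthest from the {anchor}",
    "closest to":    "the {target} closest to the {anchor}",
    "above":         "the {target} above the {anchor}",
    "under":         "the {target} under the {anchor}",
}

def make_query(target: str, relation: str, anchor: str) -> str:
    """Instantiate one spatial-relation description from the template."""
    return RELATION_PHRASES[relation].format(target=target, anchor=anchor)

print(make_query("chair", "farthest from", "cabinet"))
# -> the chair farthest from the cabinet
```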
The resulting dataset contains 266,916 referential descriptions, making it the largest dataset of its kind and the only existing benchmark for 3D visual grounding in changing environments.
Figure 2: ChangingGrounding Dataset generation pipeline.
A word cloud of the spatial-relation descriptions highlights the diversity and frequency of object categories and relations.
Figure 3: A word cloud generated from the spatial-relation descriptions, highlighting the relative frequency of terms.
Mem-ChangingGrounder: A Zero-Shot Memory-Driven Baseline
The Mem-ChangingGrounder (MCG) framework is proposed as a strong zero-shot baseline for the ChangingGrounding task. MCG is designed to operate without task-specific training, instead leveraging large vision-language models (VLMs) and open-vocabulary detectors. The workflow consists of four core modules, sketched in code after the list below:
- Query Classification: Determines whether the query is verifiable (can be resolved from memory if the scene is unchanged) or unverifiable (requires exploration due to possible scene changes).
- Memory Retrieval and Grounding: Uses memory to locate anchor or target objects, then guides exploration using two action policies:
  - Omnidirectional Scene Scanner (OSS): Performs 360° scans from a given pose to find anchor/target objects.
  - Spatial Relation Aware Scanner (SRAS): Explores from the anchor's pose, guided by the spatial relation in the query.
- Fallback: If initial grounding fails, retrieves the clearest memory image of the target class and initiates a new search.
- Multi-View Projection: After identifying the target image, uses VLMs and SAM to segment the object, projects masks into 3D, and fuses multi-view observations for accurate bounding box estimation.
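The control-flow sketch below ties the four modules together in the order described above. Every helper it calls (classify_query, ground_from_memory, confirmed_unchanged, retrieve_anchor_pose, oss_scan, sras_scan, contains_target, clearest_memory_image, multiview_project) is a hypothetical stand-in introduced only to make the module ordering explicit; it is not the authors' code or API.

```python
def mem_changing_grounder(query, memory, scene):
    # 1) Query Classification: a "verifiable" query may be resolved from memory
    #    alone if the relevant part of the scene turns out to be unchanged.
    if classify_query(query) == "verifiable":
        box = ground_from_memory(memory, query)
        if box is not None and confirmed_unchanged(scene, box):
            return box  # no further exploration needed

    # 2) Memory Retrieval and Grounding: use memory to locate the anchor (or target)
    #    and move the agent to the corresponding pose in the current scene.
    anchor_pose = retrieve_anchor_pose(memory, query)

    # 2a) Omnidirectional Scene Scanner: 360° scan from that pose.
    views = oss_scan(scene, pose=anchor_pose)

    # 2b) Spatial Relation Aware Scanner: relation-guided exploration from the anchor.
    if not contains_target(views, query):
        views = sras_scan(scene, anchor_pose, relation=query.relation)

    # 3) Fallback: re-seed the search from the clearest memory image of the target class.
    if not contains_target(views, query):
        seed = clearest_memory_image(memory, query.target_category)
        views = oss_scan(scene, pose=seed.pose)

    # 4) Multi-View Projection: segment the target, lift masks to 3D, fuse into a box.
    return multiview_project(views, query)
```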
Figure 4: Workflow of Mem-ChangingGrounder (MCG). The upper part shows the overall pipeline: MCG classifies queries, retrieves memory, uses OSS and SRAS to search, applies fallback when needed, and predicts the 3D bounding box through multi-view projection. The lower part shows details of OSS, SRAS, and Multi-view Projection.
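For the Multi-View Projection step, a standard way to lift 2D masks into 3D is pinhole backprojection using the depth map, intrinsics, and camera pose. The sketch below follows that standard model; the percentile-trimmed box fusion is an assumption made here for illustration, not the paper's exact fusion rule.

```python
import numpy as np

def mask_to_world_points(mask, depth, K, cam_to_world):
    """Backproject a binary 2D mask into world coordinates using the depth map,
    3x3 intrinsics K, and a 4x4 camera-to-world pose (standard pinhole model)."""
    v, u = np.nonzero(mask)                       # pixel rows / columns inside the mask
    z = depth[v, u]
    keep = z > 0                                  # discard pixels with missing depth
    u, v, z = u[keep], v[keep], z[keep]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (cam_to_world @ pts_cam.T).T[:, :3]

def fuse_views_to_box(point_sets, low=2.0, high=98.0):
    """Fuse per-view point clouds and fit a percentile-trimmed axis-aligned box,
    which suppresses stray points caused by segmentation or projection noise."""
    pts = np.concatenate(point_sets, axis=0)
    lo = np.percentile(pts, low, axis=0)
    hi = np.percentile(pts, high, axis=0)
    return np.concatenate([lo, hi])               # (xmin, ymin, zmin, xmax, ymax, zmax)
```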
Experimental Results and Analysis
MCG is evaluated against three baselines:
- Wandering Grounding (WG): Exhaustive exploration of the current scene.
- Central Rotation Grounding (CRG): 360° rotation at the scene center.
- Memory-Only Grounding (MOG): Uses only memory, no exploration.
MCG achieves the highest localization accuracy (36.8% in the high-resolution setting) while maintaining a significantly lower exploration cost than WG. CRG and MOG incur even lower costs but substantially reduced accuracy, demonstrating that both memory and targeted exploration are necessary. Notably, MCG's memory-driven policy enables a favorable trade-off between efficiency and precision.
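As a reference point for how the exploration-cost side of this trade-off might be accounted, the sketch below counts exploratory actions and sums the translational distance between consecutive camera poses; how the benchmark weights rotation against translation is not reproduced here.

```python
import numpy as np

def exploration_cost(poses):
    """Two simple cost proxies from an exploration trajectory of 4x4 camera poses:
    the number of exploratory actions (pose changes) and the total translational
    distance travelled between consecutive camera centers."""
    centers = np.asarray([p[:3, 3] for p in poses])
    num_actions = max(len(poses) - 1, 0)
    if len(poses) > 1:
        path_length = float(np.linalg.norm(np.diff(centers, axis=0), axis=1).sum())
    else:
        path_length = 0.0
    return num_actions, path_length
```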
Ablation studies confirm that memory access dramatically reduces exploration cost without sacrificing accuracy, and that multi-view projection and fallback strategies further improve robustness. The choice of VLM is critical: GPT-4.1 outperforms GPT-4o, indicating that advances in VLMs directly translate to better grounding performance.
Failure Modes and Limitations
The authors provide a detailed analysis of failure cases:
- VLM Failures: Inaccurate anchor or target identification due to limited VLM spatial reasoning, especially for fine-grained or relational queries.
- SRAS/OSS Limitations: Insufficient viewpoint coverage can miss objects, particularly those at unusual heights or occluded positions.
- Segmentation and Projection Errors: SAM and projection steps are sensitive to rendering artifacts and detector errors, leading to noisy or incomplete 3D masks.
Figure 5: VLM failure in memory retrieval; the anchor object is a box.
Figure 6: Failure in SRAS; the user query asks for the cushion farthest from the PC.
Figure 7: VLM failure to ground the target image for the query "cabinet near the box."
Figure 8: Failure in SAM and projection.
Qualitative Demonstration
A full demonstration case illustrates the stepwise reasoning and perception process of MCG, showing how memory retrieval, spatial relation parsing, targeted exploration, and multi-view fusion are orchestrated to localize the target object in a changing scene.
Figure 9: Case of the MCG grounding part-1.
Figure 10: Case of the MCG grounding part-2.
Figure 11: Case of the MCG grounding part-3.
Implications and Future Directions
This work establishes a new research direction for 3D visual grounding in dynamic environments, emphasizing the importance of memory-driven, active perception. The ChangingGrounding benchmark and MCG baseline provide a foundation for developing more practical and efficient 3DVG systems for robotics and embodied AI.
Key implications and open problems include:
- VLM Robustness: Improving spatial reasoning and relational understanding in VLMs is essential for further gains.
- Multimodal Integration: Tighter coupling of visual, linguistic, and spatial cues can enhance grounding accuracy.
- Benchmark Expansion: Incorporating more diverse scene changes (lighting, appearance, dynamic interactions) and allocentric relations will increase realism and challenge.
- Efficient 2D-3D Fusion: Advances in segmentation and projection pipelines are needed to reduce noise and improve 3D localization.
Conclusion
The ChangingGrounding benchmark and Mem-ChangingGrounder baseline redefine 3D visual grounding as an active, memory-centric problem, moving beyond static scene assumptions. By jointly optimizing for accuracy and exploration cost, this work provides a practical framework for real-world robotic perception in dynamic environments and sets the stage for future research in memory-augmented embodied AI.