
3EED: Ground Everything Everywhere in 3D

Published 3 Nov 2025 in cs.CV and cs.RO (arXiv:2511.01755v1)

Abstract: Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-LLM prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.

Summary

  • The paper presents 3EED, a novel benchmark that scales outdoor 3D visual grounding with multi-platform, multi-modal data and extensive language annotations.
  • It introduces platform-aware normalization and multi-scale fusion techniques that substantially improve in-domain and cross-platform performance.
  • The study establishes rigorous protocols for multi-object reasoning and spatial disambiguation, highlighting challenges in diverse outdoor scenes.

3EED: A Multi-Platform, Multi-Modal Benchmark for 3D Visual Grounding

Introduction

The 3EED dataset and benchmark address the limitations of prior 3D visual grounding resources, which have been constrained to indoor environments, single-platform data, and limited scale. 3EED introduces a large-scale, multi-platform, multi-modal corpus for 3D grounding in outdoor scenes, integrating synchronized LiDAR and RGB data from vehicle, drone, and quadruped platforms. The benchmark comprises over 128,000 annotated objects and 22,000 human-verified referring expressions, representing a tenfold increase in scale over previous outdoor datasets. The work establishes new protocols for in-domain, cross-platform, and multi-object grounding, and proposes platform-aware normalization and cross-modal alignment techniques to facilitate generalizable 3D grounding.

Figure 1: Multi-modal, multi-platform 3D grounding: given a scene and a structured natural language expression, the task is to localize the referred object in 3D space across vehicle, drone, and quadruped platforms.

Dataset Construction and Annotation Pipeline

3EED unifies data from three embodied platforms: vehicles (Waymo Open Dataset), drones, and quadrupeds (M3ED). The annotation pipeline is designed for scalability and quality, combining vision-language model (VLM) prompting (Qwen2-VL-72B) with human-in-the-loop verification. The process involves:

  • Pseudo-label seeding: Multiple state-of-the-art 3D detectors (PV-RCNN, CenterPoint, etc.) generate initial bounding boxes.
  • Automatic consolidation: Kernel density estimation merges detector outputs, multi-object tracking enforces temporal coherence, and the Tokenize Anything model projects boxes onto RGB for semantic validation (a minimal sketch of the density-based merging follows this list).
  • Human refinement: Annotators verify and correct boxes and referring expressions using a custom UI, ensuring platform-invariant, unambiguous language.

    Figure 2: Annotation workflow: multi-detector fusion, tracking, filtering, and manual verification for 3D boxes; VLM prompting and human refinement for referring expressions.
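To make the consolidation step concrete, here is a minimal sketch, assuming boxes are given as (x, y, z, dx, dy, dz, yaw) arrays: detections from all detectors are pooled, and only those whose bird's-eye-view centers fall in high-density regions of a Gaussian kernel density estimate are kept. The helper name, the BEV density, and the quantile threshold are illustrative assumptions, not the authors' implementation, which additionally applies tracking and image-based semantic validation.

```python
# Minimal sketch of density-based consolidation of multi-detector 3D boxes.
# Hypothetical helper, not the authors' code: the box format
# (x, y, z, dx, dy, dz, yaw), the BEV density, and the quantile threshold
# are illustrative assumptions.
import numpy as np
from scipy.stats import gaussian_kde

def consolidate_boxes(boxes_per_detector, density_quantile=0.5):
    """Pool boxes from several detectors and keep those whose bird's-eye-view
    centers fall in high-density regions of a Gaussian KDE."""
    pooled = np.concatenate(boxes_per_detector, axis=0)     # (N, 7)
    bev_centers = pooled[:, :2].T                           # (2, N) for gaussian_kde
    density = gaussian_kde(bev_centers)(bev_centers)        # density at each center
    keep = density >= np.quantile(density, density_quantile)
    return pooled[keep]

# Two detectors agree on one object near the origin; a spurious far-away
# detection falls in a low-density region and is pruned.
det_a = np.array([[0.1, 0.0, 0.0, 4.0, 1.8, 1.5, 0.0]])
det_b = np.array([[0.0, 0.1, 0.0, 4.1, 1.8, 1.5, 0.0],
                  [50.0, 50.0, 0.0, 4.0, 1.8, 1.5, 0.0]])
print(consolidate_boxes([det_a, det_b]).shape)              # (2, 7)
```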

The dataset covers diverse outdoor scenes, with platform-specific differences in viewpoint geometry, object density, and LiDAR point cloud elevation. Vehicle data is characterized by mid-range, level views; drone data by top-down views with sparse points but dense, cluttered scenes; quadruped data by ground-level, close-range perspectives.

Figure 3: Examples of multi-platform 3D grounding: discrepancies in sensory data and referring expressions across vehicle, drone, and quadruped agents.

Figure 4: Dataset statistics: polar distribution of target boxes, scene and object count histograms, and elevation biases per platform.

Benchmark Protocols and Model Design

The benchmark suite includes four diagnostic settings:

  1. Single-platform, single-object grounding: In-domain evaluation per platform.
  2. Cross-platform transfer: Zero-shot evaluation from vehicle-trained models to drone/quadruped data.
  3. Multi-object grounding: Localization of all objects described in a multi-object expression.
  4. Multi-platform grounding: Joint training on all platforms, with per-platform evaluation.
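
The four settings can be written down as a small configuration sketch. The key names below, and the platform split assumed for the multi-object setting, are illustrative assumptions rather than the released toolkit's API.

```python
# Illustrative protocol definitions (keys and names are assumptions, not the
# official toolkit's API); each entry lists which platforms supply training
# and evaluation data, and whether expressions refer to one or many objects.
PLATFORMS = ["vehicle", "drone", "quadruped"]

PROTOCOLS = {
    # In-domain: train and evaluate on the same platform (one run per platform).
    "single_platform": [
        {"train": [p], "eval": [p], "referents": "single"} for p in PLATFORMS
    ],
    # Zero-shot transfer: a vehicle-trained model evaluated on the other platforms.
    "cross_platform": [
        {"train": ["vehicle"], "eval": ["drone", "quadruped"], "referents": "single"}
    ],
    # Multi-object: every object mentioned in a multi-object expression must be
    # localized (platform split assumed to mirror the single-object setting).
    "multi_object": [
        {"train": [p], "eval": [p], "referents": "multiple"} for p in PLATFORMS
    ],
    # Joint training on all platforms, reported per platform.
    "multi_platform": [
        {"train": PLATFORMS, "eval": [p], "referents": "single"} for p in PLATFORMS
    ],
}
```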

The baseline model adapts PointNet++ for scale-adaptive encoding and incorporates platform-aware normalization (CPA), multi-scale sampling (MSS), and scale-aware fusion (SAF). CPA aligns all scans to a gravity-consistent frame, MSS samples neighborhoods at multiple radii to address LiDAR sparsity and object scale variation, and SAF dynamically fuses multi-scale features per point.
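
A hedged sketch of the scale-handling ideas is given below: neighborhoods are grouped at several radii (the MSS idea) and the resulting per-point features are fused with learned softmax weights (the SAF idea). The radii, feature widths, the crude average-pooling grouping, and all module names are assumptions for illustration, not the paper's exact architecture; gravity alignment (CPA) is sketched separately under Implementation Considerations.

```python
# Hedged PyTorch sketch of multi-scale sampling and scale-aware fusion.
# Radii, widths, and the simplistic grouping are illustrative assumptions,
# not the paper's exact modules.
import torch
import torch.nn as nn

def ball_group(points, centers, radius, k):
    """For each center, average the coordinates of up to k neighbors lying
    within `radius` (a crude stand-in for PointNet++-style grouping + pooling)."""
    dist = torch.cdist(centers, points)                               # (B, M, N)
    in_ball = (dist <= radius).float()
    # Score in-ball points highest, nearest first, then pick k candidates.
    idx = torch.topk(in_ball - dist / (radius + 1e-6), k, dim=-1).indices
    nbrs = torch.gather(points.unsqueeze(1).expand(-1, centers.shape[1], -1, -1),
                        2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))   # (B, M, k, 3)
    valid = torch.gather(in_ball, 2, idx).unsqueeze(-1)               # drop out-of-ball picks
    return (nbrs * valid).sum(2) / valid.sum(2).clamp(min=1.0)        # (B, M, 3)

class MultiScaleFusion(nn.Module):
    """Encode each seed point at several radii, then fuse the per-scale
    features with point-wise softmax weights (the scale-aware part)."""
    def __init__(self, radii=(0.5, 2.0, 8.0), k=16, dim=64):
        super().__init__()
        self.radii, self.k = radii, k
        self.encoders = nn.ModuleList([nn.Linear(3, dim) for _ in radii])
        self.gate = nn.Linear(dim * len(radii), len(radii))

    def forward(self, points, centers):
        feats = [enc(ball_group(points, centers, r, self.k))
                 for enc, r in zip(self.encoders, self.radii)]          # S x (B, M, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, -1)), dim=-1)
        return (torch.stack(feats, 2) * weights.unsqueeze(-1)).sum(2)   # (B, M, dim)

points = torch.randn(2, 1024, 3)                 # toy LiDAR scan
seeds = points[:, :128]                          # toy seed points
print(MultiScaleFusion()(points, seeds).shape)   # torch.Size([2, 128, 64])
```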

Experimental Results

Single-Object Grounding

The baseline model demonstrates substantial improvements over prior methods (BUTD-DETR, EDA, WildRefer) in both in-domain and cross-platform settings. For example, when trained on vehicle data, the baseline achieves Acc@25 of 78.37% (vs. 52.38% for BUTD-DETR) and transfers far better across platforms (drone: 18.16% vs. 1.54%; quadruped: 36.04% vs. 10.18%). Unified multi-platform training yields balanced performance across all platforms.
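
For reference, Acc@25 counts predictions whose 3D IoU with the ground-truth box exceeds 0.25. The sketch below computes the metric for axis-aligned boxes given as (x, y, z, dx, dy, dz), ignoring yaw for simplicity; the benchmark's official evaluation code may handle rotated boxes differently.

```python
# Hedged metric sketch: axis-aligned 3D IoU (yaw ignored), Acc@0.25, and mIoU.
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, z, dx, dy, dz)."""
    a_min, a_max = box_a[:3] - box_a[3:] / 2, box_a[:3] + box_a[3:] / 2
    b_min, b_max = box_b[:3] - box_b[3:] / 2, box_b[:3] + box_b[3:] / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    return inter / (box_a[3:].prod() + box_b[3:].prod() - inter)

def acc_and_miou(preds, gts, threshold=0.25):
    ious = np.array([iou_3d(p, g) for p, g in zip(preds, gts)])
    return float((ious >= threshold).mean()), float(ious.mean())

pred = [np.array([0.5, 0.0, 0.0, 4.0, 2.0, 1.5])]   # prediction shifted 0.5 m along x
gt   = [np.array([0.0, 0.0, 0.0, 4.0, 2.0, 1.5])]
print(acc_and_miou(pred, gt))                        # (1.0, 0.777...)
```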

Multi-Object Grounding

In multi-object scenarios, the baseline outperforms existing methods in both Acc@25 and mIoU, indicating superior joint reasoning and spatial disambiguation. The model's multi-scale and adaptive fusion modules are critical for handling complex scenes with multiple referents and varying object sizes.

Figure 5: Multi-object 3D grounding: localizing each referred object by reasoning over semantic attributes and inter-object spatial relationships.

Ablation Studies

Component ablations confirm the complementarity of CPA, MSS, and SAF. Removing any module degrades both in-domain and cross-platform performance. Object density analysis reveals that grounding accuracy decreases as the number of objects per scene increases, especially for drone data due to extreme sparsity and clutter.

Figure 6: Qualitative comparisons: baseline model yields precise, tightly aligned 3D boxes across platforms; prior methods struggle with viewpoint and density shifts.

Implementation Considerations

  • Computational Requirements: Training requires two RTX 4090 GPUs for 100 epochs; inference is efficient due to lightweight fusion modules.
  • Data Preprocessing: Uniform downsampling to 16,384 points per scene; gravity-aligned normalization for cross-platform consistency (a minimal sketch follows this list).
  • Annotation Quality: Human verification ensures high-fidelity language and spatial alignment; platform-invariant prompts standardize linguistic supervision.
  • Deployment: The model is robust to platform shifts and can be integrated into embodied agents for navigation, interaction, and situational awareness in outdoor environments.
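
As a concrete, hedged illustration of the preprocessing bullet above, the sketch below uniformly downsamples a scan to 16,384 points and rotates it so an estimated up-vector maps to +z before centering. The up-vector source (e.g. an IMU) and the exact normalization are assumptions, not the released pipeline.

```python
# Hedged preprocessing sketch: uniform downsampling to a fixed point budget,
# then gravity-aligned normalization (rotate the estimated up-vector to +z
# and center the scan). Details are assumptions, not the released pipeline.
import numpy as np

def downsample(points, num_points=16_384, rng=None):
    """Uniformly sample a fixed number of points (with replacement only if
    the scan is smaller than the budget)."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(points.shape[0], num_points,
                     replace=points.shape[0] < num_points)
    return points[idx]

def gravity_align(points, up_vector):
    """Rotate the scan so `up_vector` (e.g. from the platform IMU) maps to +z,
    then subtract the centroid."""
    up = up_vector / np.linalg.norm(up_vector)
    z = np.array([0.0, 0.0, 1.0])
    v, c = np.cross(up, z), float(up @ z)
    if np.isclose(c, -1.0):                       # up-vector points straight down
        rot = np.diag([1.0, -1.0, -1.0])
    else:                                         # Rodrigues' formula mapping up -> z
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        rot = np.eye(3) + vx + vx @ vx / (1.0 + c)
    aligned = points @ rot.T
    return aligned - aligned.mean(axis=0, keepdims=True)

scan = np.random.default_rng(1).normal(size=(200_000, 3))   # toy LiDAR scan
up_estimate = np.array([0.10, 0.00, 0.99])                  # slightly tilted platform
print(gravity_align(downsample(scan), up_estimate).shape)   # (16384, 3)
```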

Implications and Future Directions

3EED exposes significant performance gaps in cross-platform 3D grounding, underscoring the need for platform-aware, scale-adaptive models. The dataset's diversity and scale enable rigorous evaluation of generalization, multi-object reasoning, and language-vision alignment in real-world conditions. Future research should address temporal dynamics, ambiguous language, and extension to additional sensor modalities. The benchmark provides a foundation for developing robust, context-aware embodied perception systems.

Conclusion

3EED establishes a new standard for outdoor 3D visual grounding, offering a large-scale, multi-platform, multi-modal dataset and comprehensive benchmark protocols. The proposed annotation pipeline and baseline model advance the state of the art in generalizable 3D grounding, revealing key challenges and opportunities for future research in language-driven embodied AI.
