
Monocular 3D Visual Grounding (Mono3DVG)

Updated 17 November 2025
  • Mono3DVG is a task that recovers metric 3D properties (location, size, orientation) from a single RGB image using both appearance and geometric textual cues.
  • The approach employs transformer-based architectures with dual text-guided adapters and pixel-wise feature alignment to fuse visual and spatial information.
  • Experimental results on the Mono3DRefer benchmark show improved accuracy in 3D localization, especially in far-range and occluded scenarios.

Monocular 3D Visual Grounding (Mono3DVG) refers to the automatic localization and identification of a three-dimensional object in a single RGB image from a free-form text description that includes both visual (appearance) and geometric (spatial, quantitative) cues. The defining feature of the task is the recovery of metric 3D properties (location, size, orientation) from monocular input without external depth sensors, relying instead on both image cues and explicit textual geometry. Mono3DVG’s emergence has led to the creation of specialized datasets (notably Mono3DRefer) and a series of end-to-end transformer-based models leveraging multi-modal learning.

1. Task Definition and Dataset Foundation

At the core of Mono3DVG is the mapping from an image-text pair to a unique 3D bounding box (a minimal code sketch of this signature follows the list below):

  • Input:
    • RGB image $I \in \mathbb{R}^{H \times W \times 3}$
    • Natural-language expression $T = \{w_1, w_2, \ldots, w_N\}$
  • Output:
    • A single 3D bounding box $B = (x, y, z, w, h, l, \theta)$
    • $(x, y, z)$: center in camera/world coordinates
    • $(w, h, l)$: size (width, height, length)
    • $\theta$: yaw/orientation
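The following minimal sketch fixes this input/output signature in code; the names (Box3D, ground) are illustrative and not part of any published Mono3DVG codebase.

from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    # Center (x, y, z) in camera coordinates, in meters
    x: float
    y: float
    z: float
    # Size: width, height, length, in meters
    w: float
    h: float
    l: float
    # Yaw / orientation, in radians
    theta: float

def ground(image: np.ndarray, expression: str) -> Box3D:
    """Map an (H, W, 3) RGB image and a free-form referring expression
    to a single 3D bounding box for the referred object."""
    raise NotImplementedError  # stands in for a full Mono3DVG model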

The Mono3DRefer dataset is the principal benchmark:

  • 2,025 images (KITTI), 8,228 3D boxes, 41,140 text expressions
  • Average expression: 53 tokens, vocab size: 5,271
  • Objects range up to 102 meters, enabling far-range metric evaluation beyond standard RGB-D and LiDAR datasets
  • Descriptions include both appearance and explicit geometric (distance, size, relative position) information, produced via prompt engineering and verified by annotators (Zhan et al., 2023)

Annotations were generated by a structured pipeline:

  • Attribute extraction (color, occlusion, 3D properties)
  • Prompt-based sentence generation using ChatGPT
  • Manual verification for referential clarity and uniqueness

2. Model Architectures and Multi-Modal Encoding

Mono3DVG-TR introduces the first end-to-end transformer architecture for this task (Zhan et al., 2023):

  • Text encoder: RoBERTa-base plus a linear projection to $C = 256$ channels
  • Visual encoder: ResNet-50 backbone with multi-scale feature outputs and a depth predictor built from lightweight convolutional/transformer layers
  • Dual text-guided adapters: Cross-attention modules for both the visual and geometric feature branches, refining these streams using textual guidance
  • Grounding decoder: Stacked modules performing sequential attention over depth, text, and visual cues
  • Prediction head: Outputs class logits, 2D/3D box parameters, and depth via fully-connected networks

Feature alignment is achieved via pixel-wise similarity (Gaussian mapping) between text- and vision-guided features, allowing dense correspondence between modalities.
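
As a rough illustration of this alignment step, the sketch below scores per-pixel agreement between a text-guided and a vision-guided feature map with a Gaussian kernel; the exact formulation in Mono3DVG-TR (normalization, learned bandwidth, where the map is applied) may differ.

import torch

def pixelwise_gaussian_similarity(f_text: torch.Tensor,
                                  f_vis: torch.Tensor,
                                  sigma: float = 1.0) -> torch.Tensor:
    """Dense alignment between two (B, C, H, W) feature maps.

    The squared feature distance at each pixel is passed through a Gaussian
    kernel, giving a (B, H, W) similarity map with values in (0, 1].
    """
    d2 = ((f_text - f_vis) ** 2).sum(dim=1)      # per-pixel squared distance
    return torch.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian mapping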

Loss Functions

Three principal loss branches:

  • 2D localization:

\mathcal{L}_{2D} = \lambda_1\mathcal{L}_{class} + \lambda_2\mathcal{L}_{lrtb} + \lambda_3\mathcal{L}_{GIoU} + \lambda_4\mathcal{L}_{xy3D}

  • 3D attributes:

\mathcal{L}_{3D} = \mathcal{L}_{size3D} + \mathcal{L}_{orien} + \mathcal{L}_{depth}

Total loss, including a depth-map term $\mathcal{L}_{dmap}$:

\mathcal{L}_{overall} = \mathcal{L}_{2D} + \mathcal{L}_{3D} + \mathcal{L}_{dmap}
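
A minimal sketch of how these terms combine, assuming the individual losses have already been computed as scalars (the lambda weights are placeholders, not the published values):

import torch

def overall_loss(terms: dict, lambdas=(1.0, 1.0, 1.0, 1.0)) -> torch.Tensor:
    """Combine the 2D, 3D, and depth-map branches as in the equations above."""
    l1, l2, l3, l4 = lambdas
    loss_2d = (l1 * terms["class"] + l2 * terms["lrtb"]
               + l3 * terms["giou"] + l4 * terms["xy3d"])
    loss_3d = terms["size3d"] + terms["orien"] + terms["depth"]
    return loss_2d + loss_3d + terms["dmap"]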

3. Key Methodological Advances

Recent state-of-the-art advances address intrinsic challenges in visual-language grounding for monocular 3D:

CLIP-LCA: Certainty-Based Keyword Masking (Li et al., 10 Nov 2025)

  • Mitigates the model’s propensity to over-rely on explicit, high-certainty tokens (e.g., “red car”).
  • Each token is scored for “certainty” via the cosine similarity $s_i = \cos(t_i, v)$ between its word embedding and the visual features of the referent from a CLIP ViT backbone, followed by k-means clustering of the scores.
  • During training, high-certainty words are masked (“***”), forcing the model to leverage implicit, low-certainty (primarily spatial/geometric) cues.
  • At inference, the full original expression is used.
  • Pseudocode:

for each training sample (I, T = [w_1, ..., w_N], B*):
    I_crop = crop_image(I, B*)              # crop the referent using its ground-truth box
    v = CLIP_Vision(I_crop)                 # CLIP visual embedding of the referent
    s = [cosine(CLIP_Text(w_i), v) for w_i in T]   # per-token certainty scores
    labels = kmeans(s, k=2)                 # split tokens into two certainty clusters
    high = cluster_with_larger_mean_score(s, labels)   # high-certainty (appearance) cluster
    T_masked = ["***" if labels[i] == high else w_i for i in range(N)]
    T_out = RoBERTa(T_masked)               # encode the masked expression

D2M: Dimension-Decoupled Feature Splitting (Li et al., 10 Nov 2025)

  • Addresses cross-dimensional interference by disentangling text features into separate 2D and 3D guides (a minimal sketch follows this list).
  • Learnable queries $L_{2D}, L_{3D}$ extract dimension-specific signals via multi-head cross-attention:

H_{2D} = \mathrm{MHCA}(L_{2D}, T_t), \quad H_{3D} = \mathrm{MHCA}(L_{3D}, T_t)

  • Reverse attention mechanisms invert the cross-attention maps to force separation of dimension-specific content.
  • Refined visual and depth features are then updated only by the dimensionally matched (2D/3D) text features.
  • This mitigates contamination such as “red” (2D) polluting depth (3D) inference and vice versa.
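
The sketch below illustrates the query-based decoupling step with PyTorch's nn.MultiheadAttention; the number of queries, the model width, and the use of a shared attention block are assumptions, and the reverse-attention separation described above is omitted.

import torch
import torch.nn as nn

class DimensionDecoupledText(nn.Module):
    """Learnable 2D/3D queries read dimension-specific content from shared
    text features via multi-head cross-attention (a D2M-style sketch)."""

    def __init__(self, d_model: int = 256, n_queries: int = 4, n_heads: int = 8):
        super().__init__()
        self.q_2d = nn.Parameter(torch.randn(n_queries, d_model))
        self.q_3d = nn.Parameter(torch.randn(n_queries, d_model))
        self.mhca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor):
        # text_feats: (B, N_tokens, d_model) from the text encoder
        b = text_feats.size(0)
        q_2d = self.q_2d.unsqueeze(0).expand(b, -1, -1)
        q_3d = self.q_3d.unsqueeze(0).expand(b, -1, -1)
        h_2d, _ = self.mhca(q_2d, text_feats, text_feats)  # 2D-guiding text features
        h_3d, _ = self.mhca(q_3d, text_feats, text_feats)  # 3D-guiding text features
        return h_2d, h_3d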

3DTE and TGE: Unit-Aware Text and Geometry Enhancement (Li et al., 26 Aug 2025)

  • Pre-trained language models are insensitive to unit equivalences: “10 meters” and “1000 centimeters” yield divergent embeddings.
  • 3D-Text Enhancement (3DTE): Randomly remaps each numerical span among meters, decimeters, and centimeters during training, compelling the model to attend to metric equivalence (a sketch of this augmentation follows this list). For each “number + unit” span in the query:
    • Sample a new unit
    • Adjust the number: $v' = v_0 \times (c(u) / c(u_0))$
    • Replace the segment
  • Text-Guided Geometry Enhancement (TGE): Injects 3D-enhanced text feature projections into the geometry branch via MHCA, yielding a text-calibrated geometric representation.
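
A minimal sketch of 3DTE-style unit remapping as a text augmentation; the unit vocabulary, spelling variants, and number formatting are assumptions about how expressions are phrased, not the published implementation.

import random
import re

# Conversion factor from each unit to meters (plays the role of c(u) above).
UNIT_TO_M = {"meters": 1.0, "decimeters": 0.1, "centimeters": 0.01}
NUM_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*(meters|decimeters|centimeters)")

def remap_units(expression: str) -> str:
    """Rewrite every 'number + unit' span into a randomly chosen equivalent unit."""
    def _swap(match: re.Match) -> str:
        value, unit = float(match.group(1)), match.group(2)
        new_unit = random.choice(list(UNIT_TO_M))
        new_value = value * UNIT_TO_M[unit] / UNIT_TO_M[new_unit]
        return f"{new_value:g} {new_unit}"
    return NUM_UNIT.sub(_swap, expression)

# e.g. remap_units("the white van about 10 meters away")
#      -> "the white van about 1000 centimeters away" (one possible sample)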

4. Experimental Protocols and Benchmarks

Models are evaluated on Mono3DRefer, using [email protected] and [email protected] (the percentage of predictions whose 3D IoU with the ground-truth box exceeds 0.25 or 0.5) as the principal metrics; a minimal computation sketch follows the stratification list below. Evaluation is stratified by:

  • Unique vs. Multiple: is the referent category unique in the scene?
  • Distance ranges: Near (0–15m), Medium (15–35m), Far (>35m)
  • Occlusion difficulty: Easy, Moderate, Hard
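
As a reference for the headline metrics, the sketch below computes Acc@IoU from per-sample 3D IoU values; computing the rotated 3D IoU itself (box overlap under yaw) is omitted here.

import numpy as np

def accuracy_at_iou(ious: np.ndarray, threshold: float) -> float:
    """Fraction of predictions whose 3D IoU with the ground-truth box exceeds `threshold`."""
    return float((ious > threshold).mean())

# Illustrative values only:
ious = np.array([0.62, 0.18, 0.41, 0.77])
acc_25 = accuracy_at_iou(ious, 0.25)  # [email protected]
acc_50 = accuracy_at_iou(ious, 0.5)   # [email protected]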

Implementation Highlights

  • Training: AdamW, 60 epochs, batch size = 10, lr = 1e-4, weight decay = 1e-4, dropout = 0.1 (a minimal configuration sketch follows this list)
  • Hardware: RTX 3090 (24 GB)
  • Backbones: RoBERTa-base, ResNet-50 for 2D, lightweight depth network for 3D
  • No additional sensors (monocular RGB only); no explicit language augmentation unless noted
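
A minimal configuration sketch matching the settings above; the model here is a stand-in module so the snippet is self-contained, and no claim is made about the schedulers or parameter groups used in the original implementations.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.Dropout(p=0.1))  # stand-in for the grounding model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
num_epochs, batch_size = 60, 10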

5. Quantitative Results

Performance Comparison Table

Method | [email protected] / [email protected] | Far [email protected]
Mono3DVG-TR (Zhan et al., 2023) | 64.36 / 44.25 | 15.35
CLIP-LCA only | 66.57 / 49.29 | –
D2M only | 68.11 / 51.08 | –
Mono3DVG-EnSD (Li et al., 10 Nov 2025) | 69.51 / 52.85 | 28.89 (+13.54)
3DTE only (Li et al., 26 Aug 2025) | – / 50.25 | –
TGE only (Li et al., 26 Aug 2025) | – / 49.29 | –
3DTE+TGE (Li et al., 26 Aug 2025) | – / 51.21 | 27.29 (+11.94)

These results establish that both CLIP-LCA (keyword masking) and D2M (dimensionally consistent feature splitting) individually and jointly yield substantial performance gains. Monocular 3D visual grounding at large ranges (e.g., Far at >35m) shows the greatest improvement compared to prior work.

Additional findings:

  • Random unit conversion (3DTE) improves text-feature consistency under unit remapping, increasing cosine/Euclidean similarity and bolstering generalization to unseen units (e.g., “millimeter”, which never appears during training).
  • CLIP-LCA shifts attention spatially from high-certainty appearance phrases to spatial/geometric relations, e.g., “behind the white van, 10m away.”

6. Analysis, Limitations, and Research Directions

Mono3DVG research demonstrates:

  • Spatial Reasoning: CLIP-LCA forces the model to rely on spatial concepts, which are critical for disambiguating objects when visual semantics (color, class) are insufficient or ambiguous.
  • Modality Alignment: Cross-dimensional interference, where generalized language encodings pollute visual grounding, is substantially mitigated by D2M. Each modality’s features are modulated by their matching-dimensional textual cues only.
  • Metric-Invariance: Text encoders’ default behavior of treating equidistant but differently unitized descriptors as distinct can be corrected by simple, consistent data augmentations.

Limitations:

  • Distant-object depth estimation remains challenging, although state-of-the-art methods now improve [email protected] on “Far” scenarios from ~15% to nearly 29%.
  • Methods still generally require text queries to contain explicit geometric relations; performance degrades on purely appearance-based queries or free-form spatial language.
  • Current datasets (e.g., Mono3DRefer) use prompts to enforce clarity and explicitness of referential expressions, potentially limiting generalization to everyday language.

Potential research avenues include:

  • Incorporating more free-form, non-metric spatial language
  • Handling multi-query or sequential instructions
  • Stronger unsupervised and few-shot approaches for rare spatial constructs
  • Synergistic integration with monocular 3D detection advances for broader scene understanding

7. Summary Table of Representative Methods

Model | Main Innovations | Key Results ([email protected])
Mono3DVG-TR (Zhan et al., 2023) | Transformer + dual text-guided adapters, no language augmentation | 44.25
Mono3DVG-TGE (Li et al., 26 Aug 2025) | 3DTE unit randomization, TGE attention | 51.21 (+6.96)
Mono3DVG-EnSD (Li et al., 10 Nov 2025) | CLIP-LCA token masking, D2M decoupling | 52.85 (+8.60)

All methods are evaluated on Mono3DRefer under consistent protocols.


Mono3DVG establishes a challenging and representative benchmark for joint vision-language and spatial reasoning under realistic monocular constraints. Innovations that exploit spatial phrases, mitigate language-induced aliasing, and fuse features in a dimension-aware manner have rapidly advanced the state of the art, with strong performance reported under stringent evaluation at large distances and diverse occlusion conditions.
