Open-Vocabulary 3D Ambiguity Detection

Updated 16 January 2026

The paper introduces a novel methodology that decouples perception and reasoning to detect ambiguity in 3D instructions, ensuring a unique and safe command interpretation.
AmbiVer leverages adaptive keyframe selection, ray-based fusion, and multi-view evidence to robustly interpret and disambiguate free-form natural language commands in 3D scenes.
Empirical evaluations on the Ambi3D benchmark demonstrate that AmbiVer outperforms existing baselines, achieving superior accuracy and F1 scores in safety-critical applications.

Open-vocabulary 3D instruction ambiguity detection is a foundational task in embodied AI, where a model must determine whether a natural language command issued with respect to a 3D environment is unambiguous—i.e., admits a single concrete interpretation given the present scene. This problem is especially critical for safety in high-stakes domains (e.g., surgical robotics), where ambiguous language can induce catastrophic downstream errors. The formalism, dataset resources, algorithmic solutions, and empirical results for this task define a new paradigm in reliability for multimodal AI agents (Ding et al., 9 Jan 2026).

1. Formal Task Definition and Ambiguity Taxonomy

The open-vocabulary 3D instruction ambiguity detection task takes as input a 3D scene representation $S$ and a free-form instruction $T$. The task output is a binary ambiguity label $y\in\{\text{Unambiguous},\ \text{Ambiguous}\}$ or, optionally, a confidence score in $[0,1]$.

Scene and Instruction Representation

3D Scene $S$:
- Egocentric video stream: $V=\{I_t\}_{t=1}^n$, with corresponding camera poses $E=\{E_t\}_{t=1}^n$.
- Allocentric 3D reconstruction: $P=\mathcal{R}(\{(I_t, E_t)\})$.
- BEV (bird’s-eye view) projection: $I_{\text{bev}}$.
- Multi-view keyframes: $\{I_v\}$, selected via pose deviation thresholds.
Instruction $T$: Free-form, open-vocabulary natural language.
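Selecting keyframes by pose deviation can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function name `select_keyframes`, the pose encoding (position plus unit viewing direction), and both threshold values are assumptions made for the example.

```python
import math

def select_keyframes(poses, min_translation=0.5, min_rotation_deg=30.0):
    """Keep a frame when its pose deviates enough from the last kept frame.

    poses: list of (position, forward) pairs, where position is (x, y, z)
    in metres and forward is a unit viewing direction.
    Both thresholds are hypothetical values chosen for illustration.
    """
    if not poses:
        return []
    kept = [0]  # always keep the first frame
    for idx in range(1, len(poses)):
        pos, fwd = poses[idx]
        last_pos, last_fwd = poses[kept[-1]]
        translation = math.dist(pos, last_pos)
        # Angle between viewing directions, clamped for numerical safety.
        cos_angle = sum(a * b for a, b in zip(fwd, last_fwd))
        cos_angle = max(-1.0, min(1.0, cos_angle))
        rotation = math.degrees(math.acos(cos_angle))
        if translation >= min_translation or rotation >= min_rotation_deg:
            kept.append(idx)
    return kept
```

A frame is compared with the most recently kept frame rather than with its immediate predecessor, so slow drift still triggers a new keyframe eventually.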

Output

A mapping $F : (S, T) \to y$ or, for the confidence variant, $F : (S, T) \to \sigma(z)$, where $\sigma(z)=1/(1+e^{-z})$ is the sigmoid applied to the classifier logit $z$.
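The relation between the logit, the confidence score, and the binary label can be sketched as below. The convention that a higher score means "Ambiguous" and the 0.5 threshold are assumptions for illustration; the source does not fix either.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable sigmoid 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def ambiguity_label(z: float, threshold: float = 0.5):
    """Map a classifier logit to (label, score).

    Assumes higher scores indicate ambiguity; the threshold is hypothetical.
    """
    score = sigmoid(z)
    label = "Ambiguous" if score >= threshold else "Unambiguous"
    return label, score
```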

Ambiguity Types

Referential Ambiguity:
- Instance ambiguity: Multiple referents for an object class label (e.g., “the cup” with several cups in view).
- Attribute ambiguity: Adjectives insufficient to isolate a referent (e.g., “large chair” when two are similarly sized).
- Spatial ambiguity: Instructional locatives affected by observer viewpoint (e.g., “to the left of the vase”).
Execution Ambiguity: Verb admits multiple plausible actions (e.g., “handle the bicycle” could mean push, lift, rotate, etc.).
An instruction is unambiguous only if it uniquely specifies the object(s) and prescribes a single clear action.

2. Ambi3D Benchmark: Dataset Design and Annotation

Ambi3D provides a comprehensive benchmark for evaluating ambiguity detection with 22,081 instructions covering 703 ScanNet-derived 3D scenes.

Data Generation Pipeline

Grounded Instructions (37.2%): LLM-converted ScanQA question–answer pairs to executable commands.
Synthetic Ambiguities (34.1%): LLM generation via scene metadata and templates for each subtype of ambiguity.
Hard Negatives (28.7%): Manually authored (by experts) unambiguous commands that appear ambiguous at first glance.

Annotation Protocol

12 trained annotators; three-way annotation for binary ambiguity and subtype.
Only samples with unanimous agreement are kept for binary labels; sub-type by majority vote.
Samples in which no sub-type receives at least 2 votes are discarded.
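The filtering rules above can be sketched as a small aggregation function. The data layout (a list of `(binary_label, subtype)` tuples per sample) and the function name `aggregate` are assumptions for illustration, and the reading of the discard rule as "no sub-type reaches 2 votes" is an interpretation.

```python
from collections import Counter

def aggregate(annotations):
    """Combine three annotators' votes for one sample.

    annotations: list of (binary_label, subtype_or_None) tuples.
    Returns a dict with the agreed label and sub-type, or None if the
    sample should be discarded.
    """
    labels = {label for label, _ in annotations}
    if len(labels) != 1:
        return None  # binary label is not unanimous
    label = labels.pop()
    if label == "Unambiguous":
        return {"label": label, "subtype": None}
    votes = Counter(sub for _, sub in annotations if sub is not None)
    if not votes:
        return None
    subtype, count = votes.most_common(1)[0]
    if count < 2:
        return None  # no sub-type reached a 2-vote majority
    return {"label": label, "subtype": subtype}
```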

Dataset Statistics

Split	Scenes	Instructions	Unambiguous	Ambiguous	Inst.	Action	Attr.	Spatial
Training	649	19,950	9,407	10,543	~48%	~20%	~19%	~14%
Test	54	2,131	1,073	1,058	~46%	~20%	~19%	~15%
Total	703	22,081	10,480	11,601	5,333	2,302	2,216	1,750

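In the Training and Test rows, the sub-type columns (Inst., Action, Attr., Spatial) give approximate shares of that split's ambiguous instructions; in the Total row they give absolute counts, which sum to the 11,601 ambiguous instructions.
Instruction lengths: mean 8.08 words, with substantial spread ($\sim$28% short, $\sim$22% long).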

Scene Preprocessing

Point cloud reconstruction $P = \mathcal{R}(\{(I_t,E_t)\})$.
BEV image creation $I_{\text{bev}} = \mathcal{T}(P,E_{\text{top}})$.
Adaptive selection of $N_{\text{target}} \approx 100$ keyframes.
Multi-view 2D object detection (Grounding DINO) using parsed query $Q_t$.
Ray-based clustering to form candidate object groups using proximity, angle, and area thresholds.
Top-K candidate selection (typically $K=6$) by group confidence scores for downstream reasoning.
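The BEV image creation step above can be pictured as a top-down rasterization of the point cloud. The sketch below is a simplification: it uses an orthographic height map instead of rendering from a top camera pose $E_{\text{top}}$, assumes $z$ is the up axis, and the function name `project_to_bev` and the cell size are hypothetical.

```python
def project_to_bev(points, cell_size=0.05):
    """Rasterize (x, y, z) points into a top-down height map.

    Returns (grid, origin). grid[row][col] holds the highest z seen in
    that cell, or None where the cell is empty; origin is the
    (x_min, y_min) corner of the grid in world coordinates.
    """
    if not points:
        return [], (0.0, 0.0)
    x_min = min(p[0] for p in points)
    y_min = min(p[1] for p in points)
    x_max = max(p[0] for p in points)
    y_max = max(p[1] for p in points)
    cols = int((x_max - x_min) / cell_size) + 1
    rows = int((y_max - y_min) / cell_size) + 1
    grid = [[None] * cols for _ in range(rows)]
    for x, y, z in points:
        col = int((x - x_min) / cell_size)
        row = int((y - y_min) / cell_size)
        # Keep the topmost surface, as seen from above.
        if grid[row][col] is None or z > grid[row][col]:
            grid[row][col] = z
    return grid, (x_min, y_min)
```

Keeping the maximum height per cell mimics what a downward-looking view would see; a color BEV image would store the color of that topmost point instead.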

3. Baselines and Empirical Evaluation

Multiple large vision-language and 3D LLM architectures are adapted for zero-shot ambiguity detection. Each is assessed via a rule-based extraction of binary ambiguity from natural language outputs.
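A rule-based extractor of this kind might look like the sketch below. The keyword patterns, their precedence, and the fallback label are all assumptions for illustration; the actual rules used in the evaluation are not given in this summary.

```python
import re

def extract_label(response: str, default: str = "Unambiguous") -> str:
    """Map a free-form model response to a binary ambiguity label.

    Explicit negations are checked first so that "unambiguous" or
    "not ambiguous" is never read as "ambiguous". The default label
    for responses with no matching keyword is a hypothetical choice.
    """
    text = response.lower()
    if re.search(r"\b(unambiguous|not ambiguous)\b", text):
        return "Unambiguous"
    if re.search(r"\b(ambiguous|unclear|not clear)\b", text):
        return "Ambiguous"
    if re.search(r"\bclear\b", text):
        return "Unambiguous"
    return default
```

The fallback matters: a model that simply executes the command without commenting on ambiguity is counted under the default label, which can contribute to an apparent bias toward that class.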

Compared Methods

3D-LLM (NeurIPS’23)
Chat-Scene (NeurIPS’24)
Video-3D LLM (CVPR’25)
LSceneLLM (CVPR’25)
LLaVA-3D (ICCV’25)
AmbiVer (proposed)

Quantitative Comparison

Method	Acc	Prec	Rec	F1	Inst.	Attr.	Spatial	Action	Unamb.
3D-LLM	49.16	56.93	13.28	21.54	13.6	14.9	11.7	12.3	88.9
Chat-Scene	47.73	50.75	17.58	26.11	18.9	11.8	17.1	20.3	81.1
Video-3D LLM	50.10	60.65	14.31	23.16	12.9	35.8	7.0	2.4	89.7
LSceneLLM	48.78	58.04	9.07	15.70	9.4	11.0	6.1	8.9	92.7
LLaVA-3D	64.21	63.93	73.17	68.24	59.6	95.6	62.9	91.0	54.3
AmbiVer	81.29	84.23	79.23	81.65	80.1	84.0	75.1	75.6	83.6

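Failure analyses indicate that off-the-shelf models exhibit either a pronounced “unambiguous bias” or an “ambiguous bias”. 3D-LLM, Chat-Scene, Video-3D LLM, and LSceneLLM label most inputs as unambiguous: their recall on ambiguous instructions is only 9–18%, while their scores in the Unamb. column are 81–93%. LLaVA-3D leans the other way, reaching 73.17% recall but only 54.3% on unambiguous instructions. No baseline except LLaVA-3D achieves zero-shot F1 > 30%. AmbiVer demonstrates a substantial margin in both zero-shot and supervised settings.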
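The F1 column follows from the precision and recall columns as their harmonic mean. The short check below recomputes it from the table values; the helper name `f1_score` is just for this example, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the comparison table.
rows = {
    "3D-LLM": (56.93, 13.28, 21.54),
    "Chat-Scene": (50.75, 17.58, 26.11),
    "Video-3D LLM": (60.65, 14.31, 23.16),
    "LSceneLLM": (58.04, 9.07, 15.70),
    "LLaVA-3D": (63.93, 73.17, 68.24),
    "AmbiVer": (84.23, 79.23, 81.65),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported = {reported:.2f}")
```

Because F1 is a harmonic mean, a model with high precision but very low recall (the "unambiguous bias" pattern) still ends up with a low F1.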

4. AmbiVer: Two-Stage Ambiguity Detection Architecture

AmbiVer operationalizes ambiguity adjudication as a two-stage pipeline: an explicit evidence collection (“Perception Engine”) followed by cross-modal reasoning (“Reasoning Engine”).

Stage 1: Perception Engine (Visual Evidence Collection)

Instruction Decoupling: Leveraging spaCy, parse $T$ into {action, query $Q_t$, attributes, relations}.
Global Feature Acquisition: Reconstruct 3D point cloud and generate BEV image.
Detailed Feature Acquisition:
- Adaptive selection of $N_{\text{target}}=100$ keyframes.
- Per-keyframe open-vocabulary detection (Grounding DINO) with $Q_t$.
- Ray-based fusion merges multi-view detections into groups $G_k$ by spatial and angular consistency.
- Reliability scoring: $S_k = \frac{\sum_{d_i\in G_k} s_i\cdot \text{area}(b_i)}{\sum_{d_i\in G_k} \text{area}(b_i)}$
- Candidate selection: take the top $K=6$ groups, with representative views chosen by $f(d_i)=s_i \cdot w_{\text{vis}}(d_i) \cdot w_{\text{bnd}}(d_i)$.
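The reliability score is an area-weighted mean of per-detection confidences, so large, confident boxes dominate a group's score. A minimal sketch follows, assuming $s_i$ is the detector confidence and $b_i$ an axis-aligned box `(x1, y1, x2, y2)` in pixels; the function names and data layout are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); degenerate boxes give 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def reliability_score(detections):
    """Area-weighted mean confidence S_k for one fused group.

    detections: list of (score, box) pairs belonging to the group.
    """
    total_area = sum(box_area(b) for _, b in detections)
    if total_area == 0:
        return 0.0
    return sum(s * box_area(b) for s, b in detections) / total_area

def top_k_groups(groups, k=6):
    """Rank groups by S_k and keep the k highest, as (score, group) pairs."""
    scored = [(reliability_score(g), g) for g in groups]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For example, a group with a 0.9-confidence box of area 100 and a 0.5-confidence box of area 300 scores (90 + 150) / 400 = 0.6, closer to the larger box's confidence.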

Stage 2: Reasoning Engine (Vision–Language Judgment)

Dossier Construction: Bundle $T$ , its parse, BEV map, and cropped candidate evidence.

Zero-shot VLM Adjudication: Use Qwen-3-VL, prompted with the assembled dossier and ambiguity criteria, to return a structured verdict:

{
  "label": "Ambiguous"/"Unambiguous", 
  "types": [...], 
  "explanation": "...", 
  "clarification_query": "..."
}
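A consumer of this verdict would typically validate it before acting on it. The sketch below parses such a response with the standard library; the `Verdict` class, `parse_verdict` function, and the choice to reject unknown labels are assumptions for illustration, and only the four field names come from the schema above.

```python
import json
from dataclasses import dataclass, field

VALID_LABELS = {"Ambiguous", "Unambiguous"}

@dataclass
class Verdict:
    label: str
    types: list = field(default_factory=list)
    explanation: str = ""
    clarification_query: str = ""

def parse_verdict(raw: str) -> Verdict:
    """Parse a JSON verdict string and check that the label is valid."""
    data = json.loads(raw)
    label = data.get("label")
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(
        label=label,
        types=list(data.get("types", [])),
        explanation=str(data.get("explanation", "")),
        clarification_query=str(data.get("clarification_query", "")),
    )
```

Rejecting malformed output explicitly, rather than defaulting to a label, keeps a parsing failure from being silently counted as a safe "Unambiguous" decision.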

(Optional) Fine-Tuning: If trained end-to-end on Ambi3D with LoRA, the loss is:

$\mathcal{L} = \mathcal{L}_{CE}(y,\hat{y}) + \lambda\,\mathcal{L}_{\text{view\_rank}} + \mu\,\mathcal{L}_{\text{bev\_consistency}}$

AmbiVer’s full algorithmic flow is as follows:

function AmbiVer(V, E, T):
    # 1. Perception Engine
    {action, Q_t, attrs, rels} ← parse_instruction(T)
    P ← reconstruct_point_cloud(V, E)
    I_bev ← project_to_bev(P)
    keyframes ← adaptive_keyframe_selection(V, E, N_target=100)
    D ← {}
    for each (I_v, pose_v) in keyframes:
        D += grounding_dino(I_v, Q_t)
    G ← ray_based_fusion(D, ε_d, θ_a, σ_s)
    C ← select_top_K(G, K=6)  # includes S_k, representative crops
    # 2. Reasoning Engine
    Dossier ← bundle_evidence(T, {action,Q_t,attrs,rels}, I_bev, C)
    verdict ← zero_shot_vlm(Dossier)
    return verdict

5. Experimental Results and Ablation

Evaluation Setup

Hardware: Tesla V100 / A100 GPUs (4 × 32 GB).
Key parameters: $N_{\text{target}}=100$, $\epsilon_d=0.3$ m, $\theta_{a,\max}=60^\circ$, $\sigma_s=0.2$, $K=6$, $\delta=4$ px.
VLM: Qwen-3-VL with temperature=0.
Fine-tuning: 3 epochs, AdamW with lr=$2\times10^{-4}$, cosine schedule, wd=0.01, warmup=3%, LoRA rank $r=8$, $\alpha=16$, dropout=0.1.
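For reproducibility, the settings listed above could be collected in a single configuration object. The class and field names below are hypothetical; only the values are taken from the setup description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PerceptionConfig:
    n_target: int = 100             # keyframe budget N_target
    eps_d_m: float = 0.3            # distance threshold epsilon_d, metres
    theta_a_max_deg: float = 60.0   # maximum angle theta_{a,max}, degrees
    sigma_s: float = 0.2            # threshold sigma_s
    top_k: int = 6                  # number of candidate groups K
    delta_px: int = 4               # delta, pixels

@dataclass(frozen=True)
class ReasoningConfig:
    vlm_temperature: float = 0.0    # deterministic decoding

@dataclass(frozen=True)
class FineTuneConfig:
    epochs: int = 3
    optimizer: str = "AdamW"
    learning_rate: float = 2e-4
    lr_schedule: str = "cosine"
    weight_decay: float = 0.01
    warmup_ratio: float = 0.03
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
```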

Main Results

AmbiVer zero-shot achieves $81.29\%$ accuracy and $81.65\%$ F1, outperforming all baselines and ablation variants.
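Against the zero-shot baselines in the Section 3 comparison table, this is a gain of 13.4 F1 points over the strongest baseline (LLaVA-3D, 68.24) and of 55.5 to 66.0 points over the other four; the reported gain over the best supervised baseline is $+2.4$ F1 points.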

Ablation Studies

Ablation Case	Acc	F1
w/o parsing	62.06%	65.62%
w/o adaptive keyframes	65.04%	54.56%
w/o 3D fusion	58.59%	38.42%
w/o representative view weights	79.93%	80.31%
Full pipeline (Ours)	81.29%	81.65%

Reasoning Engine Ablations	Acc	F1
w/o BEV map	56.95%	70.57%
w/o local evidence	60.71%	53.56%
w/o all visual info	53.99%	56.91%
Full reasoning (Ours)	81.29%	81.65%

Qualitative Case Studies

“Pick up the backpack”: Unambiguous, single candidate.
“Pass me the trash can”: Ambiguous (Instance), two plausible candidates.
“Please handle the bicycle”: Ambiguous (Action), multiple plausible actions.
“Adjust the pillow on the couch”: Mixed spatial/action ambiguity, VLM generates clarification queries.

6. Open Problems and Future Directions

Grounded ambiguity detection is established as a critical upstream safety check in embodied agents and robotics, providing an objective, evidence-based adjudication step before interpreting or executing 3D language commands.

Limitations

Reliance on pre-trained open-vocabulary detectors (e.g., Grounding DINO) is a vulnerability under occlusion and unseen categories.
Dependence on prompt design exposes reasoning to rare edge-case misparses.
The framework yields only a binary verdict; it does not currently integrate active clarification dialogue or uncertainty awareness.

Future Directions

Development of multi-turn clarification dialogue agents, leveraging the VLM verdict explanations.
Integration of uncertainty-aware planning to mediate between execution and user query.
Adaptation to dynamic and outdoor scenes, including temporal ambiguity.
Joint or end-to-end training for the perception and reasoning stages, potentially unlocking differentiable evidence pathways.

Open-vocabulary 3D instruction ambiguity detection thus forms a research agenda at the intersection of vision-language understanding, embodied AI safety, and pragmatic language interpretation (Ding et al., 9 Jan 2026).

References (1)

Ding et al. (2026). Open-Vocabulary 3D Instruction Ambiguity Detection.