Open-Vocabulary 3D Ambiguity Detection
- The paper introduces a novel methodology that decouples perception and reasoning to detect ambiguity in 3D instructions, ensuring a unique and safe command interpretation.
- AmbiVer leverages adaptive keyframe selection, ray-based fusion, and multi-view evidence to robustly interpret and disambiguate free-form natural language commands in 3D scenes.
- Empirical evaluations on the Ambi3D benchmark show that AmbiVer outperforms existing baselines, reaching 81.29% accuracy and 81.65% F1 on a task motivated by safety-critical applications.
Open-vocabulary 3D instruction ambiguity detection is a foundational task in embodied AI: a model must determine whether a natural language command issued with respect to a 3D environment is unambiguous, i.e., admits a single concrete interpretation given the present scene. The problem is especially critical for safety in high-stakes domains (e.g., surgical robotics), where ambiguous language can induce catastrophic downstream errors. Together, the formalism, dataset resources, algorithmic solutions, and empirical results for this task are presented as a new paradigm for reliability in multimodal AI agents (Ding et al., 9 Jan 2026).
1. Formal Task Definition and Ambiguity Taxonomy
The open-vocabulary 3D instruction ambiguity detection task takes as input a 3D scene representation and a free-form instruction $T$. The output is a binary ambiguity label or, optionally, a confidence score in $[0, 1]$.
Scene and Instruction Representation
- 3D Scene:
  - Egocentric video stream $V$, with corresponding camera poses $E$.
  - Allocentric 3D reconstruction: point cloud $P$.
  - BEV (bird's-eye view) projection: image $I_{\text{bev}}$.
  - Multi-view keyframes, selected via pose deviation thresholds.
- Instruction $T$: Free-form, open-vocabulary natural language.
Output
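- A mapping from a scene and instruction pair to a binary label in $\{0, 1\}$, or to a confidence score in $[0, 1]$ obtained as the sigmoid of a classifier logit.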
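The relationship between the two output forms can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the convention that 1 means "ambiguous", and the 0.5 threshold are all assumptions.

```python
import math


def ambiguity_confidence(logit: float) -> float:
    """Map a raw classifier logit to a confidence in [0, 1] via a sigmoid.

    The two branches avoid overflow in exp() for large-magnitude logits.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def ambiguity_label(logit: float, threshold: float = 0.5) -> int:
    """Return 1 (ambiguous, by assumption) if the confidence reaches the threshold."""
    return int(ambiguity_confidence(logit) >= threshold)


if __name__ == "__main__":
    for logit in (-2.0, 0.0, 2.0):
        print(logit, ambiguity_confidence(logit), ambiguity_label(logit))
```

In a safety-critical setting the threshold would likely be tuned to favour recall on ambiguous instructions, but the source does not specify a value.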
Ambiguity Types
- Referential Ambiguity:
  - Instance ambiguity: Multiple referents for an object class label (e.g., “the cup” with several cups in view).
  - Attribute ambiguity: Adjectives insufficient to isolate a referent (e.g., “large chair” when two are similarly sized).
  - Spatial ambiguity: Instructional locatives affected by observer viewpoint (e.g., “to the left of the vase”).
- Execution Ambiguity: Verb admits multiple plausible actions (e.g., “handle the bicycle” could mean push, lift, rotate, etc.).
- An instruction is unambiguous only if it uniquely specifies the object(s) and prescribes a single clear action.
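The taxonomy and the uniqueness rule above can be written down as a small data structure. The sketch below is illustrative only: the enum values, the function names, and the idea of summarising a scene by a referent count and an action count are assumptions, not part of the paper.

```python
from enum import Enum


class AmbiguityType(Enum):
    INSTANCE = "instance"
    ATTRIBUTE = "attribute"
    SPATIAL = "spatial"
    ACTION = "action"


# The three referential subtypes; ACTION is the execution ambiguity.
REFERENTIAL = {AmbiguityType.INSTANCE, AmbiguityType.ATTRIBUTE, AmbiguityType.SPATIAL}


def ambiguity_types(num_referents, num_actions,
                    referential_cause=AmbiguityType.INSTANCE):
    """Return the set of ambiguity types implied by the counts.

    num_referents: how many scene objects match the instruction's description.
    num_actions: how many distinct actions the verb plausibly denotes.
    referential_cause: which referential subtype explains multiple referents
        (hypothetical input; a real system would have to infer it).
    """
    if referential_cause not in REFERENTIAL:
        raise ValueError("referential_cause must be a referential subtype")
    types = set()
    if num_referents > 1:
        types.add(referential_cause)
    if num_actions > 1:
        types.add(AmbiguityType.ACTION)
    return types


def is_unambiguous(num_referents, num_actions):
    """Unambiguous only with exactly one referent and exactly one action."""
    return num_referents == 1 and num_actions == 1


if __name__ == "__main__":
    print(ambiguity_types(2, 1))                          # several cups in view
    print(ambiguity_types(1, 3))                          # "handle the bicycle"
    print(ambiguity_types(2, 2, AmbiguityType.SPATIAL))   # mixed case
```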
2. Ambi3D Benchmark: Dataset Design and Annotation
Ambi3D provides a comprehensive benchmark for evaluating ambiguity detection with 22,081 instructions covering 703 ScanNet-derived 3D scenes.
Data Generation Pipeline
- Grounded Instructions (37.2%): LLM-converted ScanQA question–answer pairs to executable commands.
- Synthetic Ambiguities (34.1%): LLM generation via scene metadata and templates for each subtype of ambiguity.
- Hard Negatives (28.7%): Manually authored (by experts) unambiguous commands that appear ambiguous at first glance.
Annotation Protocol
- 12 trained annotators; three-way annotation for binary ambiguity and subtype.
- Only samples with unanimous agreement on the binary label are kept; the subtype is decided by majority vote.
- Samples with fewer than 2 votes for any subtype are discarded.
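One plausible reading of this protocol is sketched below. The function name, the tuple layout of an annotation, and the subtype strings are hypothetical, and the rule is an interpretation of the description above rather than the authors' annotation tooling.

```python
from collections import Counter
from typing import Optional


def aggregate(annotations):
    """Combine three (is_ambiguous, subtype) annotations into one label.

    Returns (is_ambiguous, subtype) or None when the sample should be dropped:
    - the binary label must be unanimous;
    - for ambiguous samples, the winning subtype needs at least 2 votes.
    """
    labels = {is_amb for is_amb, _ in annotations}
    if len(labels) != 1:
        return None  # annotators disagree on the binary label
    is_ambiguous = labels.pop()
    if not is_ambiguous:
        return (False, None)
    votes = Counter(sub for _, sub in annotations if sub is not None)
    if not votes:
        return None
    subtype, count = votes.most_common(1)[0]
    if count < 2:
        return None  # no subtype reaches a majority
    return (True, subtype)


if __name__ == "__main__":
    print(aggregate([(True, "instance"), (True, "instance"), (True, "spatial")]))
    print(aggregate([(True, "instance"), (False, None), (True, "instance")]))
```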
Dataset Statistics
| Split | Scenes | Instructions | Unambiguous | Ambiguous | Inst. | Action | Attr. | Spatial |
|---|---|---|---|---|---|---|---|---|
| Training | 649 | 19,950 | 9,407 | 10,543 | ~48% | ~20% | ~19% | ~14% |
| Test | 54 | 2,131 | 1,073 | 1,058 | ~46% | ~20% | ~19% | ~15% |
| Total | 703 | 22,081 | 10,480 | 11,601 | 5,333 | 2,302 | 2,216 | 1,750 |
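- The subtype columns (Inst., Action, Attr., Spatial) give approximate shares of ambiguous instructions for the Training and Test rows and absolute counts for the Total row.
- Instruction lengths: mean 8.08 words, with a substantial spread (28% short, 22% long).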
Scene Preprocessing
- Point cloud reconstruction of $P$ from the video and camera poses.
- BEV image creation ($I_{\text{bev}}$) from the point cloud.
- Adaptive selection of keyframes.
- Multi-view 2D object detection (Grounding DINO) using the parsed query $Q_t$.
- Ray-based clustering to form candidate object groups using proximity, angle, and area thresholds.
- Top-K candidate selection (typically $K = 6$) by group confidence scores for downstream reasoning.
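The keyframe step can be illustrated with a greedy pose-deviation filter whose thresholds are relaxed until the frame budget is met. This is a sketch under stated assumptions: the pose representation (camera position plus viewing direction), the default thresholds, and the growth factor are all hypothetical, and the paper's actual selection rule may differ.

```python
import math


def _angle(u, v):
    """Angle in radians between two non-zero 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def select_keyframes(poses, trans_thresh, rot_thresh):
    """Keep a frame when it deviates enough from the last kept frame.

    poses: list of (position, view_direction) tuples, one per video frame.
    Returns the indices of the kept frames.
    """
    if not poses:
        return []
    kept = [0]
    for i in range(1, len(poses)):
        last_pos, last_dir = poses[kept[-1]]
        pos, direction = poses[i]
        moved = math.dist(pos, last_pos) > trans_thresh
        turned = _angle(direction, last_dir) > rot_thresh
        if moved or turned:
            kept.append(i)
    return kept


def adaptive_keyframes(poses, n_target=100, trans_thresh=0.1,
                       rot_thresh=math.radians(10), growth=1.25, max_iters=50):
    """Relax both thresholds until at most n_target keyframes remain."""
    kept = select_keyframes(poses, trans_thresh, rot_thresh)
    for _ in range(max_iters):
        if len(kept) <= n_target:
            break
        trans_thresh *= growth
        rot_thresh *= growth
        kept = select_keyframes(poses, trans_thresh, rot_thresh)
    return kept


if __name__ == "__main__":
    # A camera sliding along the x axis while looking down the y axis.
    poses = [((float(i), 0.0, 0.0), (0.0, 1.0, 0.0)) for i in range(10)]
    print(adaptive_keyframes(poses, n_target=3, trans_thresh=0.5, growth=2.0))
```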
3. Baselines and Empirical Evaluation
Multiple large vision-language and 3D LLM architectures are adapted for zero-shot ambiguity detection. Each is assessed via a rule-based extraction of binary ambiguity from natural language outputs.
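The source does not list the extraction rules, so the following is only a guess at what such a rule-based extractor might look like. The patterns and the function name are hypothetical; the key design point is that negated phrasings are checked before the bare word "ambiguous".

```python
import re

# Checked first, so that "unambiguous" and "not ambiguous" are not mistaken
# for a positive mention of ambiguity.
_UNAMBIGUOUS = re.compile(
    r"\b(?:unambiguous|not\s+ambiguous|no\s+ambiguity)\b", re.IGNORECASE
)
_AMBIGUOUS = re.compile(r"\bambigu(?:ous|ity)\b", re.IGNORECASE)


def extract_label(response: str):
    """Map a free-form model response to 'Ambiguous', 'Unambiguous', or None."""
    if _UNAMBIGUOUS.search(response):
        return "Unambiguous"
    if _AMBIGUOUS.search(response):
        return "Ambiguous"
    return None  # the response never commits to a verdict


if __name__ == "__main__":
    print(extract_label("This command is ambiguous: there are two cups."))
    print(extract_label("The instruction is not ambiguous."))
    print(extract_label("I can see a chair and a table."))
```

Responses that mention both verdicts would need a more careful policy than this first-match rule.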
Compared Methods
- 3D-LLM (NeurIPS’23)
- Chat-Scene (NeurIPS’24)
- Video-3D LLM (CVPR’25)
- LSceneLLM (CVPR’25)
- LLaVA-3D (ICCV’25)
- AmbiVer (proposed)
Quantitative Comparison
| Method | Acc | Prec | Rec | F1 | Inst. | Attr. | Spatial | Action | Unamb. |
|---|---|---|---|---|---|---|---|---|---|
| 3D-LLM | 49.16 | 56.93 | 13.28 | 21.54 | 13.6 | 14.9 | 11.7 | 12.3 | 88.9 |
| Chat-Scene | 47.73 | 50.75 | 17.58 | 26.11 | 18.9 | 11.8 | 17.1 | 20.3 | 81.1 |
| Video-3D LLM | 50.10 | 60.65 | 14.31 | 23.16 | 12.9 | 35.8 | 7.0 | 2.4 | 89.7 |
| LSceneLLM | 48.78 | 58.04 | 9.07 | 15.70 | 9.4 | 11.0 | 6.1 | 8.9 | 92.7 |
| LLaVA-3D | 64.21 | 63.93 | 73.17 | 68.24 | 59.6 | 95.6 | 62.9 | 91.0 | 54.3 |
| AmbiVer | 81.29 | 84.23 | 79.23 | 81.65 | 80.1 | 84.0 | 75.1 | 75.6 | 83.6 |
Failure analyses indicate that off-the-shelf models exhibit either a pronounced “unambiguous bias” (low recall on ambiguous instructions alongside inflated scores on clear cases, as with 3D-LLM, Chat-Scene, Video-3D LLM, and LSceneLLM) or an “ambiguous bias” (misclassifying clear instructions as ambiguous, as with LLaVA-3D, which scores only 54.3 in the Unamb. column). No baseline except LLaVA-3D achieves a zero-shot F1 above 30%. AmbiVer holds a substantial margin in both zero-shot and supervised settings.
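The F1 column follows from the precision and recall columns as their harmonic mean. The short check below recomputes it for three rows of the table; the function name is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table, in percent.
rows = {
    "3D-LLM": (56.93, 13.28, 21.54),
    "LLaVA-3D": (63.93, 73.17, 68.24),
    "AmbiVer": (84.23, 79.23, 81.65),
}

if __name__ == "__main__":
    for name, (p, r, reported) in rows.items():
        print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported = {reported:.2f}")
```

The low F1 of the "unambiguous bias" models comes almost entirely from recall: a precision near 57% cannot compensate for a recall near 13%.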
4. AmbiVer: Two-Stage Ambiguity Detection Architecture
AmbiVer operationalizes ambiguity adjudication as a two-stage pipeline: an explicit evidence collection (“Perception Engine”) followed by cross-modal reasoning (“Reasoning Engine”).
Stage 1: Perception Engine (Visual Evidence Collection)
- Instruction Decoupling: Using spaCy, parse the instruction $T$ into {action, query $Q_t$, attributes, relations}.
- Global Feature Acquisition: Reconstruct the 3D point cloud and generate the BEV image.
- Detailed Feature Acquisition:
  - Adaptive selection of keyframes.
  - Per-keyframe open-vocabulary detection (Grounding DINO) with the query $Q_t$.
  - Ray-based fusion merges multi-view detections into groups by spatial and angular consistency.
  - Reliability scoring: each group $k$ is assigned a reliability score $S_k$.
  - Candidate selection: take the top $K$ groups by score, each with its representative views.
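The fusion and candidate-selection steps can be sketched as a greedy clustering over back-projected detections. Everything concrete here is an assumption made for illustration: the `Detection` fields, the greedy assignment order, the exact form of the proximity and angle tests, and in particular the reliability score, which is a placeholder (summed detector confidence) and not the paper's formula.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Detection:
    cam_pos: tuple   # camera centre in world coordinates
    point: tuple     # 3D point where the ray through the box centre meets the scene
    area_px: float   # area of the 2D box in pixels
    score: float     # detector confidence


@dataclass
class Group:
    members: list = field(default_factory=list)

    @property
    def centroid(self):
        n = len(self.members)
        return tuple(sum(m.point[i] for m in self.members) / n for i in range(3))


def _ray_angle(cam_pos, a, b):
    """Angle at the camera between the rays towards points a and b."""
    u = [a[i] - cam_pos[i] for i in range(3)]
    v = [b[i] - cam_pos[i] for i in range(3)]
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))


def ray_based_fusion(detections, eps_d, theta_a, sigma_s):
    """Greedily merge detections that agree in 3D position and ray direction."""
    groups = []
    for det in sorted(detections, key=lambda d: -d.score):
        if det.area_px < sigma_s:
            continue  # drop boxes too small to be reliable
        for g in groups:
            c = g.centroid
            close = math.dist(det.point, c) <= eps_d
            aligned = _ray_angle(det.cam_pos, det.point, c) <= theta_a
            if close and aligned:
                g.members.append(det)
                break
        else:
            groups.append(Group([det]))
    return groups


def select_top_k(groups, k=6):
    """Rank groups by a placeholder reliability (sum of detector scores)."""
    ranked = sorted(groups, key=lambda g: sum(m.score for m in g.members), reverse=True)
    return ranked[:k]
```

A usage example with two objects seen from several viewpoints:

```python
dets = [
    Detection((0, -3, 0), (0.0, 0.0, 0.0), 900, 0.9),
    Detection((3, -3, 0), (0.1, 0.0, 0.0), 800, 0.8),
    Detection((-3, -3, 0), (0.0, 0.1, 0.0), 850, 0.7),
    Detection((5, -3, 0), (5.0, 0.0, 0.0), 700, 0.6),
    Detection((0, -3, 0), (0.05, 0.0, 0.0), 10, 0.95),  # tiny box, filtered out
]
groups = ray_based_fusion(dets, eps_d=0.5, theta_a=math.radians(5), sigma_s=100)
best = select_top_k(groups, k=1)[0]
```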
Stage 2: Reasoning Engine (Vision–Language Judgment)
- Dossier Construction: Bundle the instruction $T$, its parse, the BEV map, and the cropped candidate evidence.
- Zero-shot VLM Adjudication: Use Qwen-3-VL, prompted with the assembled dossier and ambiguity criteria, to return a structured verdict:
```
{
  "label": "Ambiguous" / "Unambiguous",
  "types": [...],
  "explanation": "...",
  "clarification_query": "..."
}
```
- (Optional) Fine-Tuning: The model can also be trained end-to-end on Ambi3D with LoRA.
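Because the VLM returns free text that is only expected to follow the structured verdict format, a caller would typically validate it before acting on it. The sketch below assumes the four field names from the verdict schema; the function name, the defaults for missing optional fields, and the example strings are hypothetical.

```python
import json

ALLOWED_LABELS = {"Ambiguous", "Unambiguous"}


def parse_verdict(raw: str) -> dict:
    """Parse and validate a structured verdict returned by the VLM.

    Raises ValueError when the payload is not a JSON object, the label is
    not one of the two allowed values, or 'types' is not a list.
    """
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    if verdict.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {verdict.get('label')!r}")
    types = verdict.get("types", [])
    if not isinstance(types, list):
        raise ValueError("'types' must be a list")
    return {
        "label": verdict["label"],
        "types": types,
        "explanation": verdict.get("explanation", ""),
        "clarification_query": verdict.get("clarification_query", ""),
    }


if __name__ == "__main__":
    raw = (
        '{"label": "Ambiguous", "types": ["Instance"], '
        '"explanation": "Two trash cans are visible.", '
        '"clarification_query": "Which trash can do you mean?"}'
    )
    print(parse_verdict(raw))
```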
AmbiVer’s full algorithmic flow is as follows:
```
function AmbiVer(V, E, T):
    # 1. Perception Engine
    {action, Q_t, attrs, rels} ← parse_instruction(T)
    P ← reconstruct_point_cloud(V, E)
    I_bev ← project_to_bev(P)
    keyframes ← adaptive_keyframe_selection(V, E, N_target=100)
    D ← {}
    for each (I_v, pose_v) in keyframes:
        D += grounding_dino(I_v, Q_t)
    G ← ray_based_fusion(D, ε_d, θ_a, σ_s)
    C ← select_top_K(G, K=6)   # includes S_k, representative crops
    # 2. Reasoning Engine
    Dossier ← bundle_evidence(T, {action, Q_t, attrs, rels}, I_bev, C)
    verdict ← zero_shot_vlm(Dossier)
    return verdict
```
5. Experimental Results and Ablation
Evaluation Setup
- Hardware: Tesla V100 / A100 GPUs (4 × 32 GB).
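- Key parameters: target keyframe count $N_{\text{target}} = 100$, top-$K$ candidates with $K = 6$, and the ray-fusion thresholds $\epsilon_d$ (distance, in m), $\theta_a$ (angle), and $\sigma_s$ (area, in px).
- VLM: Qwen-3-VL with temperature=0.
- Fine-tuning: 3 epochs, AdamW with a cosine learning-rate schedule, weight decay 0.01, 3% warmup, and LoRA with dropout 0.1.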
Main Results
- AmbiVer zero-shot achieves 81.29% accuracy and 81.65% F1, outperforming all baselines and ablation variants.
- Against the strongest zero-shot baseline (LLaVA-3D), this is a gain of 13.41 F1 points (81.65 vs. 68.24); gains are also reported over the best supervised baseline.
Ablation Studies
| Ablation Case | Acc | F1 |
|---|---|---|
| w/o parsing | 62.06% | 65.62% |
| w/o adaptive keyframes | 65.04% | 54.56% |
| w/o 3D fusion | 58.59% | 38.42% |
| w/o representative view weights | 79.93% | 80.31% |
| Full pipeline (Ours) | 81.29% | 81.65% |

| Reasoning Engine Ablations | Acc | F1 |
|---|---|---|
| w/o BEV map | 56.95% | 70.57% |
| w/o local evidence | 60.71% | 53.56% |
| w/o all visual info | 53.99% | 56.91% |
| Full reasoning (Ours) | 81.29% | 81.65% |
Qualitative Case Studies
- “Pick up the backpack”: Unambiguous, single candidate.
- “Pass me the trash can”: Ambiguous (Instance), two plausible candidates.
- “Please handle the bicycle”: Ambiguous (Action), multiple plausible actions.
- “Adjust the pillow on the couch”: Mixed spatial/action ambiguity, VLM generates clarification queries.
6. Open Problems and Future Directions
Grounded ambiguity detection is established as a critical upstream safety check in embodied agents and robotics, providing an objective, evidence-based adjudication step before interpreting or executing 3D language commands.
Limitations
- Reliance on pre-trained open-vocabulary detectors (e.g., Grounding DINO) is a vulnerability under occlusion and unseen categories.
- Dependence on prompt design exposes reasoning to rare edge-case misparses.
- The framework yields only a binary verdict; it does not currently integrate active clarification dialogue or uncertainty awareness.
Future Directions
- Development of multi-turn clarification dialogue agents, leveraging the VLM verdict explanations.
- Integration of uncertainty-aware planning to mediate between execution and user query.
- Adaptation to dynamic and outdoor scenes, including temporal ambiguity.
- Joint or end-to-end training for the perception and reasoning stages, potentially unlocking differentiable evidence pathways.
Open-vocabulary 3D instruction ambiguity detection thus forms a research agenda at the intersection of vision-language understanding, embodied AI safety, and pragmatic language interpretation (Ding et al., 9 Jan 2026).