3D-POPE: Benchmarking Object Hallucination in 3D-LLMs
- 3D-POPE is a benchmark that evaluates object hallucination in 3D-grounded LLMs by measuring how reliably model predictions align with actual object presence in real indoor scene graphs.
- It employs a binary existence probing task with diverse negative sampling regimes—Random, Popular, and Adversarial—to diagnose model biases and over-affirmation.
- The benchmark informs advancements in grounding techniques by comparing models like 3D-LLM and 3D-VCD, highlighting significant reductions in hallucination rates with contrastive decoding.
3D-POPE is a benchmark expressly developed to evaluate object hallucination in 3D-grounded LLMs (3D-LLMs) operating over real 3D scenes. Designed for systematic diagnosis and comparative analysis, 3D-POPE quantifies how reliably a model’s predictions about object presence are grounded in actual 3D scene evidence rather than spurious language or vision priors. The benchmark draws conceptual lineage from earlier 2D hallucination tests but is distinguished by its focus on object-centric 3D scene graphs derived from real indoor scans, fine-grained negative sampling strategies, and its role as a public leaderboard for grounding evaluation in embodied AI (Ogunleye et al., 9 Apr 2026, Yang et al., 2024).
1. Task Formulation and Objectives
3D-POPE frames hallucination assessment as a binary “existence probing” task over 3D scenes. For each probe, a model is presented with a 3D scene and an object category $c$, and must answer the templated query:
“Is there a [c] in this scene?”
The ground truth is determined by semantic labels derived from 3D reconstructions. The evaluation focuses strictly on object presence, eschewing open-ended or attribute-based reasoning. The principal objectives are: (i) to quantify the rate at which a 3D-LLM hallucinates objects not present (false positives), (ii) to compare models across defined negative sampling complexities, and (iii) to provide a reproducible, leaderboard-driven protocol for measuring grounding improvement in 3D-embodied agents (Yang et al., 2024).
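The probing step can be sketched in a few lines. This is a minimal illustration, not the benchmark's released code: the function names and the answer-parsing rule (first standalone "yes" or "no" decides, anything else counts as "no") are assumptions made for the example.

```python
import re

# Query template quoted in the task description.
PROMPT_TEMPLATE = "Is there a {c} in this scene?"


def build_probe(object_class: str) -> str:
    """Fill the existence-probing template for one object category."""
    return PROMPT_TEMPLATE.format(c=object_class)


def parse_yes_no(response: str) -> bool:
    """Map a free-form model response to a binary answer.

    Assumption: the first standalone 'yes' or 'no' decides the answer;
    a response containing neither is treated as 'no'.
    """
    match = re.search(r"\b(yes|no)\b", response.lower())
    return bool(match) and match.group(1) == "yes"
```

How unparseable responses are scored affects Recall and Yes-rate, so an actual evaluation would need to fix and report that rule.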
2. Dataset Composition and Scene Representation
3D-POPE leverages the validation split of the ScanNet dataset. Each test sample consists of a tuple (scene_id, object_class, answer), with semantic labels covering 200 fine-grained indoor classes (e.g., “chair,” “table,” “sink”). Object-centric 3D scene graphs are constructed for each scan; each object node carries its category together with geometric attributes, namely the centroid and the extent. The benchmark enforces a strict 1:1 ratio of positive (object present) and negative (object absent) cases, so accuracy is not skewed by class imbalance and any systematic preference for “yes” or “no” shows up directly as model bias (Ogunleye et al., 9 Apr 2026, Yang et al., 2024).
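One way to picture these records is the sketch below. The class and field names (`ObjectNode`, `ProbeSample`, `ground_truth`, `is_balanced`) are hypothetical and chosen for illustration; only the tuple fields and the category, centroid, and extent attributes come from the description above.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectNode:
    """One object in a scene graph: category plus geometry."""
    category: str
    centroid: Vec3  # object center (x, y, z)
    extent: Vec3    # object size along each axis


@dataclass(frozen=True)
class ProbeSample:
    """One benchmark item: (scene_id, object_class, answer)."""
    scene_id: str
    object_class: str
    answer: bool  # True if the object class is present in the scene


def ground_truth(nodes: Sequence[ObjectNode], object_class: str) -> bool:
    """An object class is present if any node in the graph carries it."""
    return any(node.category == object_class for node in nodes)


def is_balanced(samples: Sequence[ProbeSample]) -> bool:
    """Check the 1:1 positive-to-negative ratio."""
    positives = sum(1 for s in samples if s.answer)
    return 2 * positives == len(samples)
```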
3. Sampling Strategies and Difficulty Regimes
Negative (absent-object) sampling in 3D-POPE employs a tripartite regimen to modulate diagnostic difficulty:
- Random Sampling: The absent class is selected uniformly at random from those not present in the scene. This serves as a basic control.
- Popular Sampling: Absent classes are chosen from those most frequent in the training set, challenging a model’s tendency to rely on global object priors.
- Adversarial Sampling: For each class present in the scene, the absent class that is statistically most likely to co-occur with it is chosen. This probes for co-occurrence biases and pushes grounding to the limit.
These settings progressively increase the likelihood of hallucination, explicitly exposing frequency- and context-driven failure modes (Ogunleye et al., 9 Apr 2026, Yang et al., 2024).
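The three regimes can be sketched as follows. This is an illustrative reading of the descriptions above, not the benchmark's sampling code: the function names, the use of `collections.Counter` for frequency and co-occurrence statistics, and the tie-breaking order are all assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Set


def random_negatives(present: Set[str], vocabulary: Sequence[str],
                     k: int, rng: random.Random) -> List[str]:
    """Random: draw absent classes uniformly from the label vocabulary."""
    absent = sorted(set(vocabulary) - present)
    return rng.sample(absent, min(k, len(absent)))


def popular_negatives(present: Set[str], class_frequency: Counter,
                      k: int) -> List[str]:
    """Popular: take the globally most frequent classes that are absent."""
    ranked = [c for c, _ in class_frequency.most_common() if c not in present]
    return ranked[:k]


def adversarial_negatives(present: Set[str],
                          cooccurrence: Dict[str, Counter],
                          k: int) -> List[str]:
    """Adversarial: for each present class, pick the absent class that
    co-occurs with it most often (assumed: one pick per present class)."""
    picks: List[str] = []
    for cls in sorted(present):
        for candidate, _ in cooccurrence.get(cls, Counter()).most_common():
            if candidate not in present and candidate not in picks:
                picks.append(candidate)
                break
    return picks[:k]
```

Here `class_frequency` would hold per-class counts from the training set and `cooccurrence[a][b]` the number of scenes containing both `a` and `b`; both are hypothetical inputs that a real pipeline would need to compute first.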
4. Evaluation Metrics
Five complementary metrics characterize model performance for each sampling split:
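- Precision: $\text{TP} / (\text{TP} + \text{FP})$
- Recall: $\text{TP} / (\text{TP} + \text{FN})$
- F1-score: $2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$
- Accuracy: $(\text{TP} + \text{TN}) / (\text{TP} + \text{FP} + \text{TN} + \text{FN})$
- Yes-rate: Fraction of “yes” answers, i.e., $(\text{TP} + \text{FP}) / (\text{TP} + \text{FP} + \text{TN} + \text{FN})$
Here, TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives, respectively. Elevated Yes-rate, especially on adversarial negatives, directly signals hallucination bias. “Hallucination rate” is equivalently $\text{FP} / (\text{FP} + \text{TN})$, i.e., the fraction of over-affirmed absent classes (Yang et al., 2024).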
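As a concrete reference, the metrics can be computed from paired yes/no predictions and labels as in the sketch below. The function name `pope_metrics` and the convention of returning 0.0 for an empty denominator are assumptions, and the hallucination rate is taken here to be the fraction of absent-object probes answered “yes”.

```python
from typing import Dict, Sequence


def pope_metrics(predictions: Sequence[bool],
                 labels: Sequence[bool]) -> Dict[str, float]:
    """Compute existence-probing metrics from binary answers.

    predictions[i] is True when the model answered "yes";
    labels[i] is True when the object is actually present.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    tn = sum(1 for p, l in pairs if not p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    total = tp + fp + tn + fn

    def ratio(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, total),
        "yes_rate": ratio(tp + fp, total),
        "hallucination_rate": ratio(fp, fp + tn),
    }
```

For example, `pope_metrics([True, True, True, True], [True, True, False, False])` describes a model that answers “yes” to everything on a balanced set: recall is 1.0, but precision and accuracy are both 0.5.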
5. Baseline Methods and Quantitative Benchmarking
3D-POPE features a head-to-head comparison of 3D-grounded LLMs:
- 3D-LLM: Uses volumetric embeddings as LLM input.
- 3D-VisTA: Implements spatial-semantic alignment.
- LEO: Employs explicit chain-of-thought object grounding.
- 3D-VCD: Implements inference-time visual contrastive decoding over perturbed scene graphs (Ogunleye et al., 9 Apr 2026).
Performance across Random, Popular, and Adversarial splits is summarized below for key metrics:
| Model | Precision (Random) | F1 (Adversarial) | Accuracy (Popular) | Yes-rate (Adversarial) |
|---|---|---|---|---|
| 3D-LLM | 50.03 | 66.61 | 49.94 | 99.94 |
| 3D-VisTA | 50.12 | 51.15 | 49.49 | 52.99 |
| LEO | 51.95 | 59.78 | 47.27 | 80.45 |
| 3D-VCD | 62.16 | 67.32 | 54.00 | 87.82 |
3D-VCD achieves an improvement of up to 10 percentage points in precision and up to 18 points in accuracy across splits. On the adversarial set, the Yes-rate for 3D-LLM remains above 99%, indicative of near-deterministic over-affirmation, while 3D-VCD reduces this to 87.8%, demonstrating suppression of language-prior-driven responses, although the rate remains well above the 50% expected of an unbiased model on a balanced split (Ogunleye et al., 9 Apr 2026).
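A short worked check helps interpret the 3D-LLM row. Assuming a hypothetical model that answers “yes” to every probe on a balanced split of $N$ probes (the standard confusion-matrix definitions are used; $N$ is introduced only for this illustration):

```latex
\begin{aligned}
\text{TP} &= \tfrac{N}{2}, \quad \text{FP} = \tfrac{N}{2}, \quad \text{TN} = \text{FN} = 0 \\
\text{Precision} &= \frac{N/2}{N/2 + N/2} = 0.5 \\
\text{Recall} &= \frac{N/2}{N/2 + 0} = 1 \\
F_1 &= \frac{2 \cdot 0.5 \cdot 1}{0.5 + 1} = \tfrac{2}{3} \approx 0.667 \\
\text{Accuracy} &= \frac{N/2 + 0}{N} = 0.5
\end{aligned}
```

The values reported for 3D-LLM in the table (precision 50.03, F1 66.61, accuracy 49.94, Yes-rate 99.94) sit almost exactly on this always-yes profile, so its seemingly competitive F1 reflects blanket affirmation rather than grounding.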
6. Qualitative Insights and Diagnostic Analyses
Qualitative analysis reveals distinct pathological modes:
- Over-affirmation: High Yes-rate in adversarial settings where models affirm the existence of absent but contextually plausible objects.
- Under-grounding: Missed detections in cluttered or occluded environments, leading to false negatives.
- Contrastive decoding effectiveness: Perturbing scene graphs (e.g., semantic swaps, geometric corruption) at inference exposes and suppresses token outputs that are insensitive to true 3D evidence. For instance, 3D-LLM fails to detect a present “dining table,” an error that 3D-VCD corrects. Conversely, hallucinatory affirmations (e.g., of a “desk” that is not present) are filtered by contrastive comparison with the distorted graph (Ogunleye et al., 9 Apr 2026).
These analyses corroborate that robust grounding in 3D-LLMs depends critically on explicit scene-graph structure and contrastive interrogation rather than reliance on frequency- or co-occurrence-based priors.
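The contrastive idea can be illustrated on the two answer tokens. The source does not give 3D-VCD's exact scoring rule, so the sketch below assumes the formulation commonly used in 2D visual contrastive decoding, $(1 + \alpha)\,\ell_{\text{orig}} - \alpha\,\ell_{\text{pert}}$; the weight `alpha`, the function names, and the example logits are all hypothetical.

```python
import math
from typing import Dict


def contrastive_logits(original: Dict[str, float],
                       perturbed: Dict[str, float],
                       alpha: float = 1.0) -> Dict[str, float]:
    """Boost tokens whose score depends on the true scene graph and
    penalize tokens that score just as high on the perturbed graph.

    Assumed rule: (1 + alpha) * original - alpha * perturbed.
    """
    return {tok: (1.0 + alpha) * original[tok] - alpha * perturbed[tok]
            for tok in original}


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a small token set."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    norm = sum(exps.values())
    return {tok: e / norm for tok, e in exps.items()}


# Hypothetical logits for the probe "Is there a desk in this scene?"
# "yes" scores the same with or without the true graph (prior-driven),
# while "no" loses support once the graph is corrupted (evidence-driven).
original = {"yes": 2.0, "no": 1.5}
perturbed = {"yes": 2.0, "no": 0.5}
adjusted = softmax(contrastive_logits(original, perturbed, alpha=1.0))
```

In this toy case plain decoding would answer “yes”, whereas the adjusted distribution favors “no”, because only the “no” score changed when the scene evidence was corrupted.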
7. Implications, Scaling Laws, and Future Directions
The 3D-POPE benchmark has illuminated several core findings:
- Dense grounding data significantly diminishes hallucination rates: instruction tuning on large-scale synthetic 3D-text pairs yields >90% precision in the Random regime and retains ≈70% precision on the hardest adversarial negatives, even without training on real ScanNet data (Yang et al., 2024).
- Precision monotonically declines as negative sampling becomes more adversarial, revealing an open challenge in fine-grained co-occurrence reasoning.
- Data scaling experiments demonstrate a law-like decrease in hallucination rate with larger grounded corpora, suggesting a pathway for progressive reduction of error via dataset augmentation.
- Sim-to-real evaluation underscores the transferability of models trained on synthetic 3D-GRAND data to real-world scans with minimal dropoff in grounding reliability (Yang et al., 2024).
A plausible implication is that further scaling and diversification of densely annotated 3D data, coupled with inference-time grounding techniques exemplified by 3D-VCD, will continue to drive down hallucination and make embodied agents increasingly safe, interpretable, and robust for deployment in complex real-world spaces.