
3D-POPE: Benchmarking Object Hallucination in 3D-LLMs

Updated 11 May 2026
  • 3D-POPE is a benchmark that evaluates object hallucination in 3D-grounded LLMs by measuring how reliably model predictions align with actual object presence in real indoor scene graphs.
  • It employs a binary existence probing task with three negative sampling regimes (Random, Popular, and Adversarial) to diagnose model biases and over-affirmation.
  • The benchmark informs advancements in grounding techniques by comparing models like 3D-LLM and 3D-VCD, highlighting significant reductions in hallucination rates with contrastive decoding.

3D-POPE is a benchmark expressly developed to evaluate object hallucination in 3D-grounded LLMs (3D-LLMs) operating over real 3D scenes. Designed for systematic diagnosis and comparative analysis, 3D-POPE quantifies how reliably a model’s predictions about object presence are grounded in actual 3D scene evidence rather than spurious language or vision priors. The benchmark draws conceptual lineage from earlier 2D hallucination tests but is distinguished by its focus on object-centric 3D scene graphs derived from real indoor scans, fine-grained negative sampling strategies, and its role as a public leaderboard for grounding evaluation in embodied AI (Ogunleye et al., 9 Apr 2026, Yang et al., 2024).

1. Task Formulation and Objectives

3D-POPE frames hallucination assessment as a binary “existence probing” task over 3D scenes. For each probe, a model is presented with a 3D scene $S$ and an object category $c$, and must answer the templated query:

“Is there a [c] in this scene?”

The ground truth is determined by semantic labels derived from 3D reconstructions. The evaluation focuses strictly on object presence, eschewing open-ended or attribute-based reasoning. The principal objectives are: (i) to quantify the rate at which a 3D-LLM hallucinates objects not present (false positives), (ii) to compare models across defined negative sampling complexities, and (iii) to provide a reproducible, leaderboard-driven protocol for measuring grounding improvement in 3D-embodied agents (Yang et al., 2024).
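The probing step can be illustrated with a minimal sketch. The helper names `build_probe` and `parse_answer` are hypothetical, and the rule of reading the answer from the first word of the response is an assumption; the benchmark's actual answer-parsing rules are not described here.

```python
def build_probe(object_class: str) -> str:
    """Fill the templated existence query for one object category."""
    return f"Is there a {object_class} in this scene?"


def parse_answer(response: str) -> bool:
    """Map a free-form model response to a binary answer.

    Assumption: a response counts as "yes" only if it starts with "yes"
    (case-insensitive); anything else is treated as "no".
    """
    return response.strip().lower().startswith("yes")


if __name__ == "__main__":
    print(build_probe("chair"))
    print(parse_answer("Yes, there is a chair near the window."))
```

Treating every non-"yes" response as "no" is a conservative choice for a hallucination benchmark, since it never credits the model with an affirmation it did not clearly make.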

2. Dataset Composition and Scene Representation

3D-POPE uses the validation split of the ScanNet dataset. Each test sample is a tuple (scene_id, object_class, answer), with semantic labels covering 200 fine-grained indoor classes (e.g., “chair,” “table,” “sink”). An object-centric 3D scene graph $G = \{o_i = (C_i, a_i)\}_{i=1}^N$ is constructed for each scan, where $C_i$ is the object category and $a_i \in \mathbb{R}^6$ encodes geometric attributes: the centroid $p_i = (x, y, z)$ and extent $s_i = (w, h, d)$. The benchmark enforces a strict 1:1 ratio of positive (object present) to negative (object absent) cases, so accuracy is computed over balanced classes and any systematic bias toward “yes” or “no” is directly visible (Ogunleye et al., 9 Apr 2026, Yang et al., 2024).
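One way to picture the data layout is the sketch below, which models a scene graph as a list of objects with a category and a six-dimensional attribute vector, and a test sample as the (scene_id, object_class, answer) tuple. The class names, the `make_probe` helper, and the example scene identifier are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneObject:
    category: str                         # C_i
    centroid: Tuple[float, float, float]  # p_i = (x, y, z)
    extent: Tuple[float, float, float]    # s_i = (w, h, d)

    def attributes(self) -> Tuple[float, ...]:
        """Return a_i in R^6 as centroid followed by extent."""
        return self.centroid + self.extent


@dataclass(frozen=True)
class Probe:
    scene_id: str
    object_class: str
    answer: bool  # True if the class is present in the scene


def make_probe(scene_id: str, graph: List[SceneObject], object_class: str) -> Probe:
    """Derive the ground-truth answer from the categories in the graph."""
    present = {obj.category for obj in graph}
    return Probe(scene_id, object_class, object_class in present)


if __name__ == "__main__":
    # Hypothetical scene identifier and objects, for illustration only.
    graph = [
        SceneObject("chair", (1.0, 0.5, 0.4), (0.5, 0.5, 0.9)),
        SceneObject("table", (2.0, 1.0, 0.4), (1.2, 0.8, 0.75)),
    ]
    print(make_probe("scene_example", graph, "chair"))
    print(make_probe("scene_example", graph, "sink"))
```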

3. Sampling Strategies and Difficulty Regimes

Negative (absent-object) sampling in 3D-POPE uses three regimes of increasing diagnostic difficulty:

  • Random Sampling: The absent class is selected uniformly at random from those not present in the scene. This serves as a basic control.
  • Popular Sampling: Absent classes are chosen from those most frequent in the training set, challenging a model’s tendency to rely on global object priors.
  • Adversarial Sampling: For each present class $c^+$ in the scene, the absent class $c^-$ that is most statistically likely to co-occur with $c^+$ is chosen. This probes for co-occurrence biases and pushes grounding to the limit.

These settings progressively increase the likelihood of hallucination, explicitly exposing frequency- and context-driven failure modes (Ogunleye et al., 9 Apr 2026, Yang et al., 2024).
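The three regimes can be sketched as follows. This is a minimal illustration under stated assumptions: class frequency and co-occurrence are counted per scene over a reference corpus supplied by the caller, ties are broken alphabetically, and all function names are hypothetical. The benchmark's exact statistics and tie-breaking are not specified here.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Mapping, Set, Tuple


def random_negatives(present: Set[str], vocabulary: Iterable[str],
                     k: int, rng: random.Random) -> List[str]:
    """Uniformly sample up to k classes that are absent from the scene."""
    absent = sorted(set(vocabulary) - present)
    return rng.sample(absent, min(k, len(absent)))


def popular_negatives(present: Set[str], class_frequency: Mapping[str, int],
                      k: int) -> List[str]:
    """Pick the k most frequent classes that are absent from the scene."""
    absent = [c for c in class_frequency if c not in present]
    return sorted(absent, key=lambda c: (-class_frequency[c], c))[:k]


def cooccurrence_counts(scenes: Iterable[Iterable[str]]) -> Counter:
    """Count, for each ordered class pair, the scenes containing both."""
    counts: Counter = Counter()
    for classes in scenes:
        for a, b in combinations(sorted(set(classes)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def adversarial_negatives(present: Set[str], vocabulary: Iterable[str],
                          cooc: Mapping[Tuple[str, str], int]) -> Dict[str, str]:
    """For each present class, pick the absent class it co-occurs with most."""
    absent = sorted(set(vocabulary) - present)
    if not absent:
        return {}
    return {
        c_pos: max(absent, key=lambda c_neg: cooc[(c_pos, c_neg)])
        for c_pos in sorted(present)
    }


if __name__ == "__main__":
    corpus = [{"chair", "table", "sink"}, {"chair", "table"}, {"bed", "lamp"}]
    vocab = {"chair", "table", "sink", "bed", "lamp"}
    freq = Counter(c for scene in corpus for c in scene)
    cooc = cooccurrence_counts(corpus)
    present = {"chair"}
    print(random_negatives(present, vocab, 2, random.Random(0)))
    print(popular_negatives(present, freq, 2))
    print(adversarial_negatives(present, vocab, cooc))
```

In this toy corpus, "table" co-occurs with "chair" more often than any other absent class, so it becomes the adversarial negative for a scene that contains only a chair.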

4. Evaluation Metrics

Five complementary metrics characterize model performance for each sampling split:

  • Precision: $\frac{TP}{TP + FP}$
  • Recall: $\frac{TP}{TP + FN}$
  • F1-score: $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
  • Accuracy: $\frac{TP + TN}{TP + FP + TN + FN}$
  • Yes-rate: Fraction of “yes” answers, i.e., $\frac{TP + FP}{TP + FP + TN + FN}$

Here, $TP$, $FP$, $TN$, and $FN$ are true positives, false positives, true negatives, and false negatives, respectively. An elevated Yes-rate, especially on adversarial negatives, directly signals hallucination bias. “Hallucination rate” is equivalently $\frac{FP}{FP + TN}$, i.e., the fraction of over-affirmed absent classes (Yang et al., 2024).
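As a rough sketch, the metrics can be computed from paired binary predictions and labels using the standard confusion-matrix definitions. The function name `pope_metrics` is hypothetical, returning 0.0 for an empty denominator is a convention chosen here, and the hallucination rate is taken as FP / (FP + TN), which is an assumed reading of "the fraction of over-affirmed absent classes".

```python
from typing import Dict, Sequence


def pope_metrics(predictions: Sequence[bool], labels: Sequence[bool]) -> Dict[str, float]:
    """Compute precision, recall, F1, accuracy, yes-rate and hallucination rate."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    tn = sum(1 for p, y in pairs if not p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    total = tp + fp + tn + fn

    def ratio(num: int, den: int) -> float:
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1_den = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / f1_den if f1_den else 0.0,
        "accuracy": ratio(tp + tn, total),
        "yes_rate": ratio(tp + fp, total),
        # Assumed definition: share of absent-object probes answered "yes".
        "hallucination_rate": ratio(fp, fp + tn),
    }


if __name__ == "__main__":
    labels = [True, True, False, False]      # balanced 1:1 split
    always_yes = [True, True, True, True]    # a model that affirms everything
    print(pope_metrics(always_yes, labels))
```

On a balanced split, a model that always answers "yes" scores 0.5 precision and 0.5 accuracy with a yes-rate of 1.0, which is why the Yes-rate is reported alongside the usual classification metrics.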

5. Baseline Methods and Quantitative Benchmarking

3D-POPE supports head-to-head comparison of 3D-grounded LLMs. The table below summarizes selected metrics from the Random, Popular, and Adversarial splits:

Model      Precision (Random, %)   F1 (Adversarial, %)   Accuracy (Popular, %)   Yes-rate (Adversarial, %)
3D-LLM     50.03                   66.61                 49.94                   99.94
3D-VisTA   50.12                   51.15                 49.49                   52.99
LEO        51.95                   59.78                 47.27                   80.45
3D-VCD     62.16                   67.32                 54.00                   87.82

3D-VCD is reported to improve precision by up to 10 percentage points and accuracy by up to 18 points across splits. On the adversarial set, the Yes-rate of 3D-LLM remains above 99%, indicating that it answers “yes” to almost every probe regardless of scene content, while 3D-VCD reduces this to 87.8%, suggesting that contrastive decoding partially suppresses language-prior-driven responses (Ogunleye et al., 9 Apr 2026).
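To see what a near-100% Yes-rate implies, it helps to work through the limiting case of a model that answers “yes” to every probe on a 1:1 balanced split of $N$ samples. The derivation below uses only the standard confusion-matrix definitions of the metrics and is an illustration, not a result taken from the cited papers.

```latex
\begin{align*}
TP &= FP = \tfrac{N}{2}, \qquad TN = FN = 0 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{N/2}{N/2 + N/2} = 0.5 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{N/2}{N/2 + 0} = 1 \\
F_1 &= \frac{2 \cdot 0.5 \cdot 1}{0.5 + 1} = \tfrac{2}{3} \approx 0.667 \\
\text{Accuracy} &= \frac{TP + TN}{N} = 0.5, \qquad
\text{Yes-rate} = \frac{TP + FP}{N} = 1
\end{align*}
```

The figures reported for 3D-LLM (precision 50.03, F1 66.61, accuracy 49.94, Yes-rate 99.94) sit very close to these limits. Because each of those figures comes from a different split, the comparison is indicative only.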

6. Qualitative Insights and Diagnostic Analyses

Qualitative analysis reveals distinct pathological modes:

  • Over-affirmation: High Yes-rate in adversarial settings where models affirm the existence of absent but contextually plausible objects.
  • Under-grounding: Missed detections in cluttered or occluded environments, leading to false negatives.
  • Contrastive decoding effectiveness: Perturbing scene graphs (e.g., semantic swaps, geometric corruption) at inference exposes and suppresses token outputs insensitive to true 3D evidence. For instance, 3D-LLM fails to detect a present “dining table,” rectified by 3D-VCD. Conversely, hallucinatory affirmations (e.g., a “desk” not present) are filtered by contrastive comparison with the distorted graph (Ogunleye et al., 9 Apr 2026).
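The contrastive idea can be sketched in a few lines. The code below is an assumption-laden illustration: it borrows the generic contrastive-decoding rule, (1 + α)·logit(clean) − α·logit(distorted), from 2D visual contrastive decoding, and whether 3D-VCD uses exactly this form, these perturbations, or this α is not stated here. Scene graphs are simplified to (category, attributes) pairs, and every function name is hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Graph = List[Tuple[str, Tuple[float, ...]]]  # (category, 6-D attributes)


def semantic_swap(graph: Graph, vocabulary: Sequence[str], rng: random.Random) -> Graph:
    """Replace every category with a different, randomly chosen one."""
    swapped = []
    for category, attrs in graph:
        choices = [c for c in vocabulary if c != category]
        swapped.append((rng.choice(choices), attrs))
    return swapped


def geometric_corruption(graph: Graph, sigma: float, rng: random.Random) -> Graph:
    """Add Gaussian noise to each centroid/extent value."""
    return [
        (category, tuple(a + rng.gauss(0.0, sigma) for a in attrs))
        for category, attrs in graph
    ]


def contrastive_yes_probability(logits_clean: Dict[str, float],
                                logits_distorted: Dict[str, float],
                                alpha: float = 1.0) -> float:
    """Probability of "yes" after contrasting clean and distorted logits.

    Assumed rule: adjusted = (1 + alpha) * clean - alpha * distorted,
    followed by a softmax over the "yes" and "no" tokens.
    """
    adjusted = {
        tok: (1.0 + alpha) * logits_clean[tok] - alpha * logits_distorted[tok]
        for tok in ("yes", "no")
    }
    peak = max(adjusted.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in adjusted.items()}
    return exps["yes"] / sum(exps.values())


if __name__ == "__main__":
    # Made-up logits: "yes" stays strong even when the graph is distorted,
    # so the affirmation is driven by priors rather than scene evidence.
    prior_driven = contrastive_yes_probability(
        {"yes": 2.0, "no": 1.0}, {"yes": 2.5, "no": 0.5})
    # Made-up logits: "yes" collapses once the evidence is corrupted.
    evidence_driven = contrastive_yes_probability(
        {"yes": 3.0, "no": 0.0}, {"yes": 0.0, "no": 1.0})
    print(prior_driven, evidence_driven)
```

With these illustrative numbers, the prior-driven affirmation is pulled down to an even split between "yes" and "no", while the evidence-driven affirmation is reinforced, which is the qualitative behavior the contrastive comparison is meant to produce.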

These analyses corroborate that robust grounding in 3D-LLMs depends critically on explicit scene-graph structure and contrastive interrogation rather than reliance on frequency- or co-occurrence-based priors.

7. Implications, Scaling Laws, and Future Directions

The 3D-POPE benchmark has illuminated several core findings:

  • Dense grounding data significantly diminishes hallucination rates: instruction tuning on large-scale synthetic 3D-text pairs yields >90% precision in the Random regime and retains ≈70% precision on the hardest adversarial negatives, even without real ScanNet training (Yang et al., 2024).
  • Precision declines monotonically as negative sampling becomes more adversarial, revealing an open challenge in fine-grained co-occurrence reasoning.
  • Data scaling experiments demonstrate a law-like decrease in hallucination rate with larger grounded corpora, suggesting a pathway for progressively reducing error via dataset augmentation.
  • Sim-to-real evaluation underscores the transferability of models trained on synthetic 3D-GRAND data to real-world scans with minimal dropoff in grounding reliability (Yang et al., 2024).

A plausible implication is that further scaling and diversification of densely annotated 3D data, coupled with inference-time grounding techniques exemplified by 3D-VCD, will continue to drive down hallucination and make embodied agents increasingly safe, interpretable, and robust for deployment in complex real-world spaces.
