
DL3DV-Benchmark: 3D Vision Evaluation Suite

Updated 31 May 2026
  • DL3DV-Benchmark is a suite of academic benchmarks for assessing 3D vision tasks, including scene editing and visual attention evaluation.
  • DL3DV-Edit-Bench quantitatively tests text-driven 3D scene edits using multi-view, unposed images and metrics like CLIP t2i similarity, C-FID, and C-KID.
  • The original DL3DV-Benchmark (UBC3Deye) evaluates stereoscopic saliency with eye-tracking data, using sAUC, NSS, and KLD to rank visual attention models.

DL3DV-Benchmark refers to a suite of academic benchmarks derived from the DL3DV data infrastructure for evaluating vision models on large-scale real-world 3D content. Two distinct benchmarks operate under this naming lineage: (1) DL3DV-Edit-Bench, a protocol for quantitatively assessing 3D scene editing from sparse, unposed images; and (2) the original DL3DV-Benchmark (also known as the UBC3Deye dataset and 3D saliency benchmark), designed for evaluating 2D and 3D visual attention models on stereoscopic video with ground-truth human gaze annotations. Both serve as standard platforms for rigorous, reproducible comparison of algorithmic performance on challenging 3D-centric tasks (Liu et al., 31 Dec 2025, Banitalebi-Dehkordi et al., 2018).

1. Dataset Origin and High-Level Purpose

DL3DV-Edit-Bench is constructed atop the DL3DV-10K dataset (“DL3DV test split”), a large-scale real-world scene collection offering diverse indoor and outdoor environments with complex geometry, occlusions, and lighting variations (Ling et al., 2023, Liu et al., 31 Dec 2025). It enables quantitative and qualitative assessment of 3D neural editing methods that operate on sparse, view-inconsistent image inputs.

The original DL3DV-Benchmark (UBC3Deye) targets the evaluation of visual attention models (VAMs) on stereoscopic 3D video content. It provides a controlled, factorially-designed collection of eye-tracking data and video for developing and ranking saliency detection algorithms under naturalistic 3D viewing conditions (Banitalebi-Dehkordi et al., 2018).

2. Dataset Construction and Content

DL3DV-Edit-Bench

  • Scenes: 20 scenes sampled from the DL3DV test split, jointly spanning complex indoor and outdoor layouts, rich object compositions, and nontrivial depth structure.
  • Scene selection criteria: Random subset constrained to maximize diversity in environment, object count, and geometric complexity.
  • Edit taxonomy: Four text-driven edit categories:
    • Add: Local insertion of new entities (e.g., “add a cactus garden”).
    • Remove: Deletion of objects/regions (“remove the bench”).
    • Modify: Attribute/material/appearance changes (e.g., “make the door red”).
    • Global: Scene-wide style, lighting, or exposure transformations (“make the scene look foggy”).
  • Prompts per scene: 5, manually vetted, giving 100 unique editing instances overall (20 scenes × 5 prompts), with 25 edits in each of the four edit categories.
  • Supervision: For training, multi-view-consistent "fake" edited images generated via (a) SAM2 video-segmentation providing per-object, multi-view masks, and (b) region-wise recoloring with augmentations (ColorJitter, gamma, RGB permutation, grayscale). This ensures alignment across all views during edit prediction (Liu et al., 31 Dec 2025).
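The supervision step above depends on applying the same recoloring to an object in every view. A minimal sketch of that idea follows, assuming images are nested lists of RGB tuples in [0, 1] and that per-view boolean masks (such as those SAM2 would provide) are already available. The function names, the gamma range, and the grayscale probability are hypothetical choices for illustration, ColorJitter is omitted for brevity, and this is not the authors' pipeline.

```python
import random


def sample_recolor_params(rng):
    """Draw ONE set of recoloring parameters, to be reused for every view."""
    perm = [0, 1, 2]
    rng.shuffle(perm)
    return {
        "perm": perm,                     # RGB channel permutation
        "gamma": rng.uniform(0.5, 2.0),   # gamma exponent (assumed range)
        "grayscale": rng.random() < 0.2,  # grayscale switch (assumed probability)
    }


def recolor_pixel(rgb, params):
    """Recolor one RGB pixel with values in [0, 1]."""
    out = [rgb[c] for c in params["perm"]]
    out = [v ** params["gamma"] for v in out]
    if params["grayscale"]:
        mean = sum(out) / 3.0
        out = [mean, mean, mean]
    return tuple(out)


def recolor_views(views, masks, params):
    """Apply the same parameters inside the mask of every view.

    Pixels outside the mask are returned unchanged, so the "fake" edit
    stays local to the segmented region and consistent across views.
    """
    edited = []
    for image, mask in zip(views, masks):
        edited.append([
            [recolor_pixel(px, params) if inside else px
             for px, inside in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)
        ])
    return edited


if __name__ == "__main__":
    rng = random.Random(0)
    params = sample_recolor_params(rng)
    view = [[(0.2, 0.4, 0.6), (0.9, 0.9, 0.9)]]
    mask = [[True, False]]
    print(recolor_views([view, view], [mask, mask], params))
```

Sampling the parameters once per object, rather than once per image, is what keeps the recolored region aligned across views.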

Original DL3DV-Benchmark (UBC3Deye)

  • Video corpus: 61 stereoscopic (3D) video sequences (plus right-eye 2D versions), shot at Full HD (1920×1080), 30 fps, ≈10 s duration each.
  • Subjects: 24 human viewers (12M, 12F), free-viewing protocol; each subject watched both 2D and 3D versions in counter-balanced order.
  • Scene content: Covers a complete 2⁴ factorial design over four low-level factors (intensity, motion, depth, texture), each at a “low” or “high” level, for 16 conditions. High-level content: ~52% of sequences contain people, ~40% vehicles.
  • Eye-tracking: SMI iView X RED, 250 Hz, gaze accuracy ±0.4°; left-eye fixations are disparity-corrected to right view.
  • Fixation maps: For each frame, Gaussian-blurred aggregations of fixations normalized to model the foveal drop-off (σ = 60 px for a 1° visual angle), forming per-frame ground-truth Fixation Density Maps (FDMs) in both 2D and 3D modes (Banitalebi-Dehkordi et al., 2018).

3. Evaluation Protocols and Metrics

DL3DV-Edit-Bench

  • Inputs: Multi-view unposed images and text edit instruction.
  • Outputs: Edited novel-view renders.
  • Metrics:
    • Semantic alignment: CLIP text-to-image similarity:

    $$\mathrm{CLIP}_{t2i}(I,T) = \cos\bigl(\phi_{\rm text}(T),\, \phi_{\rm img}(I)\bigr)$$

    where $\phi_{\rm text}$ and $\phi_{\rm img}$ are the CLIP text and image encoders; higher = better semantic match to prompt.
    • Multi-view realism and consistency:
      • Scene-conditioned Fréchet Inception Distance (C-FID):

      $$\mathrm{C\text{-}FID} = \|\mu_{\rm real}-\mu_{\rm edit}\|^2 + \mathrm{Tr}\bigl(\Sigma_{\rm real}+\Sigma_{\rm edit} - 2(\Sigma_{\rm real}\Sigma_{\rm edit})^{1/2}\bigr)$$

      Lower = better.
      • Scene-conditioned Kernel Inception Distance (C-KID):

      $$\mathrm{C\text{-}KID} = \mathrm{MMD}^2\bigl(\{\phi_{\rm img}(I_{\rm real})\},\{\phi_{\rm img}(I_{\rm edit})\}\bigr)$$

      Lower = better.
    • Internal (not for benchmark): Chamfer-$L_1$ geometry loss between predicted Gaussian centers (multi-view alignment), not part of final scoring.
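Two of the metrics above reduce to short computations once embeddings are available. The sketch below assumes the text and image embeddings have already been extracted and are passed in as plain lists of floats, so no encoder is invoked. The cubic polynomial kernel and the unbiased MMD² estimator are the choices commonly used for KID and are assumptions here; the source does not state which kernel or estimator the benchmark uses, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (CLIP t2i score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3, common for KID."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def mmd2_unbiased(real, edit):
    """Unbiased squared MMD between two sets of feature vectors.

    Needs at least two vectors per set. The estimate can be slightly
    negative when both sets come from the same distribution.
    """
    m, n = len(real), len(edit)
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_ee = sum(poly_kernel(edit[i], edit[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_re = sum(poly_kernel(r, e) for r in real for e in edit) / (m * n)
    return k_rr + k_ee - 2.0 * k_re


if __name__ == "__main__":
    real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    shifted = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
    print(mmd2_unbiased(real, shifted))
```

A per-scene score would then be obtained by running these functions on the features of one scene's real and edited renders, which is one plausible reading of "scene-conditioned".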

Original DL3DV-Benchmark (UBC3Deye)

  • Training/validation split: 24 videos (training, with FDMs); 37 videos (validation, ground truth withheld).

  • Evaluation metrics:

    • Area under ROC (AUC) and shuffled AUC (sAUC): Performance of saliency map as a fixation classifier; sAUC penalizes center-bias.
    • Normalized Scanpath Saliency (NSS):

    $$\mathrm{NSS} = \frac{1}{N}\sum_{i=1}^{N} \frac{S_i-\mu_S}{\sigma_S}\, F_i$$

    where $S$ is the predicted saliency and $F$ the binary fixation map.
    • Kullback–Leibler divergence (KLD), correlation coefficient (CC), similarity (SIM), earth mover’s distance (EMD): Suite for comparing predicted saliency to FDM ground truth.

  • Overall ranking: Models are ranked by the mean of their rank positions under sAUC, NSS, and KLD on the validation set, with scores first averaged per frame and per video.
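As a rough illustration of two of the metrics listed above, the sketch below computes NSS as the mean z-scored saliency at fixated pixels and KLD between a ground-truth FDM and a predicted map. It assumes maps are nested lists of floats of equal size. Treating the FDM as the reference distribution in KLD and the small `eps` regularizer are assumptions, and the function names are hypothetical rather than taken from the benchmark's scoring code.

```python
import math


def nss(saliency, fixations):
    """Mean of the z-scored saliency values at fixated pixels.

    `fixations` is a binary map (truthy where a fixation landed).
    """
    values = [v for row in saliency for v in row]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:
        return 0.0  # a constant map carries no information
    hits = [(s - mu) / sigma
            for s_row, f_row in zip(saliency, fixations)
            for s, f in zip(s_row, f_row) if f]
    return sum(hits) / len(hits) if hits else 0.0


def _normalize(grid):
    """Flatten a non-negative map and scale it to sum to 1."""
    flat = [v for row in grid for v in row]
    total = sum(flat)
    return [v / total for v in flat]


def kld(fdm, saliency, eps=1e-12):
    """KL divergence of the predicted map from the ground-truth FDM."""
    p = _normalize(fdm)
    q = _normalize(saliency)
    return sum(pi * math.log(eps + pi / (qi + eps)) for pi, qi in zip(p, q))


if __name__ == "__main__":
    saliency = [[0.0, 0.0], [0.0, 1.0]]
    print(nss(saliency, [[0, 0], [0, 1]]))   # fixation on the salient pixel
    print(nss(saliency, [[1, 0], [0, 0]]))   # fixation on a non-salient pixel
    print(kld([[0.1, 0.2], [0.3, 0.4]], [[1.0, 1.0], [1.0, 1.0]]))
```

NSS is positive when fixations fall on above-average saliency and negative otherwise, while KLD is near zero only when the two normalized maps agree.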

| Metric       | Better when | Main use                     |
|--------------|-------------|------------------------------|
| sAUC, NSS    | Higher      | Saliency model accuracy      |
| KLD          | Lower       | Saliency map divergence      |
| C-FID, C-KID | Lower       | 3D edit realism/consistency  |
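The metric directions in the table matter when scores are combined into a mean-rank ordering such as the one used for the saliency leaderboard. The following is a minimal sketch of that aggregation, not the official scoring script. The model names and scores are made-up placeholders rather than benchmark results, the function name is hypothetical, and ties are simply broken by input order.

```python
# Direction of each ranking metric, as given in the table above.
HIGHER_IS_BETTER = {"sAUC": True, "NSS": True, "KLD": False}


def mean_rank(scores):
    """Rank models by the mean of their per-metric rank positions.

    `scores` maps model name -> {metric name -> value}.
    Returns a list of (model, mean_rank) pairs, best model first.
    """
    models = list(scores)
    positions = {m: [] for m in models}
    for metric, higher in HIGHER_IS_BETTER.items():
        ordered = sorted(models, key=lambda m: scores[m][metric], reverse=higher)
        for pos, model in enumerate(ordered, start=1):
            positions[model].append(pos)
    table = [(m, sum(p) / len(p)) for m, p in positions.items()]
    return sorted(table, key=lambda item: item[1])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo = {
        "model_a": {"sAUC": 0.70, "NSS": 1.2, "KLD": 0.20},
        "model_b": {"sAUC": 0.65, "NSS": 1.3, "KLD": 0.25},
        "model_c": {"sAUC": 0.60, "NSS": 1.0, "KLD": 0.30},
    }
    for name, rank in mean_rank(demo):
        print(name, round(rank, 2))
```

Ranking by position rather than by raw score avoids having to put sAUC, NSS, and KLD on a common scale.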

4. Baselines, Model Results, and Key Findings

DL3DV-Edit-Bench

  • Methods compared:

    • GaussCtrl: Optimization-based
    • EditSplat: Optimization-based
    • NoPoSplat: Feed-forward (reconstruction, not explicit editing)
    • Edit3r: Feed-forward (instruction-based 3D editing)
  • Per-view inference time: Edit3r (0.51 s), NoPoSplat (0.61 s), GaussCtrl (325.5 s), EditSplat (584.5 s)
  • Performance summary (100 edits mean):
    • $\mathrm{CLIP}_{t2i}$: Edit3r = 0.266 (best), NoPoSplat = 0.253, EditSplat = 0.241, GaussCtrl = 0.227
    • C-FID: GaussCtrl = 135.0 (best), Edit3r = 171.3, EditSplat = 174.1, NoPoSplat = 180.6
    • C-KID: GaussCtrl = 0.091 (best), Edit3r = 0.116, EditSplat = 0.122, NoPoSplat = 0.125
  • Summary: Edit3r achieves best semantic alignment and competitive realism/consistency, while being orders of magnitude faster (Liu et al., 31 Dec 2025).
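The "orders of magnitude faster" claim can be checked directly from the per-view inference times listed above. The short script below only divides those reported numbers; the helper name is hypothetical.

```python
# Per-view inference times in seconds, as reported in the list above.
PER_VIEW_SECONDS = {
    "Edit3r": 0.51,
    "NoPoSplat": 0.61,
    "GaussCtrl": 325.5,
    "EditSplat": 584.5,
}


def speedup_over(method, reference="Edit3r"):
    """How many times slower `method` is than `reference` per view."""
    return PER_VIEW_SECONDS[method] / PER_VIEW_SECONDS[reference]


if __name__ == "__main__":
    for name in ("NoPoSplat", "GaussCtrl", "EditSplat"):
        print(f"{name}: {speedup_over(name):.1f}x the per-view time of Edit3r")
```

Against the two optimization-based methods the ratio is roughly 640× (GaussCtrl) and 1,150× (EditSplat), that is, two to three orders of magnitude, while the feed-forward NoPoSplat is within about 20% of Edit3r.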

Original DL3DV-Benchmark

  • 3D models: Top performance by LBVS-3D (sAUC: 0.780, KLD: 0.129, NSS: 1.417).
  • 2D models: Marked performance drop on 3D saliency evaluation (sAUC ≈ 0.60–0.65).
  • Upper bound (“∞ humans”): sAUC ≈ 0.991, KLD ≈ 0.03, NSS ≈ 4.25.
  • Observations: High-level content modeling (faces, humans, vehicles) is advantageous; depth-augmented 2D models provide only minor improvements over vanilla 2D.
  • Headroom: Even the best current model (LBVS-3D, sAUC 0.780) reaches only about 79% of the “infinite human” sAUC of 0.991 (Banitalebi-Dehkordi et al., 2018).

5. Usage and Submission Guidelines

DL3DV-Edit-Bench

  • Standardized evaluation: Methods use only the provided multi-view images and manual text prompts for edits. Output is a set of synthetic, novel-view renderings, scored per the metrics above.
  • Training supervision: Must use SAM2-masked, recolored multi-view edited images for learning edit alignment; direct ground-truth for edited scenes is not available.

Original DL3DV-Benchmark (UBC3Deye)

  • Access: Publicly available at http://ece.ubc.ca/~dehkordi/saliency.html.
  • Workflow: Download training set, develop model, submit validation predictions or source code, receive automated scoring and leaderboard placement.
  • Disclosure: Participation implies contribution of saliency maps for research use.
  • Model development recommendations: Incorporate both low-level (e.g., depth, motion) and high-level (semantic) cues; use official pre-processing and train/validation split to ensure comparability.

6. Significance and Research Impact

DL3DV-Benchmark platforms have catalyzed progress in comparative evaluation of 3D-aware vision models, bridging the gap between algorithmic advances (e.g., feed-forward 3D editors, novel 3D VAMs) and rigorous, reproducible measurement on real-world, large-scale 3D data. DL3DV-Edit-Bench in particular provides a practical framework for 3D scene editing research, including edit fidelity, semantic alignment, and multi-view consistency evaluation, under the constraints of limited, unposed views (Liu et al., 31 Dec 2025). The original DL3DV-Benchmark/UBC3Deye dataset remains a canonical testbed for saliency research in stereoscopic video, defining clear upper and lower performance bounds and helping to dissect the utility of 3D cues in human attention prediction (Banitalebi-Dehkordi et al., 2018). Together, these resources constitute critical infrastructure for future work in scalable, generalizable 3D vision.
