DL3DV-Benchmark: 3D Vision Evaluation Suite
- DL3DV-Benchmark is a suite of academic benchmarks for assessing 3D vision tasks, including scene editing and visual attention evaluation.
- DL3DV-Edit-Bench quantitatively tests text-driven 3D scene edits using multi-view, unposed images and metrics like CLIP t2i similarity, C-FID, and C-KID.
- The original DL3DV-Benchmark (UBC3Deye) evaluates stereoscopic saliency with eye-tracking data, using sAUC, NSS, and KLD to rank visual attention models.
DL3DV-Benchmark refers to a suite of academic benchmarks derived from the DL3DV data infrastructure for evaluating vision models on large-scale real-world 3D content. Two distinct benchmarks operate under this naming lineage: (1) DL3DV-Edit-Bench, a protocol for quantitatively assessing 3D scene editing from sparse, unposed images; and (2) the original DL3DV-Benchmark (also known as the UBC3Deye dataset and 3D saliency benchmark), designed for evaluating 2D and 3D visual attention models on stereoscopic video with ground-truth human gaze annotations. Both serve as standard platforms for rigorous, reproducible comparison of algorithmic performance on challenging 3D-centric tasks (Liu et al., 31 Dec 2025, Banitalebi-Dehkordi et al., 2018).
1. Dataset Origin and High-Level Purpose
DL3DV-Edit-Bench is constructed atop the DL3DV-10K dataset (“DL3DV test split”), a large-scale real-world scene collection offering diverse indoor and outdoor environments with complex geometry, occlusions, and lighting variations (Ling et al., 2023, Liu et al., 31 Dec 2025). It enables quantitative and qualitative assessment of 3D neural editing methods that operate on sparse, view-inconsistent image inputs.
The original DL3DV-Benchmark (UBC3Deye) targets the evaluation of visual attention models (VAMs) on stereoscopic 3D video content. It provides a controlled, factorially-designed collection of eye-tracking data and video for developing and ranking saliency detection algorithms under naturalistic 3D viewing conditions (Banitalebi-Dehkordi et al., 2018).
2. Dataset Construction and Content
DL3DV-Edit-Bench
- Scenes: 20 sampled from the DL3DV test split, jointly spanning complex indoor/outdoor layouts, rich object compositions, and nontrivial depth structure.
- Scene selection criteria: Random subset constrained to maximize diversity in environment, object count, and geometric complexity.
- Edit taxonomy: Four text-driven edit categories:
  - Add: Local insertion of new entities (e.g., “add a cactus garden”).
  - Remove: Deletion of objects/regions (e.g., “remove the bench”).
  - Modify: Attribute/material/appearance changes (e.g., “make the door red”).
  - Global: Scene-wide style, lighting, or exposure transformations (e.g., “make the scene look foggy”).
- Prompts per scene: 5, manually vetted, giving 20 × 5 = 100 unique editing instances overall (25 per edit category).
- Supervision: For training, multi-view-consistent "fake" edited images generated via (a) SAM2 video-segmentation providing per-object, multi-view masks, and (b) region-wise recoloring with augmentations (ColorJitter, gamma, RGB permutation, grayscale). This ensures alignment across all views during edit prediction (Liu et al., 31 Dec 2025).
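The property that matters in the supervision step is that one set of recoloring parameters is shared by every view of the same object. The sketch below illustrates that idea with NumPy only. The function names, parameter ranges, and the brightness gain standing in for ColorJitter are illustrative assumptions, not the authors' pipeline, and the per-view masks are assumed to come from a segmenter such as SAM2.

```python
import numpy as np

def sample_recolor_params(rng):
    """Draw one set of recoloring parameters (ranges are illustrative assumptions)."""
    return {
        "gamma": rng.uniform(0.6, 1.6),        # gamma-curve exponent
        "perm": rng.permutation(3),            # RGB channel permutation
        "gray": bool(rng.random() < 0.2),      # convert the region to grayscale?
        "brightness": rng.uniform(0.8, 1.2),   # simple stand-in for ColorJitter
    }

def recolor_region(image, mask, params):
    """Recolor only the masked pixels of one view; image is float RGB in [0, 1]."""
    out = image.copy()
    region = image[mask]                       # (N, 3) pixels inside the mask
    region = region[:, params["perm"]]
    region = np.clip(region * params["brightness"], 0.0, 1.0) ** params["gamma"]
    if params["gray"]:
        region = np.repeat(region.mean(axis=1, keepdims=True), 3, axis=1)
    out[mask] = region
    return out

def make_fake_edit(views, masks, seed=0):
    """Apply the same recoloring to the same object in every view."""
    rng = np.random.default_rng(seed)
    params = sample_recolor_params(rng)        # sampled once, shared by all views
    return [recolor_region(v, m, params) for v, m in zip(views, masks)]
```

Because the parameters are drawn once per object rather than once per view, a pixel with a given color maps to the same edited color in all views, which is what makes the pseudo-edits multi-view consistent.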
Original DL3DV-Benchmark (UBC3Deye)
- Video corpus: 61 stereoscopic (3D) video sequences (plus right-eye 2D versions), shot at Full HD (1920×1080), 30 fps, ≈10 s duration each.
- Subjects: 24 human viewers (12M, 12F), free-viewing protocol; each subject watched both 2D and 3D versions in counter-balanced order.
- Scene content: Covers a complete 2⁴ factorial of four low-level factors (intensity, motion, depth, texture) (“low”/“high” for each). High-level content: ~52% contain people, ~40% vehicles.
- Eye-tracking: SMI iView X RED, 250 Hz, gaze accuracy ±0.4°; left-eye fixations are disparity-corrected to right view.
- Fixation maps: For each frame, Gaussian-blurred aggregations of fixations normalized to model the foveal drop-off (σ = 60 px for a 1° visual angle), forming per-frame ground-truth Fixation Density Maps (FDMs) in both 2D and 3D modes (Banitalebi-Dehkordi et al., 2018).
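The fixation-map construction can be sketched as summing an isotropic Gaussian at each fixation point. This is a minimal illustration, assuming fixations are given as (x, y) pixel coordinates and that the map is normalized to sum to 1; the function name and the normalization choice are assumptions, and only σ = 60 px comes from the description above.

```python
import numpy as np

def fixation_density_map(fixations, height, width, sigma=60.0):
    """Sum a Gaussian at each (x, y) fixation, then normalize the map to sum to 1."""
    ys = np.arange(height)[:, None]            # column vector of row indices
    xs = np.arange(width)[None, :]             # row vector of column indices
    fdm = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        fdm += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    total = fdm.sum()
    return fdm / total if total > 0 else fdm   # all-zero map if no fixations
```

For Full HD frames with many fixations, blurring a binary fixation image with a library Gaussian filter would be faster; the explicit loop is used here only to keep the sketch dependency-free.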
3. Evaluation Protocols and Metrics
DL3DV-Edit-Bench
- Inputs: Multi-view unposed images and text edit instruction.
- Outputs: Edited novel-view renders.
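- Metrics:
  - Semantic alignment: CLIP text-to-image (t2i) similarity, the cosine similarity between the CLIP embeddings of a rendered view $I$ and the edit prompt $T$:
    $$\mathrm{CLIP}_{\mathrm{t2i}}(I, T) = \frac{E_I(I) \cdot E_T(T)}{\lVert E_I(I) \rVert \, \lVert E_T(T) \rVert},$$
    where $E_I$, $E_T$ are the CLIP image and text encoders; higher = better semantic match to prompt.
  - Multi-view realism and consistency:
    - Scene-conditioned Fréchet Inception Distance (C-FID): lower = better.
    - Scene-conditioned Kernel Inception Distance (C-KID): lower = better.
  - Internal (not for benchmark): a Chamfer geometry loss between predicted Gaussian centers (multi-view alignment), not part of final scoring.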
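To make the CLIP t2i similarity and C-FID metrics concrete, here is a hedged sketch operating on precomputed feature vectors. It assumes the CLIP and Inception embeddings have already been extracted elsewhere, and it assumes one plausible reading of "scene-conditioned": the standard Fréchet distance is computed within each scene and then averaged. That reading, and all function names, are assumptions rather than the benchmark's confirmed definition; C-KID is not covered.

```python
import numpy as np

def clip_t2i_similarity(image_emb, text_emb):
    """Cosine similarity between one image embedding and one text embedding."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

def frechet_distance(feats_a, feats_b):
    """Standard Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) is the sum of square roots of the product's eigenvalues
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def scene_conditioned_fid(per_scene_feats):
    """ASSUMED reading of C-FID: Frechet distance per scene, averaged over scenes.

    per_scene_feats is a list of (reference_feats, edited_feats) pairs, one per scene.
    """
    return float(np.mean([frechet_distance(r, e) for r, e in per_scene_feats]))
```

The eigenvalue route avoids a matrix square root routine; it relies on the product of two covariance matrices having real, non-negative eigenvalues, so small negative values from round-off are clipped to zero.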
Original DL3DV-Benchmark (UBC3Deye)
Training/validation split: 24 videos (training, with FDMs); 37 videos (validation, ground truth withheld).
Evaluation metrics:
- Area under ROC (AUC) and shuffled AUC (sAUC): Performance of saliency map as a fixation classifier; sAUC penalizes center-bias.
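- Normalized Scanpath Saliency (NSS):
  $$\mathrm{NSS} = \frac{1}{\sum_{x} F(x)} \sum_{x} \frac{S(x) - \mu_S}{\sigma_S}\, F(x),$$
  where $S$ is the predicted saliency (with mean $\mu_S$ and standard deviation $\sigma_S$) and $F$ the binary fixation map.
- Kullback–Leibler divergence (KLD), correlation coefficient (CC), similarity (SIM), earth mover’s distance (EMD): Suite for comparing predicted saliency to FDM ground truth.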
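Two of the ranking metrics are easy to sketch directly. The code below assumes the common definitions: NSS as the mean z-scored saliency at fixated pixels, and KLD as the divergence of the ground-truth FDM from the prediction after both are normalized to sum to 1. The KL direction and the epsilon smoothing are assumptions; the benchmark's exact implementation may differ.

```python
import numpy as np

def nss(saliency, fixations):
    """Mean of the z-scored saliency map over the fixated pixels of a binary map."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return float(z[fixations.astype(bool)].mean())

def kld(pred, fdm, eps=1e-12):
    """KL divergence of the ground-truth FDM from the prediction (lower is better)."""
    p = pred / (pred.sum() + eps)              # predicted distribution
    q = fdm / (fdm.sum() + eps)                # ground-truth distribution
    return float(np.sum(q * np.log(eps + q / (p + eps))))
```

A prediction identical to the ground truth gives a KLD of approximately zero, while NSS grows as the prediction concentrates mass on the fixated pixels.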
Overall ranking: Models are ranked by their mean rank position across sAUC, NSS, and KLD on the validation set, with scores first averaged over frames and videos.
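The mean-position ranking can be sketched as follows. The model names and scores in the usage note are placeholders invented for illustration, not benchmark results, and ties are broken arbitrarily by sort order, which the benchmark may handle differently.

```python
def rank_models(scores):
    """Rank models by mean position over sAUC, NSS (higher better) and KLD (lower better).

    scores maps model name -> {"sAUC": float, "NSS": float, "KLD": float}.
    Returns a list of (model, mean_position) sorted from best to worst.
    """
    higher_is_better = {"sAUC": True, "NSS": True, "KLD": False}
    positions = {model: [] for model in scores}
    for metric, hib in higher_is_better.items():
        ordered = sorted(scores, key=lambda m: scores[m][metric], reverse=hib)
        for pos, model in enumerate(ordered, start=1):
            positions[model].append(pos)
    mean_pos = {m: sum(p) / len(p) for m, p in positions.items()}
    return sorted(mean_pos.items(), key=lambda kv: kv[1])
```

For example, with placeholder inputs `{"model_a": {"sAUC": 0.7, "NSS": 1.2, "KLD": 0.2}, "model_b": {"sAUC": 0.6, "NSS": 1.3, "KLD": 0.3}}`, `model_a` is placed first because it wins on two of the three metrics.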
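| Metric | Better when | Benchmark | Main use |
|---|---|---|---|
| sAUC, NSS | Higher | UBC3Deye | Saliency model accuracy |
| KLD | Lower | UBC3Deye | Saliency map divergence |
| CLIP t2i | Higher | DL3DV-Edit-Bench | Semantic alignment with the edit prompt |
| C-FID, C-KID | Lower | DL3DV-Edit-Bench | 3D edit realism and multi-view consistency |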
4. Baselines, Model Results, and Key Findings
DL3DV-Edit-Bench
Methods compared:
- GaussCtrl: Optimization-based
- EditSplat: Optimization-based
- NoPoSplat: Feed-forward (reconstruction, not explicit editing)
- Edit3r: Feed-forward (instruction-based 3D editing)
- Per-view inference time: Edit3r (0.51 s), NoPoSplat (0.61 s), GaussCtrl (325.5 s), EditSplat (584.5 s)
- Performance summary (100 edits mean):
- CLIP: Edit3r = 0.266 (best), NoPoSplat = 0.253, EditSplat = 0.241, GaussCtrl = 0.227
- C-FID: GaussCtrl = 135.0 (best), Edit3r = 171.3, EditSplat = 174.1, NoPoSplat = 180.6
- C-KID: GaussCtrl = 0.091 (best), Edit3r = 0.116, EditSplat = 0.122, NoPoSplat = 0.125
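- Summary: Edit3r achieves the best semantic alignment (CLIP) and the second-best C-FID and C-KID behind GaussCtrl, while running roughly three orders of magnitude faster per view than the optimization-based methods (0.51 s versus 325.5 s and 584.5 s) (Liu et al., 31 Dec 2025).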
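The speed comparison follows directly from the per-view inference times listed above. A short calculation makes the ratios explicit; the numbers are copied from the list, and only the variable names are new.

```python
# Per-view inference times in seconds, as reported above
times = {"Edit3r": 0.51, "NoPoSplat": 0.61, "GaussCtrl": 325.5, "EditSplat": 584.5}

# How many times slower each method is than Edit3r
slowdown = {name: t / times["Edit3r"] for name, t in times.items()}

for name, factor in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{name}: {factor:.1f}x the per-view time of Edit3r")
```

The two optimization-based methods come out at roughly 640x and 1150x the per-view time of Edit3r, while the feed-forward NoPoSplat is only about 1.2x.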
Original DL3DV-Benchmark
- 3D models: Top performance by LBVS-3D (sAUC: 0.780, KLD: 0.129, NSS: 1.417).
- 2D models: Marked performance drop on 3D saliency evaluation (sAUC ≈ 0.60–0.65).
- Upper bound (“∞ humans”): sAUC ≈ 0.991, KLD ≈ 0.03, NSS ≈ 4.25.
- Observations: High-level content modeling (faces, humans, vehicles) is advantageous; depth-augmented 2D models provide only minor improvements over vanilla 2D.
- Headroom: Even the best current model reaches an sAUC of only 0.780, roughly 79% of the 0.991 “infinite human” bound (Banitalebi-Dehkordi et al., 2018).
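The headroom can be checked from the figures reported above for LBVS-3D and the “∞ humans” bound. The snippet only divides those reported values; treating the raw ratio as the measure of headroom is a simplification, since it ignores each metric's chance level.

```python
# Reported scores: best 3D model (LBVS-3D) and the "infinite humans" upper bound
best = {"sAUC": 0.780, "NSS": 1.417}
bound = {"sAUC": 0.991, "NSS": 4.25}

# Fraction of the upper bound reached on each higher-is-better metric
fraction = {metric: best[metric] / bound[metric] for metric in best}

for metric, value in fraction.items():
    print(f"{metric}: {value:.3f} of the human upper bound")
```

The sAUC ratio is about 0.787, while the NSS ratio is only about one third, so the remaining gap looks considerably larger under NSS than under sAUC.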
5. Usage and Submission Guidelines
DL3DV-Edit-Bench
- Standardized evaluation: Methods use only the provided multi-view images and manual text prompts for edits. Output is a set of synthetic, novel-view renderings, scored per the metrics above.
- Training supervision: Must use SAM2-masked, recolored multi-view edited images for learning edit alignment; direct ground-truth for edited scenes is not available.
Original DL3DV-Benchmark (UBC3Deye)
- Access: Publicly available at http://ece.ubc.ca/~dehkordi/saliency.html.
- Workflow: Download training set, develop model, submit validation predictions or source code, receive automated scoring and leaderboard placement.
- Disclosure: Participation implies contribution of saliency maps for research use.
- Model development recommendations: Incorporate both low-level (e.g., depth, motion) and high-level (semantic) cues; use official pre-processing and train/validation split to ensure comparability.
6. Significance and Research Impact
DL3DV-Benchmark platforms have catalyzed progress in comparative evaluation of 3D-aware vision models, bridging the gap between algorithmic advances (e.g., feed-forward 3D editors, novel 3D VAMs) and rigorous, reproducible measurement on real-world, large-scale 3D data. DL3DV-Edit-Bench in particular provides a practical framework for 3D scene editing research, including edit fidelity, semantic alignment, and multi-view consistency evaluation, under the constraints of limited, unposed views (Liu et al., 31 Dec 2025). The original DL3DV-Benchmark/UBC3Deye dataset remains a canonical testbed for saliency research in stereoscopic video, defining clear upper and lower performance bounds and helping to dissect the utility of 3D cues in human attention prediction (Banitalebi-Dehkordi et al., 2018). Together, these resources constitute critical infrastructure for future work in scalable, generalizable 3D vision.