SysCON3D Benchmark
- SysCON3D Benchmark is a systematically controlled evaluation suite that tests multiview 3D consistency metrics in geometric foundation models.
- It introduces controlled corruptions, such as cross-scene mixtures and Gaussian noise, to differentiate truly consistent views from intentionally flawed ones.
- It combines classical COLMAP-based methods with neural strategies to quantify robustness via metrics like win rate, Cohen’s d, and Kendall’s τ.
SysCON3D is a systematically controlled benchmark for evaluating the robustness and reliability of multiview 3D consistency metrics, particularly in the context of end-to-end geometric foundation models (GFMs). It is designed to probe whether ground-truth-free 3D-consistency metrics can distinguish between truly consistent view sets—arising from one static 3D scene—and intentionally corrupted or mixed sets that should be rejected. SysCON3D also defines a suite of critical spatial-intelligence tasks and provides an automated, open-source toolkit for reproducible evaluation of 3D reconstruction, perception, and reasoning in both standard and challenging settings (Paul et al., 18 May 2026, Cong et al., 2 Jun 2025).
1. Benchmark Rationale and Scope
SysCON3D addresses a central reliability gap in current 3D consistency evaluation. Established neural metrics (e.g., MEt3R), designed to be reference-free and operate without ground truth, are vulnerable to failure cases where inputs contain artifacts, repeated frames, outlier images, or synthetic noise. These metrics can assign artificially high consistency scores to inconsistent or even nonsensical view sets due to their dependence on learned neural backbones, which can hallucinate plausible geometry on arbitrary input collections.
To provide systematic diagnosis, SysCON3D injects controlled inconsistencies of increasing severity—cross-scene mixtures, outlier frames, repeated views, and noise corruptions—into multi-view image sets. It then tests whether a given metric can correctly assign higher inconsistency scores to corrupted data than to clean samples. The benchmark systematically varies the number of views and forms challenging evaluation settings beyond real-scene-only benchmarks, thus revealing reliability failures not previously observable (Paul et al., 18 May 2026).
2. Construction and Experimental Design
SysCON3D is implemented using training viewpoints from 9 Mip-NeRF360 scenes. For each target view count, it constructs 9 clean (consistent) samples and 32 systematically corrupted samples:
- Cross-Scene Mixtures (SysCON3D-M):
  - Single outlier: all views but one come from a single scene, and the remaining view comes from a foreign scene.
  - Controlled mixture: 30% of views come from foreign scenes.
  - Full random mixture: all views are sampled i.i.d. from all scenes.
- Noise Corruptions (SysCON3D-N):
  - Patched Gaussian: four local patches per image are filled with pixelwise noise.
  - Full Gaussian: every pixel is replaced by independent Gaussian noise, with values clipped to the valid range.
All data are constructed using fixed random seeds to ensure reproducibility. For each configuration, multiple metrics and ablations are evaluated and compared against clean baselines (Paul et al., 18 May 2026).
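The construction above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's own code: scenes are represented as a dict mapping a scene name to a list of view identifiers, images are nested lists of floats, and the function names, the noise parameters (`mu`, `sigma`), and the [0, 1] clipping range are all hypothetical choices. The patched-noise variant is omitted for brevity.

```python
import random


def single_outlier(scenes, base, n_views, rng):
    """n_views - 1 views from the base scene plus one view from a foreign scene."""
    foreign = rng.choice([s for s in scenes if s != base])
    views = [(base, v) for v in rng.sample(scenes[base], n_views - 1)]
    views.append((foreign, rng.choice(scenes[foreign])))
    return views


def controlled_mixture(scenes, base, n_views, rng, foreign_frac=0.3):
    """Draw a fixed fraction of the views (30% by default) from foreign scenes."""
    n_foreign = max(1, round(foreign_frac * n_views))
    views = [(base, v) for v in rng.sample(scenes[base], n_views - n_foreign)]
    others = [s for s in scenes if s != base]
    for _ in range(n_foreign):
        s = rng.choice(others)
        views.append((s, rng.choice(scenes[s])))
    return views


def full_random_mixture(scenes, n_views, rng):
    """Every view drawn independently and uniformly from the pool of all scenes."""
    pool = [(s, v) for s, vs in scenes.items() for v in vs]
    return [rng.choice(pool) for _ in range(n_views)]


def full_gaussian(image, rng, mu=0.5, sigma=0.5):
    """Replace every pixel with independent Gaussian noise, clipped to [0, 1].

    mu, sigma and the clipping range are illustrative assumptions.
    """
    return [[min(1.0, max(0.0, rng.gauss(mu, sigma))) for _ in row] for row in image]


# A fixed seed makes the sampled view sets reproducible.
rng = random.Random(0)
scenes = {f"scene{i}": [f"view{j}" for j in range(20)] for i in range(9)}
mixture = controlled_mixture(scenes, "scene0", 10, rng)
```

Passing an explicit `random.Random` instance, rather than relying on the module-level generator, keeps each configuration's sampling independent of whatever else the program draws.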
3. Metric Families and Evaluation Protocols
SysCON3D supports a rich suite of evaluation metrics, each falling under one of two primary evaluation paradigms:
- Neural Ground-Truth-Free Metrics: Decomposed as backbone–residual–aggregation triplets:
  - Backbones: 3D reconstruction models such as MASt3R, DUSt3R, VGGT, and Fast3R, which produce point clouds and camera parameters.
  - Residuals: Feature-disagreement functions, either warp-based (tracking feature consistency along correspondences) or point-consistency (summarizing across all visible views).
  - Aggregation: Reduction of the residual distribution to a scalar. The options tested in SysCON3D include the mean (the MEt3R baseline), maximum mean discrepancy (MMD, IMQ kernel), and energy distance.
These combinations generate an ablation family and identify variants that are markedly more robust than MEt3R, especially with IMQ or Energy aggregation (win rate rising from 23% to 71%, roughly a threefold increase).
- Classical Geometric Verification (COLMAP-Based): Scene-level consistency is scored using structure-from-motion (SfM) and multi-view stereo (MVS) pipelines:
  - Registration rate: Fraction of images registered in SfM.
  - Per-pixel depth agreement, geometric–photometric consistency (GPC): Validates depth alignment.
  - Integrated consistency mass (ICM), coverage-weighted GPC (W-GPC): Aggregate consistency metrics that penalize registration or densification failure.
COLMAP-based metrics are inherently failure-aware, registering low or zero scores on unresolvable or highly inconsistent view sets (Paul et al., 18 May 2026).
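To make the aggregation step of the neural metrics concrete, here is a hedged sketch of the three aggregation options on one-dimensional residual samples. The source does not say which two distributions the MMD and energy distance compare, so the reference sample `ys` (for instance, residuals from a known-clean set) is an assumption, as are the function names and the IMQ scale `c`.

```python
import math


def mean_agg(residuals):
    """Baseline aggregation: the plain mean of the residuals."""
    return sum(residuals) / len(residuals)


def _mean_pairwise(xs, ys, fn):
    """Average of fn(x, y) over all pairs (V-statistic, includes x == y pairs)."""
    return sum(fn(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'| for 1D samples."""
    dist = lambda a, b: abs(a - b)
    return (2 * _mean_pairwise(xs, ys, dist)
            - _mean_pairwise(xs, xs, dist)
            - _mean_pairwise(ys, ys, dist))


def mmd2_imq(xs, ys, c=1.0):
    """Biased squared MMD with the IMQ kernel k(a, b) = 1 / sqrt(c^2 + (a - b)^2)."""
    k = lambda a, b: 1.0 / math.sqrt(c * c + (a - b) ** 2)
    return (_mean_pairwise(xs, xs, k)
            + _mean_pairwise(ys, ys, k)
            - 2 * _mean_pairwise(xs, ys, k))


# Two residual samples with the same mean but different spread.
flat = [0.5, 0.5, 0.5, 0.5]
spread = [0.0, 1.0, 0.0, 1.0]
```

For `flat` and `spread`, `mean_agg` returns the same value for both samples, while `energy_distance(flat, spread)` and `mmd2_imq(flat, spread)` are strictly positive. This is one way to see why a distribution-level aggregate can separate residual sets that a mean cannot.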
4. Robustness and Human Alignment
SysCON3D quantifies metric robustness using the Cohen’s d effect size between clean and corrupted groups, the “win rate” (the proportion of comparisons in which corrupted data receives a higher inconsistency score than clean data), and order concordance with corruption severity (Kendall’s τ, probabilistic pairwise concordance). Empirically:
- MEt3R: Only a 23% overall win rate; it often fails to separate corrupted from clean groups, with negative Cohen’s d on strong corruptions.
- MASt3R-W-IMQ: Replacing mean aggregation with IMQ aggregation achieves a 71% overall win rate, with perfect separation on larger cross-scene mixtures and noise.
- COLMAP W-GPC, ICM: Achieve a 90% or 100% win rate across all corruptions; they reliably register failure when presented with inconsistent inputs.
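The three robustness statistics can be sketched in plain Python. The exact definitions used by the benchmark are not given here, so this sketch assumes a pooled-standard-deviation Cohen's d, a win rate counted over all (clean, corrupted) score pairs, and the tau-a variant of Kendall's τ; the function names are illustrative, and higher scores are taken to mean more inconsistency.

```python
import math
from statistics import mean, variance


def cohens_d(clean, corrupted):
    """Effect size with pooled standard deviation.

    Positive when corrupted sets score as more inconsistent than clean ones.
    """
    n1, n2 = len(clean), len(corrupted)
    pooled_var = ((n1 - 1) * variance(clean)
                  + (n2 - 1) * variance(corrupted)) / (n1 + n2 - 2)
    return (mean(corrupted) - mean(clean)) / math.sqrt(pooled_var)


def win_rate(clean, corrupted):
    """Fraction of (clean, corrupted) pairs where the corrupted score is higher."""
    wins = sum(1 for c in clean for k in corrupted if k > c)
    return wins / (len(clean) * len(corrupted))


def kendall_tau(severity, scores):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(severity)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (severity[i] - severity[j]) * (scores[i] - scores[j])
            total += (prod > 0) - (prod < 0)
    return total / (n * (n - 1) / 2)


# Hypothetical inconsistency scores, for illustration only.
clean_scores = [0.1, 0.2, 0.3]
corrupted_scores = [0.4, 0.5, 0.6]
```

A negative `cohens_d` under this sign convention means the metric rated corrupted sets as more consistent than clean ones, which is the failure pattern reported for MEt3R on strong corruptions.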
A structured human study confirms these trends. On DL3DV and Mip-NeRF360 scenes, COLMAP W-GPC achieves a higher Spearman rank correlation with human preferences than MEt3R. The best neural variant, MASt3R-W-IMQ, narrows the gap but remains below classical verification.
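For reference, the Spearman rank correlation between a metric's scores and human preference scores can be computed as the Pearson correlation of the two rank vectors. The sketch below is a generic stdlib implementation with average ranks for ties; it is not taken from the benchmark's toolkit, and the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Pearson correlation of the rank vectors of a and b."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5
```

Because only ranks enter the computation, the result is unchanged by any monotone rescaling of either score, which makes it suitable for comparing metrics that live on different scales.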
5. Diagnostic Insights and Failure Modes
SysCON3D reveals that contemporary neural 3D backbones (VGGT, MASt3R, DUSt3R, Fast3R) can hallucinate dense geometry, plausible point cloud overlap, and low residual feature error across spurious, repeated, or completely unrelated views, including pure Gaussian noise. For example, DUSt3R returns virtually the same geometric support for 70% of pairs in full random mixtures, a condition under which proper consistency should be zero.
No aggregation technique alone can remedy the susceptibility of neural backbones to such hallucinations; failure-aware classical methods are required for reliable, human-aligned evaluation (Paul et al., 18 May 2026).
6. Spatial-Intelligence Task Suite and Benchmark Toolkit
SysCON3D encompasses five core 3D spatial-intelligence tasks for modern geometric foundation models, with standardized datasets, preprocessing, and protocols (Cong et al., 2 Jun 2025):
| Task | Input/Output | Primary Metric(s) |
|---|---|---|
| Sparse-view depth estimation | 2–5 views → depth maps | AbsRel, δ threshold accuracy |
| Video depth estimation | 10–200 frames → coherent depths | AbsRel, δ threshold accuracy |
| Multi-view 3D reconstruction | 2–50 images → point cloud | Accuracy, Completeness, NC |
| Multi-view relative pose estimation | N images → Sim(3) poses | ATE, RPE |
| Novel-view synthesis | 2 images → target view | PSNR, SSIM, LPIPS |
Datasets span diverse domains (DTU, ScanNet, KITTI, ACID, Syndrone, etc.) and include both in-distribution and out-of-distribution settings. The toolkit automates scene selection, frame sampling, Sim(3) alignment, and metric computation—ensuring rigorous, comparable evaluation.
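As an illustration of the metric-computation step for the depth tasks, the following sketch computes the absolute relative error (AbsRel) and a threshold-accuracy score of the kind commonly reported alongside it. The 1.25 threshold is a widely used default and an assumption here, not a value confirmed by the source. Depth maps are flattened lists of floats, and any scale or Sim(3) alignment performed by the toolkit is not reproduced.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred / gt, gt / pred) below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Hypothetical flattened depth values in metres, for illustration only.
pred_depth = [1.1, 2.0, 3.0]
gt_depth = [1.0, 2.0, 2.0]
```

Lower AbsRel and higher threshold accuracy both indicate better depth predictions; pixels without valid ground truth are skipped rather than counted as errors.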
7. Benchmark Findings, Limitations, and Recommendations
SysCON3D reveals several structural insights:
- Neural metrics are intrinsically vulnerable to generating artifactual 3D consistency in the presence of noise, repetition, or unrelated views.
- Choice of residual aggregation (IMQ, Energy distance) improves discrimination (win rate rising from 23% to 71%, roughly threefold), but cannot fully overcome backbone hallucination.
- Classical geometric verification is most reliable; W-GPC and ICM track human judgment robustly and signal registration or densification failure by dropping scores to zero.
- No single GFM backbone is universally optimal; feed-forward ViTs (VGGT, Fast3R) combine moderate efficiency and performance, while global-alignment and diffusion models offer tradeoffs in accuracy and runtime.
- Real-time 3D spatial intelligence remains out of reach due to high computational and memory demands for large-scale inputs.
The benchmark recommends reporting COLMAP W-GPC, ICM, and registration rate as standard metrics, resorting to neural metrics only when classical registration is unattainable and with the caveat of possible hallucinations. Future metric development should integrate learned priors with explicit verification or calibrated uncertainty, particularly for uncurated or adversarial data (Paul et al., 18 May 2026, Cong et al., 2 Jun 2025).
References
- "Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate" (Paul et al., 18 May 2026)
- "E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models" (Cong et al., 2 Jun 2025)