
Asymmetric Dual 3D Gaussian Splatting (3DGS)

Last updated: June 10, 2025

Background and Motivation

3D scene reconstruction from casually captured, "in-the-wild" image collections is impeded by visual artifacts arising from transient distractors, variable lighting, and occlusions. Although 3D Gaussian Splatting (3DGS) has advanced the quality and efficiency of neural reconstruction, existing approaches struggle to suppress artifacts in uncontrolled settings, particularly when such artifacts are stochastic, appearing inconsistently due to randomness in data or training runs. While traditional methods rely on heuristic masking or loss engineering to filter distractors, these solutions are either partial or fragile because the artifacts themselves are unpredictable and may evade consistent suppression across runs (Li et al., 4 Jun 2025).

Framework Overview

Asymmetric Dual 3D Gaussian Splatting (Asymmetric Dual 3DGS) introduces a structured approach designed to exploit the empirical variability of artifacts for improved robustness. Its architecture consists of:

  • Dual Model Training: Two 3DGS models, $\mathbb{G}_1$ and $\mathbb{G}_2$, are trained in parallel on the same scene, but each with a different masking strategy applied to the input.
  • Consistency Constraint: A mutual consistency loss regularizes the two models to agree on their reconstructions of the static scene content, using an $L_1$ distance between their predicted images:

$\mathcal{L}_{m1} = \left\|\hat{\mathbf{I}}_1^{\mathbb{G}_2} - \hat{\mathbf{I}}_1^{\mathbb{G}_1}\right\|_1, \qquad \mathcal{L}_{m2} = \left\|\hat{\mathbf{I}}_2^{\mathbb{G}_1} - \hat{\mathbf{I}}_2^{\mathbb{G}_2}\right\|_1$

(Equation 4) (Li et al., 4 Jun 2025).

  • Divergent (Asymmetric) Masking: Each model receives distinct mask supervision to prevent convergence to similar, potentially erroneous solutions (confirmation bias):
    • $\mathbb{G}_1$ is supervised by a multi-cue adaptive hard mask ($\mathbf{M}_h$), created from the intersection of semantic segmentation (using SAM and Semantic SAM), stereo correspondence (COLMAP), and appearance residual cues (DINOv2).
    • $\mathbb{G}_2$ uses a self-supervised soft mask ($\mathbf{M}_s$), a learned, continuous per-pixel mask optimized to downweight ambiguous or minor distractors (Equation 6).
  • Dynamic EMA Proxy: To improve computational efficiency, $\mathbb{G}_2$ may be replaced with an exponential moving average (EMA) proxy of $\mathbb{G}_1$. Alternating the masking strategy during training preserves diversity in the optimization trajectories (Equation 8).

The central insight is that since stochastic artifacts differ across training runs, penalizing discrepancies between the outputs of independently masked models selectively suppresses those artifacts: only scene content consistently reconstructed by both models survives the consensus.
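The mutual consistency terms above can be sketched in a few lines. This is a minimal illustration assuming the two models' renders are plain NumPy arrays; the function name and the use of a mean (rather than a sum) for the $L_1$ distance are illustrative choices, not taken from the paper's code:

```python
import numpy as np

def mutual_consistency_loss(render_a, render_b):
    """Mean absolute (L1) difference between two models' renders of the
    same view, as in the mutual consistency terms of Equation 4.

    render_a, render_b: (H, W, 3) arrays predicted by the two 3DGS models.
    """
    return np.abs(render_a - render_b).mean()

# Identical renders incur no penalty; disagreement (e.g. a transient
# artifact present in only one model's output) is penalized.
clean = np.zeros((4, 4, 3))
artifact = clean.copy()
artifact[0, 0] = 1.0  # a stray bright pixel in one model's render
print(mutual_consistency_loss(clean, clean))     # 0.0
print(mutual_consistency_loss(clean, artifact))  # 0.0625
```

Because the penalty grows only where the two renders disagree, content that both models reconstruct identically passes through unaffected.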

Methodological Details

Model Training and Loss Structure

The full objective for the dual-model system is:

$\mathcal{L} = \mathcal{L}_{r1}^{\mathbf{M}_h} + \mathcal{L}_{r2}^{\mathbf{M}_s} + \lambda_m (\mathcal{L}_{m1} + \mathcal{L}_{m2}) + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}}$

(Equation 7).
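As a sanity check on how the terms combine, Equation 7 can be written as a plain function; the weight values below are placeholders, not the paper's settings:

```python
def dual_objective(l_r1, l_r2, l_m1, l_m2, l_mask, lam_m=0.5, lam_mask=0.1):
    """Total loss of Equation 7: the masked reconstruction term of each
    model, the two mutual consistency terms scaled by lam_m, and the
    soft-mask regularizer scaled by lam_mask (weights are illustrative)."""
    return l_r1 + l_r2 + lam_m * (l_m1 + l_m2) + lam_mask * l_mask

print(dual_objective(1.0, 1.0, 0.2, 0.2, 0.5))  # ~2.25
```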

For the efficient EMA-proxy setup, the loss is:

$\mathcal{L} = \mathcal{L}_{r1}^{\mathbf{M}_{h/s}} + \lambda_m \mathcal{L}_{me} + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}}$

(Equation 9), where $\mathcal{L}_{me}$ is the consistency loss between $\mathbb{G}_1$ and its EMA proxy, and the mask alternates between $\mathbf{M}_h$ and $\mathbf{M}_s$ per training iteration.
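A hedged sketch of the EMA-proxy variant: the proxy's parameters track an exponential moving average of the live model, and the supervision mask alternates each iteration. The parameter layout, decay value, and function names are assumptions made for illustration:

```python
import numpy as np

def ema_update(proxy, live, decay=0.999):
    """Blend the live model's parameters into the EMA proxy.

    proxy, live: dicts mapping parameter names (e.g. Gaussian means,
    opacities) to arrays of the same shape. decay is illustrative.
    """
    return {k: decay * proxy[k] + (1.0 - decay) * live[k] for k in live}

def pick_mask(step, hard_mask, soft_mask):
    """Alternate between hard and soft supervision per training iteration."""
    return hard_mask if step % 2 == 0 else soft_mask

proxy = {"means": np.zeros(3)}
live = {"means": np.ones(3)}
proxy = ema_update(proxy, live)
print(proxy["means"])  # ~[0.001 0.001 0.001]
```

The slow-moving proxy smooths over run-specific fluctuations in the live model, which is what lets it stand in for an independently trained second model.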

Key mechanisms include:

  • Cross-Model Consistency: Agreement is enforced only after a brief warmup period, once scene geometry has stabilized enough to avoid amplifying noise in early predictions.
  • Divergent Inductive Biases: The combination of a deterministic, semantic-cue hard mask and an adaptive, learned soft mask gives the two models decorrelated error patterns (Li et al., 4 Jun 2025).
  • Dynamic Gaussian Set Management: The EMA proxy adapts to changes in the active model's set of Gaussian primitives, handling splitting, pruning, and cloning to maintain correspondence, which is crucial for consistent and efficient learning (Li et al., 4 Jun 2025).
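The set-management point can be made concrete with a toy sketch, under the assumption that each model stores its parameters as one (N, D) array: when the live model prunes or clones Gaussians, the EMA buffer must be reindexed to match, with new Gaussians seeding their EMA entries from the live values since they have no averaging history yet:

```python
import numpy as np

def sync_ema_with_live_set(ema, keep_idx, new_live_rows):
    """Mirror a densification step in the EMA buffer.

    ema: (N, D) EMA parameter array before the step.
    keep_idx: indices of Gaussians that survive pruning.
    new_live_rows: (K, D) parameters of freshly cloned/split Gaussians,
        copied verbatim as their initial EMA state.
    """
    return np.concatenate([ema[keep_idx], new_live_rows], axis=0)

ema = np.arange(8.0).reshape(4, 2)   # four Gaussians, two params each
keep = np.array([0, 2, 3])           # Gaussian 1 was pruned
cloned = np.array([[9.0, 9.0]])      # one new Gaussian from cloning
ema = sync_ema_with_live_set(ema, keep, cloned)
print(ema.shape)  # (4, 2)
```

Without this reindexing, the EMA average would blend parameters of unrelated Gaussians after any pruning or splitting step.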

Masking Details

  • Hard Mask ($\mathbf{M}_h$): Constructed per image using semantic segmentation, matching with stereo correspondences to identify static regions, and removing transient or occluded segments based on their residuals and match density.
  • Soft Mask ($\mathbf{M}_s$): Learned by regressing to the feature similarity (cosine distance in DINOv2 space) between predicted and ground-truth images, supporting dynamic adaptation as the models improve (Equation 6).
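The soft-mask idea can be sketched as follows, assuming per-pixel feature maps are available; this is a simplified stand-in for Equation 6, and the actual regression target or any scaling in the paper may differ:

```python
import numpy as np

def soft_mask_from_features(pred_feats, gt_feats, eps=1e-8):
    """Per-pixel soft mask from cosine similarity of feature maps.

    pred_feats, gt_feats: (H, W, C) features (e.g. DINOv2) of the rendered
    and ground-truth images. Pixels whose features agree approach 1
    (trusted static content); disagreeing pixels, likely distractors,
    fall toward 0.
    """
    dot = (pred_feats * gt_feats).sum(axis=-1)
    norms = np.linalg.norm(pred_feats, axis=-1) * np.linalg.norm(gt_feats, axis=-1)
    return np.clip(dot / np.maximum(norms, eps), 0.0, 1.0)

feats = np.random.default_rng(0).normal(size=(2, 2, 8))
print(soft_mask_from_features(feats, feats))  # all (approximately) 1.0
```

Using a continuous value rather than a binary cut lets ambiguous pixels be downweighted gradually as the reconstruction improves.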

Empirical Results

Experiments conducted on the NeRF On-the-go, RobustNeRF, and PhotoTourism datasets demonstrate that both the full dual-model and the efficient EMA-proxy instantiations of Asymmetric Dual 3DGS outperform baselines, including WildGaussian, HybridGS, SpotlessSplats, and NeRF-W, in terms of PSNR, SSIM, and LPIPS (Li et al., 4 Jun 2025).

  • PhotoTourism: EMA-GS achieves PSNR 28.50 (vs. 27.77 for WildGaussian) and state-of-the-art results with only 2.9 hours of training per scene compared to 7.2 hours for the baseline.
  • NeRF On-the-go: EMA-GS improves PSNR to 24.12 over previous best of 23.05 and reduces training time per scene to 0.18 hours.

Gains are largest in scenes with transient occluders, dynamic lighting, or high rates of non-static distractors, evidence that artifact suppression is most effective where it is most needed. The EMA-proxy approach retains the core artifact-reduction benefits, achieving scores comparable to full dual training with a 30–40% reduction in computational cost (Li et al., 4 Jun 2025).

Practical Applications

  • Cultural Heritage and Urban Digital Twins: Enables 3D reconstructions from crowdsourced images, robustly handling transient tourist occlusion and scene clutter.
  • AR/VR Scene Capture: Facilitates artifact-resistant asset creation from ordinary, real-world photos.
  • Autonomous Robotics and Navigation: Supports clean mapping from noisy, moving sensor suites.
  • Video Editing and Synthesis: Provides temporally stable, artifact-free reconstructions from dynamic video streams (Li et al., 4 Jun 2025).

The framework is agnostic to the core volumetric rendering backbone and thus may be integrated with other architectures or tasks facing related artifact challenges (Li et al., 4 Jun 2025).

Limitations

  • Mask Dependency: The precision of the hard mask depends on the quality of semantic segmentation and stereo cues; inaccuracies may lead to incomplete suppression of distractors.
  • Tradeoff in Efficiency: While the EMA proxy recovers most of the dual-training benefit, subtle quality degradations may occur in maximally difficult scenarios.
  • Residual Shared Artifacts: If certain artifacts arise from intrinsic data or modeling biases rather than randomness, they may not be suppressed by the cross-model consistency constraint (Li et al., 4 Jun 2025).

Future Directions

  • Generalization: The divergent collaborative approach underlying Asymmetric Dual 3DGS could potentially inform artifact suppression in other architectures, including those beyond 3DGS and neural rendering (Li et al., 4 Jun 2025).
  • Enhanced Masking: Future work may incorporate additional cues (e.g., temporal information, active sample selection) or further diversify model supervision.
  • Scalability: The lightweight, modular design is readily extensible to distributed or large-scale internet photo datasets.

Summary Table

| Component | Purpose/Function | Reference/Equation |
| --- | --- | --- |
| Dual 3DGS + Consistency Loss | Cross-model self-regularization for stable geometry/appearance | Eq. 4, 7 (Li et al., 4 Jun 2025) |
| Multi-Cue Adaptive Hard Mask | Removes pronounced distractors via scene semantics/stereo/residuals | Algorithm (appendix) |
| Self-Supervised Soft Mask | Soft suppression of ambiguous distractors | Eq. 6 (Li et al., 4 Jun 2025) |
| Asymmetric Mask Assignment | Induces decorrelated error modes between models | Eq. 7 (Li et al., 4 Jun 2025) |
| Dynamic EMA Proxy | Efficient approximation of dual-model regularization | Eq. 8, 9 (Li et al., 4 Jun 2025) |

Conclusion

Asymmetric Dual 3D Gaussian Splatting offers a robust, systematic method for artifact suppression in neural 3D reconstruction under challenging, uncontrolled conditions. By combining parallel, masked model optimization with a carefully designed cross-model consistency objective, together with a computationally efficient EMA proxy, it achieves state-of-the-art accuracy across multiple established datasets, especially where artifact suppression is essential. The methodology may have broader relevance for ensemble or consensus-based artifact suppression across vision and learning domains (Li et al., 4 Jun 2025).


Speculative Note

The asymmetric duality principle—employing divergent inference paths and cross-model consistency—may have useful applications in other tasks or architectures beyond neural rendering, particularly where confirmation bias or shared artifact reinforcement is a bottleneck for reliability. This implication is not directly asserted by the source but follows from the framework's design.