Symmetric Auto-regressive Online Restoration
- SAOR is a unified paradigm that restores and synthesizes high-fidelity novel views using symmetric dual-view constraints and auto-regressive inference.
- It couples pretrained 3D Gaussian Splatting with conditional diffusion to achieve photorealistic rendering and context-aware asset harmonization.
- Empirical evaluations show state-of-the-art performance on novel-view synthesis and asset-insertion metrics, demonstrating its utility for autonomous driving simulation.
Symmetric Auto-regressive Online Restoration (SAOR) is a paradigm for high-fidelity scene synthesis and editing, designed to address persistent challenges in autonomous driving (AD) simulation: sparse coverage of rare, long-tail scenarios and the dual requirement of photorealistic rendering with structurally and visually coherent asset manipulation. SAOR achieves state-of-the-art performance by coupling ground-truth-guided dual-view restoration with auto-regressive lateral synthesis and inpainting-based harmonization, all embedded within a unified diffusion framework. The methodology leverages complementary 3D Gaussian Splatting (3DGS) and advanced conditional flow-matching diffusion strategies to generate spatio-temporally consistent novel views and robust context-aware asset insertions.
1. Motivation and Objectives
Data scarcity in AD, arising from rare edge-case occurrences, motivates the development of simulators that can both expand coverage and preserve high fidelity. Conventional approaches fail to simultaneously provide photorealistic, spatio-temporally coherent novel-view rendering and artifact-free, fine-grained traffic-asset editing. Single-view diffusion restorers lack geometric grounding and require expensive supervision, while 3DGS renderers, though efficient and consistent on-axis, exhibit severe degradation in large-angle, off-axis novel-view synthesis and cannot harmonize inserted objects in lighting and shadow. SAOR resolves these limitations by enforcing symmetric paired-view constraints in a dual-view ground-truth restoration training regime, followed by auto-regressive propagation to synthesize consistent lateral views. The same restoration model is repurposed for training-free, context-aware masked inpainting to harmonize vehicle insertions, ensuring seamless shadow and lighting consistency (Liu et al., 25 Dec 2025).
2. Architectural Overview and Pipeline
The SAOR pipeline consists of four interlocking stages:
- Pretrained 3D Gaussian Splatting (3DGS): Two independent 3DGS models are trained on ground-truth images for background scenes and foreground vehicles, respectively. At the central camera pose $p_0$, the ground-truth image $I_c$ is available, while lateral shifts of $\pm\Delta$ yield rendered views $\hat I_l$ and $\hat I_r$.
- Diffusion-Model Training: Training triplets $(I_c, \hat I_l, \hat I_r)$ are encoded using a variational autoencoder (VAE) to obtain latent representations $(z_c, z_l, z_r)$. Noise is injected into $z_c$ via $z_t = (1-t)\,z_c + t\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and a flow-matching diffusion network $v_\theta$ is trained on the concatenated conditionings $(z_l, z_r)$ to minimize $\|v_\theta(z_t, t, z_l, z_r) - (\epsilon - z_c)\|_2^2$.
- Auto-regressive Inference: Starting from the central anchor view $I_0 = I_c$, for each lateral step $k$, the target pose $p_k$, anchor $I_{k-1}$, and coarse rendering $\hat I_k$ are determined. Noise initialization occurs at an intermediate time $t_0$: $z_{t_0} = (1-t_0)\,\hat z_k + t_0\,\epsilon$, followed by 50-step reverse diffusion from $t_0$ to $0$, conditioned on the anchor and coarse latents. Decoding yields high-fidelity views, building a globally consistent novel-view chain.
- 3DGS Refinement: Restored views provide extra supervision for the 3DGS model. The total loss $\mathcal{L} = \mathcal{L}_{gt} + \lambda\,\mathcal{L}_{nv}$, with a ground-truth term $\mathcal{L}_{gt}$ on original poses and a novel-view term $\mathcal{L}_{nv}$ on restored lateral views, closes the loop on scene geometry.
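The four-stage pipeline above can be sketched as a single auto-regressive loop. This is a minimal illustrative skeleton, not the authors' implementation: `render_3dgs` and `restore` are stub placeholders for the real 3DGS renderer and diffusion restorer, and the step size and chain length are arbitrary.

```python
import numpy as np

def render_3dgs(pose):
    # Stub for the coarse 3DGS renderer: returns a dummy image whose
    # pixel values encode the lateral pose (a real renderer goes here).
    return np.full((8, 8, 3), float(pose))

def restore(anchor, coarse):
    # Stub for the dual-view-conditioned diffusion restorer: a plain
    # blend stands in for 50-step reverse diffusion.
    return 0.5 * (anchor + coarse)

def saor_inference(gt_center, delta=1.0, n_steps=3):
    # Auto-regressively synthesize lateral views: each restored view
    # becomes the anchor for the next, farther step.
    views = {0.0: gt_center}
    for sign in (+1.0, -1.0):          # symmetric left/right chains
        anchor = gt_center
        for k in range(1, n_steps + 1):
            pose = sign * k * delta
            coarse = render_3dgs(pose)          # degraded off-axis rendering
            restored = restore(anchor, coarse)  # diffusion restoration (stub)
            views[pose] = restored
            anchor = restored                   # chain the anchor forward
    return views

views = saor_inference(np.zeros((8, 8, 3)))
```

Chaining the anchor is what distinguishes this from independent per-pose restoration: errors do not reset at each step, so each restored view must stay consistent with its predecessor.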
3. Mathematical Formulation
3.1 Dual-view Restoration Objective
Given the central ground-truth view $I_c$ and symmetric rendered images $\hat I_l, \hat I_r$ at lateral offsets $\pm\Delta$, encoding yields latents $(z_c, z_l, z_r)$. Injecting noise at time $t \in [0, 1]$ leads to
$$z_t = (1-t)\,z_c + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
The flow-matching network $v_\theta$ is trained via
$$\mathcal{L}_{fm} = \mathbb{E}_{t,\epsilon}\,\big\|\, v_\theta(z_t, t, z_l, z_r) - (\epsilon - z_c)\,\big\|_2^2.$$
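Under the standard rectified-flow formulation used by Flux-class models, the interpolation and regression target can be written out directly. The sketch below (variable names are mine, not from the paper) verifies the straight-path interpolation and the zero-loss property of a perfect velocity predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
z_c = rng.normal(size=(4, 4))   # clean VAE latent of the ground-truth view
eps = rng.normal(size=(4, 4))   # Gaussian noise sample
t = 0.3                         # diffusion time in [0, 1]

# Linear flow-matching interpolation: z_t travels from data (t=0) to noise (t=1).
z_t = (1.0 - t) * z_c + t * eps

# The network is regressed onto the constant velocity of this straight path.
v_target = eps - z_c

def fm_loss(v_pred, v_target):
    # Mean-squared flow-matching objective for a single sample.
    return float(np.mean((v_pred - v_target) ** 2))
```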
3.2 Auto-regressive Lateral View Generation
For each lateral generation step $k = 1, \dots, K$:
- Anchor: the previously restored view $I_{k-1}$ (with $I_0 = I_c$), encoded to latent $z_{k-1}$.
- Coarse: the 3DGS rendering $\hat I_k$ at pose $p_k$, encoded to latent $\hat z_k$.
Noise initialization occurs at time $t_0$ with $z_{t_0} = (1-t_0)\,\hat z_k + t_0\,\epsilon$. The reverse diffusion ODE is recursively solved,
$$z_{t-\delta t} = z_t - \delta t\, v_\theta(z_t, t, z_{k-1}, \hat z_k),$$
from $t_0$ down to $0$, and the result $z_0$ is finally decoded to produce $I_k$.
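A property worth noting: with the exact velocity of the straight flow-matching path, Euler integration of the reverse ODE from the partial-noising time back to $0$ recovers the clean latent exactly. The sketch below demonstrates this with an oracle velocity standing in for the trained network; all names and the value of the start time are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
z_clean = rng.normal(size=(4, 4))   # clean latent the chain should recover
eps = rng.normal(size=(4, 4))       # noise used for partial initialization
t0 = 0.6                            # intermediate start time (assumed value)

# Partial noise initialization: start from z_{t0}, not from pure noise.
z = (1.0 - t0) * z_clean + t0 * eps

def velocity(z_t, t):
    # Oracle velocity of the straight path; a trained v_theta replaces this.
    return eps - z_clean

# 50 explicit Euler steps integrate the reverse ODE dz/dt = v from t0 to 0.
steps, t = 50, t0
dt = t0 / steps
for _ in range(steps):
    z = z - dt * velocity(z, t)
    t -= dt
```

Starting from a partially noised coarse latent rather than pure noise is what lets the coarse 3DGS rendering steer the result while diffusion repairs its artifacts.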
3.3 3DGS Refinement Losses
The 3DGS refinement uses combined ground-truth and novel-view supervision:
$$\mathcal{L}_{total} = \mathcal{L}_{gt} + \lambda\,\mathcal{L}_{nv},$$
where $\mathcal{L}_{gt}$ compares renderings against ground-truth images at the original poses and $\mathcal{L}_{nv}$ compares renderings at lateral poses against the restored novel views.
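A minimal sketch of the combined refinement objective, assuming simple L1 photometric terms (real 3DGS training typically also includes a D-SSIM component); function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def l1(a, b):
    # Mean absolute photometric error between two images.
    return float(np.mean(np.abs(a - b)))

def refinement_loss(render_gt_pose, gt_image, render_novel_pose, restored_view,
                    lam=0.5):
    # Total loss = ground-truth term + lambda * novel-view term.
    # L1 only here; a D-SSIM term would normally be added to each part.
    return l1(render_gt_pose, gt_image) + lam * l1(render_novel_pose, restored_view)
```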
4. Dual-view Constraints for Fine-Grained Detail Recovery
Single-view restoration is fundamentally ill-posed due to occlusions and view-dependent lighting. By constructing paired symmetric views at $\pm\Delta$, the system exploits additional scene context: complementary structures visible in one view but occluded in the other. The flow-matching network is able to triangulate geometry and texture, adaptively synthesizing per-pixel features by weighing cues from both lateral renderings. This consensus mechanism is responsible for the recovery of crisp lane markings, vehicle boundaries, and complex road textures across large lateral displacements.
5. Training-free Context-Aware Harmonization for Vehicle Insertion
Vehicle insertion is operationalized as a context-aware masked inpainting problem using the same diffusion backbone $v_\theta$:
- Given a pre-inserted 3DRealCar image with latent $z_{ins}$ and binary vehicle mask $M$, the inpainting uses flow-matching reverse diffusion, $z_{t-\delta t} = z_t - \delta t\, v_\theta(z_t, t, \cdot)$,
with background consistency enforced via the RePaint-style composite
$$z_t \leftarrow M \odot z_t + (1 - M) \odot \big[(1-t)\,z_{bg} + t\,\epsilon\big],$$
where $z_{bg}$ is the latent of the original 3DGS rendering. Only the vehicle region adapts its appearance for photometric consistency, while the background remains invariant under restoration.
Final harmonization is achieved by tuning vehicle color and opacity in the 3DRealCar model to minimize
$$\mathcal{L}_{harm} = \big\| M \odot \big( R_{3DGS} - I_{harm} \big) \big\|_1,$$
where $R_{3DGS}$ is the re-rendered inserted vehicle and $I_{harm}$ is the harmonized diffusion output, ensuring asset textures and shading match the harmonized result.
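The RePaint-style masked update can be sketched as follows: at each reverse step, the latent outside the mask is overwritten with a re-noised copy of the original rendering's latent, so only the vehicle region is actually generated. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
z_bg = rng.normal(size=(4, 4))            # latent of the original 3DGS rendering
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                      # 1 = vehicle region to be inpainted

def masked_update(z_gen, z_bg, eps, t, mask):
    # RePaint-style composite: keep generated content inside the mask and
    # a re-noised copy of the known background latent outside it.
    z_known = (1.0 - t) * z_bg + t * eps  # background noised to time t
    return mask * z_gen + (1.0 - mask) * z_known

z_gen = rng.normal(size=(4, 4))           # denoiser output at the current step
z = masked_update(z_gen, z_bg, np.zeros((4, 4)), t=0.0, mask=mask)
```

Because the composite is applied at every step, the generated vehicle region is denoised in the context of the true background at a matching noise level, which is what produces consistent shadows and lighting without any fine-tuning.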
6. Key Implementation Details
SAOR utilizes:
- Diffusion backbone: Flux.1-dev flow-matching model; conditional input via three VAE latents concatenated.
- Fine-tuning: LoRA adapters, rank 128, trained on 4×A100 GPUs for 20,000 iterations.
- VAE output: latent resolution, decoded at unit scale.
- Lateral inference: fixed step size $\Delta$ m per auto-regressive step.
- Denoising: 50 reverse diffusion steps, partial noise initialization at intermediate time $t_0$.
- 3DGS refinement: base StreetGS trained for 50,000 steps; road geometry preprocessing via ground-point filtering.
- Vehicle insertion: identical 50-step diffusion schedule with RePaint-style inpainting.
7. Quantitative Results and Comparative Evaluation
SAOR demonstrates superior performance in novel-view synthesis and realistic asset insertion on Waymo data (40-frame scenes, $3$ m lateral shift):
| Method | NTA-IoU (Cars) | NTL-IoU (Lanes) | FID (Realism) |
|---|---|---|---|
| StreetGS | 0.498 | 53.19 | 130.75 |
| FreeVS | 0.505 | 53.26 | 104.23 |
| ReconDreamer | 0.539 | 54.58 | 93.56 |
| ReconDreamer++* | 0.566 | 56.89 | 75.22 |
| Difix3D+ | 0.578 | 56.94 | 84.12 |
| SAOR (Ours) | 0.582 | 57.91 | 74.82 |
Ablation studies confirm the importance of partial noise initialization (at an intermediate $t_0$ rather than pure noise) and a moderate lateral step size for optimal lane reconstruction. Vehicle insertion benchmarks show the lowest FID (32.60) for context-aware inpainting with fine-tuning, outperforming 3DRealCar-only (41.27), Difix3D+ (53.64), and CosXL-Edit (46.54). Qualitative results show robust preservation of lane markings, accurate handling of occlusions, and seamless shadow/reflection harmonization for inserted vehicles.
Note: ReconDreamer++ utilizes HD-map and bounding-box conditioning, whereas SAOR achieves its results with no extra input.
8. Significance and Applications
Symmetric Auto-regressive Online Restoration constitutes a unified system combining dual-view diffusion training, auto-regressive inference, and training-free masked inpainting for novel-view enhancement and realistic 3D asset insertion. Its integration into AD simulation delivers the photorealism, geometric and photometric consistency, and editing flexibility required for long-tail autonomy research and safety validation. The data-driven, ground-truth-guided restoration pipeline and harmonization mechanism are broadly applicable to scene synthesis tasks demanding controllable, artifact-free generative fidelity (Liu et al., 25 Dec 2025).