
Symmetric Auto-regressive Online Restoration

Updated 31 December 2025
  • SAOR is a unified paradigm that restores and synthesizes high-fidelity novel views using symmetric dual-view constraints and auto-regressive inference.
  • It couples pretrained 3D Gaussian Splatting with conditional diffusion to achieve photorealistic rendering and context-aware asset harmonization.
  • Empirical evaluations show state-of-the-art performance in novel-view synthesis and asset insertion metrics, proving its value for autonomous driving simulation.

Symmetric Auto-regressive Online Restoration (SAOR) is a paradigm for high-fidelity scene synthesis and editing, designed to address persistent challenges in autonomous driving (AD) simulation: sparse coverage of rare, long-tail scenarios and the dual requirement of photorealistic rendering with structurally and visually coherent asset manipulation. SAOR achieves state-of-the-art performance by coupling ground-truth-guided dual-view restoration with auto-regressive lateral synthesis and inpainting-based harmonization, all embedded within a unified diffusion framework. The methodology leverages complementary 3D Gaussian Splatting (3DGS) and advanced conditional flow-matching diffusion strategies to generate spatio-temporally consistent novel views and robust context-aware asset insertions.

1. Motivation and Objectives

Data scarcity in AD, arising from rare edge-case occurrences, motivates the development of simulators that can both expand coverage and preserve high fidelity. Conventional approaches fail to simultaneously provide photorealistic, spatio-temporally coherent novel-view rendering and artifact-free, fine-grained traffic asset editing. Single-view diffusion restorers lack geometric grounding and require expensive supervision, while 3DGS renderers—though efficient and consistent on-axis—exhibit severe degradation in large-angle, off-axis novel view synthesis and cannot harmonize inserted objects in lighting and shadow. SAOR resolves these limitations by enforcing symmetric paired-view constraints in a dual-view ground-truth restoration training regime, followed by auto-regressive propagation to synthesize consistent lateral views. The same restoration model is repurposed for training-free, context-aware masked inpainting to harmonize vehicle insertions, guaranteeing seamless shadow and lighting consistency (Liu et al., 25 Dec 2025).

2. Architectural Overview and Pipeline

The SAOR pipeline consists of four interlocking stages:

  • Pretrained 3D Gaussian Splatting (3DGS): Two independent 3DGS models are trained on ground-truth images for background scenes and foreground vehicles, respectively. At camera pose $C_0$, the ground-truth image $I_0$ is available, while lateral shifts $\pm d$ yield rendered views $I_{+d} = \mathcal{G}(C_{+d})$ and $I_{-d} = \mathcal{G}(C_{-d})$.
  • Diffusion-Model Training: Training triplets $(I_{-d}, I_0, I_{+d})$ are encoded using a variational autoencoder (VAE) to obtain latent representations $(z_{-d}, z_0, z_{+d})$. Noise is injected into $z_0$: $z_{0,t} = (1-t) z_0 + t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and a flow-matching diffusion network $v_\theta$ is trained on the concatenated conditioning $[z_{-d}; z_{0,t}; z_{+d}]$ to minimize $L = \mathbb{E}_{z_0,\epsilon,t} \| v_\theta([z_{-d}; z_{0,t}; z_{+d}], t) - (\epsilon - z_0) \|^2_2$.
  • Auto-regressive Inference: Starting from the central anchor view $z_0$, for each lateral step $r \in \{1,\ldots,R\}$, the target pose $C_{r d}$, anchor $C_{(r-1)d}$, and coarse rendering $C_{(r+1)d}$ are determined. Noise initialization occurs at $t = N_{start}$: $z_{r d, N_{start}} = (1-\sigma_{N_{start}}) z_{(r-1)d} + \sigma_{N_{start}} \epsilon$, followed by reverse diffusion over a 50-step schedule from $N_{start}$ down to $0$, conditioned on $[z_{(r-1)d}; z_{r d,t}; z_{(r+1)d}]$. Decoding yields high-fidelity views $\tilde{I}_{r d}$, building a globally consistent novel-view chain.
  • 3DGS Refinement: Restored views $\tilde{I}_{r d}$ provide extra supervision for the 3DGS model. The total loss $L_{total} = L_{gt} + L_{novel}$, with $L_{gt} = L_{rgb} + \lambda_1 L_{ssim} + \lambda_2 L_{depth}$ and $L_{novel} = L_{rgb} + \lambda_1 L_{ssim}$, closes the loop on scene geometry.

3. Mathematical Formulation

3.1 Dual-view Restoration Objective

Given the central ground-truth view $I_0$ and symmetric rendered images $I_{\pm d}$, encoding yields latents $(z_{-d}, z_0, z_{+d})$. Injecting noise $\epsilon \sim \mathcal{N}(0, I)$ at time $t$ leads to

$$z_{0,t} = (1-t)\, z_0 + t\, \epsilon.$$

The flow-matching network $v_\theta$ is trained via

$$L_{restore} = \mathbb{E}_{z_0,\epsilon,t} \left\| v_\theta([z_{-d}; z_{0,t}; z_{+d}], t) - (\epsilon - z_0) \right\|^2_2.$$
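This objective can be sketched in NumPy. Here `v_theta` is a stand-in callable for the Flux flow-matching network, latents are condensed to small arrays, and channel-axis concatenation of the three latents is an assumption of this sketch:

```python
import numpy as np

def flow_matching_loss(z_minus, z0, z_plus, v_theta, rng):
    """One dual-view restoration training step (sketch).

    v_theta: callable (conditioning, t) -> predicted velocity, same shape as z0.
    """
    eps = rng.standard_normal(z0.shape)       # Gaussian noise sample
    t = rng.uniform()                         # diffusion time in [0, 1]
    z0_t = (1.0 - t) * z0 + t * eps           # linear interpolation path
    cond = np.concatenate([z_minus, z0_t, z_plus], axis=0)  # [z_-d; z_0,t; z_+d]
    target = eps - z0                         # flow-matching velocity target
    pred = v_theta(cond, t)
    return float(np.mean((pred - target) ** 2))  # squared-error loss
```

A trained network drives this loss toward the irreducible variance of the target; a trivial zero predictor yields a finite positive value.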

3.2 Auto-regressive Lateral View Generation

For each lateral generation step $r$:

  • Anchor: $a = z_{(r-1)d}$
  • Coarse: $c = z_{(r+1)d}$

Noise initialization at $t = N_{start}$ gives $z_{r d, N_{start}} = (1-\sigma_{N_{start}})\, a + \sigma_{N_{start}}\, \epsilon$. The reverse diffusion ODE is then solved recursively:

$$z_{r d, t-1} = z_{r d, t} + v_\theta([a; z_{r d, t}; c], t)\, \Delta t$$

and finally decoded to produce $\tilde{I}_{r d}$.
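One auto-regressive step can be sketched as follows, assuming a simple linear noise schedule (the exact Flux schedule is not specified in this summary) and an Euler integrator matching the update rule above:

```python
import numpy as np

def sigma_schedule(n_steps=50):
    # Linear schedule sigma_t = t / n_steps -- an assumption of this sketch.
    return np.linspace(0.0, 1.0, n_steps + 1)

def generate_lateral(z_anchor, z_coarse, v_theta, n_start=10, rng=None):
    """Partially noise the anchor latent at sigma_{N_start}, then run
    Euler reverse-diffusion steps from t = N_start down to 0 (sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigmas = sigma_schedule()
    eps = rng.standard_normal(z_anchor.shape)
    z = (1.0 - sigmas[n_start]) * z_anchor + sigmas[n_start] * eps
    for t in range(n_start, 0, -1):
        cond = np.concatenate([z_anchor, z, z_coarse], axis=0)
        dt = sigmas[t - 1] - sigmas[t]            # negative step toward t = 0
        z = z + v_theta(cond, sigmas[t]) * dt     # Euler update z_{t-1} = z_t + v dt
    return z
```

With `n_start = 0` no noise is injected and the anchor latent passes through unchanged, which is why a partial initialization (around 10 per the ablations) is needed to let the model repair rendering artifacts.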

3.3 3DGS Refinement Losses

The 3DGS refinement uses combined ground-truth and novel-view supervision:

$$L_{total} = L_{gt} + L_{novel}$$

$$L_{gt} = L_{rgb}(I_{gt}, \tilde{I}) + \lambda_1 L_{ssim}(I_{gt}, \tilde{I}) + \lambda_2 L_{depth}(D_{gt}, D_{pred}),$$

$$L_{novel} = L_{rgb}(I_{novel}, \tilde{I}) + \lambda_1 L_{ssim}(I_{novel}, \tilde{I}).$$
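A minimal sketch of this combined objective, with $L_{rgb}$ and $L_{depth}$ taken as L1 terms and the SSIM loss injected as a callable; the weights `lam1` and `lam2` are illustrative (the paper's values are not given in this summary):

```python
import numpy as np

def l1(a, b):
    # Mean absolute error, used here for both the RGB and depth terms.
    return float(np.mean(np.abs(a - b)))

def refinement_loss(render_gt, gt_img, depth_pred, depth_gt,
                    render_novel, novel_img,
                    ssim_loss, lam1=0.2, lam2=0.05):
    """L_total = L_gt + L_novel for 3DGS refinement (sketch).

    ssim_loss: callable (a, b) -> scalar structural-dissimilarity loss.
    """
    l_gt = (l1(render_gt, gt_img)
            + lam1 * ssim_loss(render_gt, gt_img)
            + lam2 * l1(depth_pred, depth_gt))
    l_novel = (l1(render_novel, novel_img)
               + lam1 * ssim_loss(render_novel, novel_img))
    return l_gt + l_novel
```

Note the novel-view term omits depth supervision, since restored views $\tilde{I}_{r d}$ come without ground-truth depth.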

4. Dual-view Constraints for Fine-Grained Detail Recovery

Single-view restoration is fundamentally ill-posed due to occlusions and view-dependent lighting. By constructing paired symmetric views at $\pm d$, the system exploits additional scene context: complementary structures visible in one view but occluded in another. The flow-matching network is able to triangulate geometry and texture, adaptively synthesizing per-pixel features by weighing cues from both $z_{-d}$ and $z_{+d}$. This consensus mechanism is responsible for the recovery of crisp lane markings, vehicle boundaries, and complex road textures across large lateral displacements.

5. Training-free Context-Aware Harmonization for Vehicle Insertion

Vehicle insertion is operationalized as a context-aware masked inpainting problem using the same diffusion backbone $v_\theta$:

  • Given a pre-inserted 3DRealCar image $I_{insert}$ and binary mask $M$, the inpainting uses flow-matching reverse diffusion with

$$z'_{t-1} = z_t + v_\theta([z_{insert}; z_t; z_{insert}], t)\, \Delta t$$

and background consistency enforced via

$$z_{t-1} \leftarrow M\, z'_{t-1} + (1-M)\, z^{orig}_{t-1}$$

where $z^{orig}$ is the latent of the original 3DGS rendering. Only the vehicle region adapts appearance for photometric consistency, while the background is invariant to restoration.
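The two updates above can be sketched together. Following the RePaint-style recipe named in the implementation details, the clean background latent is re-noised to the current level before compositing; the linear schedule and full-noise start are assumptions of this sketch:

```python
import numpy as np

def harmonize_insert(z_insert, z_orig, mask, v_theta, n_steps=10, rng=None):
    """Context-aware masked inpainting (sketch).

    z_insert: latent of the pre-inserted rendering (also the conditioning).
    z_orig:   latent of the original 3DGS background rendering.
    mask:     1 inside the vehicle region, 0 elsewhere.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    sigmas = np.linspace(0.0, 1.0, n_steps + 1)
    z = rng.standard_normal(z_insert.shape)          # start from full noise
    for t in range(n_steps, 0, -1):
        cond = np.concatenate([z_insert, z, z_insert], axis=0)
        dt = sigmas[t - 1] - sigmas[t]               # negative Euler step
        z_pred = z + v_theta(cond, sigmas[t]) * dt   # diffusion update z'_{t-1}
        # Re-noise the clean background latent to level t-1 (RePaint-style).
        z_orig_t = ((1.0 - sigmas[t - 1]) * z_orig
                    + sigmas[t - 1] * rng.standard_normal(z_orig.shape))
        z = mask * z_pred + (1.0 - mask) * z_orig_t  # enforce background
    return z
```

At the final step $\sigma_0 = 0$, so the background region returns exactly to $z^{orig}$, which is precisely the invariance property stated above.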

Final harmonization is achieved by tuning the vehicle color $c_v$ and opacity $\alpha_v$ in the 3DRealCar model $\mathcal{G}_v$ to minimize

$$L_{harm} = \| \tilde{I}_{insert} - \hat{I}_{insert} \|^2_2 + \mu\, L_{ssim}(\tilde{I}_{insert}, \hat{I}_{insert}),$$

ensuring asset textures and shading match the harmonized diffusion output.
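As a toy illustration of this tuning step, a grid search over $(c_v, \alpha_v)$ against the harmonized diffusion output; the SSIM term is replaced by an L1 term for brevity, and `render` stands in for the 3DRealCar Gaussian renderer (both assumptions of this sketch):

```python
import numpy as np

def l_harm(i_tilde, i_hat, mu=0.1):
    # L2 fidelity term plus a structural term (L1 here instead of SSIM).
    return float(np.mean((i_tilde - i_hat) ** 2)
                 + mu * np.mean(np.abs(i_tilde - i_hat)))

def tune_vehicle(i_target, render, colors, alphas):
    """Grid-search vehicle color c_v and opacity alpha_v to best match the
    harmonized diffusion output i_target (sketch)."""
    best = min(((l_harm(i_target, render(c, a)), c, a)
                for c in colors for a in alphas),
               key=lambda x: x[0])
    return best[1], best[2]
```

In practice the paper tunes these asset parameters by minimizing $L_{harm}$ directly; gradient-based optimization would replace the grid here.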

6. Key Implementation Details

SAOR utilizes:

  • Diffusion backbone: Flux.1-dev flow-matching model; conditional input via three VAE latents concatenated.
  • Fine-tuning: LoRA adapters, rank 128, trained on 4×A100 GPUs for 20,000 iterations.
  • VAE output: $256 \times 256$ latent resolution, decoded at unit scale.
  • Lateral inference: step size $d = 0.5$ m.
  • Denoising: 50 reverse diffusion steps, noise initialization $N_{start} = 10$.
  • 3DGS refinement: base StreetGS trained for 50,000 steps; road geometry preprocessing via ground-point filtering.
  • Vehicle insertion: identical 50-step diffusion schedule with RePaint-style inpainting.
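Under these settings ($d = 0.5$ m steps, out to the 3 m lateral shift used in evaluation), the chain of lateral camera offsets visited by auto-regressive inference can be enumerated; the function name and signature here are illustrative:

```python
def lateral_poses(x0, d=0.5, max_shift=3.0):
    """Lateral camera x-offsets for the auto-regressive chain: symmetric
    steps of size d (metres) out to max_shift on each side of the anchor."""
    r_max = int(round(max_shift / d))
    return [x0 + s * r * d for r in range(1, r_max + 1) for s in (+1, -1)]
```

Each offset is generated from its inner neighbor as anchor, so errors are bounded per step rather than accumulating over one large extrapolation.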

7. Quantitative Results and Comparative Evaluation

SAOR demonstrates superior performance in novel-view synthesis and realistic asset insertion on Waymo data (40-frame scenes, $3$ m lateral shift):

| Method | NTA-IoU (Cars) | NTL-IoU (Lanes) | FID (Realism) |
|---|---|---|---|
| StreetGS | 0.498 | 53.19 | 130.75 |
| FreeVS | 0.505 | 53.26 | 104.23 |
| ReconDreamer | 0.539 | 54.58 | 93.56 |
| ReconDreamer++* | 0.566 | 56.89 | 75.22 |
| Difix3D+ | 0.578 | 56.94 | 84.12 |
| SAOR (Ours) | 0.582 | 57.91 | 74.82 |

Ablation studies confirm the importance of partial noise initialization ($N_{start} \approx 10$) and a moderate lateral step ($d = 0.5$ m) for optimal lane reconstruction. Vehicle insertion benchmarks indicate the lowest FID (32.60) for context-aware inpainting and fine-tuning, outperforming 3DRealCar-only (41.27), Difix3D+ (53.64), and CosXL-Edit (46.54). Qualitative results show robust preservation of lane markings, accurate handling of occlusions, and seamless shadow/reflection harmonization for inserted vehicles.

Note: ReconDreamer++ utilizes HD-map and bounding-box conditioning, whereas SAOR achieves results with no extra input.

8. Significance and Applications

Symmetric Auto-regressive Online Restoration constitutes a unified dual-view diffusion training, auto-regressive inference, and zero-shot masked inpainting system for novel-view enhancement and realistic 3D asset insertion. Its integration into AD simulation delivers photorealism, geometric and photometric consistency, and editing flexibility required for long-tail autonomy research and safety validation. The data-driven, ground-truth-guided restoration pipeline and harmonization mechanism are broadly applicable to scene synthesis tasks demanding controllable, artifact-free generative fidelity (Liu et al., 25 Dec 2025).
