
Seoul World Model (SWM)

Updated 20 March 2026
  • Seoul World Model (SWM) is a large-scale video simulation system that generates spatially faithful urban trajectories using geo-indexed street-view data.
  • It integrates autoregressive diffusion with retrieval-augmented conditioning and cross-temporal pairing to ensure long-horizon stability and realistic outputs.
  • Empirical evaluations show SWM achieves superior metrics in FID, FVD, and image quality, marking significant progress in urban video simulation.

Seoul World Model (SWM) is a large-scale, retrieval-augmented autoregressive video world simulation model anchored to the real urban geography and structure of Seoul. Unlike prior world models that generate visually plausible but entirely imagined environments, SWM leverages geo-indexed street-view imagery to produce kilometer-scale video predictions that are spatially faithful and temporally coherent, enabling controlled simulation over real city layouts with support for diverse camera movements and textual prompts (Seo et al., 16 Mar 2026).

1. Architectural Overview

SWM integrates a pretrained autoregressive video diffusion transformer with a suite of retrieval and conditioning mechanisms purpose-built for city-scale simulation. The system architecture is organized into five sequential modules:

  1. Input Specification: Defines the start location ($c_0$), camera trajectory ($C$), and text prompt ($P$).
  2. Street-View Retrieval: Selects $K$ geo-indexed panoramas near the target route, with associated depth and pose metadata.
  3. Conditioning Encoder: Applies both depth-based geometric referencing (warping) and semantic referencing (latent injection). A Virtual Lookahead Sink provides a persistent future scene latent for long-horizon stability.
  4. Autoregressive Diffusion: A diffusion transformer in 3D-VAE latent space rolls out video in temporal chunks, ingesting previous history, trajectory, noisy target latents, references, and text.
  5. VAE Decoding: Generated latents are decoded into RGB video frames.

This design supports continuous rollouts over trajectories of hundreds of meters by chaining sequential chunks, each conditioned on both reference imagery and a predictive anchor.
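The chunk chaining described here can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_chunk` is a hypothetical stand-in for modules 2 to 5 (retrieval, conditioning, diffusion, decoding), and `chunk_len` and `hist_len` are assumed parameters that do not come from the source.

```python
from typing import Callable, List, Sequence

Pose = tuple  # hypothetical pose representation, e.g. (x, y, heading)
Frame = str   # placeholder frame type for this sketch


def rollout(
    poses: Sequence[Pose],
    chunk_len: int,
    hist_len: int,
    generate_chunk: Callable[[Sequence[Pose], List[Frame]], List[Frame]],
) -> List[Frame]:
    """Generate a long video by chaining chunks.

    Each chunk is conditioned on its own poses and on the tail of the
    previously generated chunk (the "history").
    """
    video: List[Frame] = []
    history: List[Frame] = []
    for start in range(0, len(poses), chunk_len):
        chunk_poses = poses[start:start + chunk_len]
        frames = generate_chunk(chunk_poses, history)
        video.extend(frames)
        # Carry the last hist_len frames forward (empty if hist_len is 0).
        history = frames[max(0, len(frames) - hist_len):]
    return video


def dummy_generator(chunk_poses, history):
    # Stand-in for the real model: labels each frame with its pose and
    # the number of history frames it was conditioned on.
    return [f"frame@{p}|hist={len(history)}" for p in chunk_poses]


poses = [(float(i), 0.0, 0.0) for i in range(10)]
video = rollout(poses, chunk_len=4, hist_len=2, generate_chunk=dummy_generator)
```

Only the first chunk starts without history; every later chunk sees the tail of its predecessor, which is what lets the rollout extend past a single chunk.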

2. Mathematical Formalism

Let $X = \{x_t\}_{t=0}^{T-1}$ denote a sequence of $T$ consecutive frames and $Z = \{z_l\}_{l=0}^{L-1}$ their corresponding VAE latents, where $L = T/4$ under temporal compression. Video is produced in $N$ chunks; for chunk $i$:

  • $C^{(i)}$: Camera poses for the chunk
  • $P^{(i)}$: Text prompt
  • $Z_\text{hist}^{(i)}$: Tail latents from the previous chunk
  • $Z_\text{noisy}^{(i)}$: Gaussian-noised target latents

The forward noising process used during training is:

$$Z_t^{(i)} = \alpha_t Z_0^{(i)} + \sqrt{1-\alpha_t^2}\, \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,I), \quad t \sim \text{Uniform}\{1, \ldots, T_\text{diff}\}$$

Here $t$ indexes the diffusion step, not the frame index used for $x_t$ above. The objective is the standard noise-prediction diffusion loss:

$$L_\text{diff} = \mathbb{E}_{Z_0,\varepsilon,t}\left[\|\varepsilon - \varepsilon_\theta(Z_t^{(i)}; \text{cond}^{(i)})\|_2^2\right]$$

where $\text{cond}^{(i)}$ aggregates camera trajectory, text, reference images, warped frames, the Virtual Lookahead Sink latent, and other conditioning sources.
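As a small numerical illustration of the two formulas, the sketch below noises a toy latent and evaluates the squared-error term for a single sample. It is not the SWM training code: the latent is a flat list of floats, `alpha_t = 0.8` is a hypothetical schedule value, and the network $\varepsilon_\theta$ is replaced by hand-written predictions.

```python
import math
import random


def add_noise(z0, alpha_t, rng):
    """Forward noising: z_t = alpha_t * z0 + sqrt(1 - alpha_t^2) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    sigma_t = math.sqrt(1.0 - alpha_t ** 2)
    z_t = [alpha_t * z + sigma_t * e for z, e in zip(z0, eps)]
    return z_t, eps


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise (one sample)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]   # toy latent, flattened
alpha_t = 0.8           # hypothetical schedule value
z_t, eps = add_noise(z0, alpha_t, rng)

# A perfect predictor returns eps exactly, giving zero loss.
perfect_loss = noise_prediction_loss(eps, eps)
# Predicting all zeros costs ||eps||^2.
zero_loss = noise_prediction_loss(eps, [0.0] * len(eps))
```

In training, the expectation in $L_\text{diff}$ would be approximated by averaging this per-sample term over batches of latents, noise draws, and diffusion steps.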

3. Retrieval-Augmented Conditioning and Cross-Temporal Pairing

SWM's hallmark is retrieval-augmented conditioning: for each chunk, up to $K$ reference street-view panoramas are selected via a two-stage process:

  1. Geo-Euclidean proximity: Select the $K_\text{geo}$ panoramas closest to the trajectory using projected latitude/longitude.
  2. Depth-coverage filtering: For each candidate $k$, compute $S_\text{cov}(k)$, the fraction of reprojected pixels that overlap with the target view. Retain references exceeding a threshold $\tau_\text{cov}$.
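The two-stage selection can be sketched as follows. This is an illustrative simplification under stated assumptions: the `Panorama` class and `retrieve` function are hypothetical names, coordinates are assumed to be already projected to meters, and the coverage score $S_\text{cov}$ is taken as a precomputed field rather than derived from depth reprojection.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Panorama:
    pano_id: str
    x: float         # projected easting in meters (assumed)
    y: float         # projected northing in meters (assumed)
    coverage: float  # precomputed S_cov in [0, 1] (assumed given)


def distance_to_route(p: Panorama, route: Sequence[Tuple[float, float]]) -> float:
    """Distance from a panorama to the nearest sampled route point."""
    return min(math.hypot(p.x - rx, p.y - ry) for rx, ry in route)


def retrieve(panos, route, k_geo, tau_cov, k) -> List[Panorama]:
    # Stage 1: keep the k_geo panoramas closest to the route.
    nearest = sorted(panos, key=lambda p: distance_to_route(p, route))[:k_geo]
    # Stage 2: keep candidates whose coverage exceeds the threshold.
    covered = [p for p in nearest if p.coverage > tau_cov]
    # Return at most k references, nearest first.
    return covered[:k]


route = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
panos = [
    Panorama("a", 1.0, 1.0, 0.9),
    Panorama("b", 10.0, 2.0, 0.2),      # close, but poor coverage
    Panorama("c", 19.0, 3.0, 0.6),
    Panorama("d", 100.0, 100.0, 0.95),  # good coverage, but far away
]
selected = retrieve(panos, route, k_geo=3, tau_cov=0.5, k=2)
```

In this toy example, panorama `d` is rejected by the proximity stage and `b` by the coverage stage, leaving `a` and `c`.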

To ensure the model distinguishes static urban geometry from scene transients (e.g., vehicles, pedestrians), SWM employs cross-temporal pairing during training, enforcing a minimum time offset $|t_\text{capture}(\text{target}) - t_\text{capture}(\text{ref})| \geq \Delta_t$ between target frames and reference images. This data-structural constraint compels the model to leverage persistent spatial cues rather than simply copying dynamic foreground elements.
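The pairing constraint amounts to a filter on capture times. The sketch below assumes capture times are available as calendar dates; the function names, the example dates, and the 180-day value of $\Delta_t$ are all hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def is_valid_pair(target_capture, ref_capture, min_gap):
    """True if |t_capture(target) - t_capture(ref)| >= min_gap."""
    return abs(target_capture - ref_capture) >= min_gap


def filter_references(target_capture, ref_captures, min_gap):
    """Keep only references captured far enough in time from the target."""
    return [r for r in ref_captures if is_valid_pair(target_capture, r, min_gap)]


target = date(2024, 6, 1)
candidates = [date(2024, 5, 30), date(2023, 6, 1), date(2025, 1, 15)]
kept = filter_references(target, candidates, min_gap=timedelta(days=180))
```

The reference captured two days before the target is dropped, since it could share the same parked cars or pedestrians; the other two, one earlier and one later, are kept because the constraint uses the absolute time difference.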

4. Long-Horizon Stabilization and View Interpolation

A core technical challenge is mitigating error accumulation and drift during long-horizon synthesis. SWM addresses this via a "Virtual Lookahead Sink": a future anchor latent $z_\text{VL}^{(i)}$ (drawn from the nearest real image to the chunk endpoint $c_{T-1}^{(i)}$) is appended to the input token sequence at each chunk boundary. The corresponding positions are offset in the rotary position embeddings (RoPE) to preserve temporal order. During training, ground-truth future latents at random distances serve as the sink to teach the model to rely on the anchor.
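A rough sketch of the anchor mechanism, under stated assumptions: the source does not specify how the RoPE offset is computed, so `position_gap` below is a hypothetical parameter, latents are represented by placeholder strings, and `nearest_image` and `append_sink` are illustrative names rather than SWM functions.

```python
import math


def nearest_image(endpoint, images):
    """Pick the real image whose location is closest to the chunk endpoint."""
    return min(
        images,
        key=lambda im: math.hypot(im["x"] - endpoint[0], im["y"] - endpoint[1]),
    )


def append_sink(tokens, sink_latent, position_gap):
    """Append the sink after the chunk tokens with a later position index.

    The sink's position is pushed position_gap steps past the last chunk
    token so that it reads as a future frame (hypothetical offset scheme).
    """
    positions = list(range(len(tokens)))
    sink_position = len(tokens) - 1 + position_gap
    return tokens + [sink_latent], positions + [sink_position]


images = [
    {"id": "p1", "x": 0.0, "y": 0.0, "latent": "z_p1"},
    {"id": "p2", "x": 48.0, "y": 2.0, "latent": "z_p2"},
]
chunk_tokens = ["z0", "z1", "z2", "z3"]
endpoint = (50.0, 0.0)  # location of the last pose in the chunk

anchor = nearest_image(endpoint, images)
tokens, positions = append_sink(chunk_tokens, anchor["latent"], position_gap=8)
```

The gap between the last chunk position and the sink position is what signals to the model that the anchor lies ahead in time rather than being the next frame.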

Street-view imagery is only available at discrete, spatially sparse locations. To synthesize continuous motion, SWM utilizes a view interpolation pipeline:

  • Construct a freeze-frame video sequence by repeating each keyframe, with small shifts for in-between frames.
  • Encode the extended sequence with a pretrained 3D-VAE.
  • During diffusion training, only clean keyframe latents condition the model.
  • After denoising, keep one frame per keyframe and discard repeats.

Empirically, this improves adherence to the keyframe (PSNR increases from 22.5 to 25.0, LPIPS decreases from 0.245 to 0.162 on benchmark datasets).
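The repeat-then-subsample bookkeeping in the list above can be illustrated with a toy example. This sketch omits the small shifts applied to in-between frames as well as the 3D-VAE encoding and denoising steps; the `repeats=4` value is an assumption chosen to match the 4x temporal compression stated in Section 2, and both function names are hypothetical.

```python
def build_freeze_frame_sequence(keyframes, repeats):
    """Repeat each keyframe `repeats` times to form a pseudo-video."""
    return [kf for kf in keyframes for _ in range(repeats)]


def keep_one_per_keyframe(frames, repeats):
    """After processing, keep the first frame of each repeated group."""
    return frames[::repeats]


keyframes = ["k0", "k1", "k2"]
sequence = build_freeze_frame_sequence(keyframes, repeats=4)
recovered = keep_one_per_keyframe(sequence, repeats=4)
```

With `repeats` equal to the temporal compression factor, each group of repeated frames would map to a single latent, which is presumably why repeating keyframes yields clean per-keyframe latents.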

5. Dataset Construction and Training Paradigms

In addition to leveraging real-world street-view, SWM's training set contains 12,700 synthetic videos rendered in CARLA across six urban maps, including:

  • Pedestrian trajectories (sidewalks, crosswalks)
  • Vehicle trajectories (urban, highway, intersections)
  • Free-camera paths (randomized 3D trajectories)

Synthetic videos are aligned with real-world locations via GPS and rendered under varying lighting, weather, and traffic to enhance robustness.

Training operates in two stages:

  • Teacher-Forcing Pretraining: Condition on ground-truth latents.
  • Self-Forcing Fine-Tuning: Gradually replace history with the model's own predictions to enforce consistency with past generations.

Classifier-free guidance drops text, references, or warped frames at random to enable partial or unconditional inference at test time. Cross-temporal regularization is implicit via pairing rather than explicit loss.
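The random conditioning dropout used for classifier-free guidance might look like the following. This is a sketch under assumptions: the dictionary keys, the per-condition independent dropout, and the `p_drop` parameter are hypothetical, since the source does not give the dropout probabilities or whether conditions are dropped jointly.

```python
import random

# Conditions that may be dropped, per the text: text, references, warped frames.
DROPPABLE = ("text", "references", "warped_frames")


def drop_conditions(cond, p_drop, rng):
    """Return a copy of cond with each droppable entry set to None
    independently with probability p_drop."""
    out = dict(cond)
    for key in DROPPABLE:
        if key in out and rng.random() < p_drop:
            out[key] = None
    return out


rng = random.Random(0)
cond = {
    "camera": "C",
    "text": "P",
    "references": ["r1", "r2"],
    "warped_frames": ["w1"],
}
never = drop_conditions(cond, 0.0, rng)    # nothing dropped
always = drop_conditions(cond, 1.0, rng)   # all droppable entries dropped
sampled = drop_conditions(cond, 0.1, rng)  # typical training-time call
```

Because the model sees every combination of present and missing conditions during training, it can later run with only a subset of them, for example with references but no text prompt.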

6. Evaluation and Empirical Results

SWM is evaluated on Busan-City-Bench and Ann-Arbor-City-Bench datasets (30 sequences × 365 frames each), excluding any reference imagery from test routes. Quantitative assessment spans:

  • Visual & Temporal Metrics: FID, FVD, VBench image quality
  • Camera-following Accuracy: Rotational error (RotErr), translational error (TransErr)
  • 3D Adherence in Static Regions: mPSNR, mLPIPS outside dynamic-object masks
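As an illustration of the static-region metrics, a masked PSNR can be computed by skipping pixels flagged as dynamic. This is a generic sketch, not the benchmark's evaluation code: images are flattened lists of intensities in [0, 1], and the mask convention (`True` marks a dynamic-object pixel) is an assumption.

```python
import math


def masked_psnr(pred, target, dynamic_mask, max_val=1.0):
    """PSNR over pixels where dynamic_mask is False (static regions only)."""
    squared_error = 0.0
    count = 0
    for p, t, is_dynamic in zip(pred, target, dynamic_mask):
        if not is_dynamic:
            squared_error += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask excludes every pixel")
    mse = squared_error / count
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


pred = [0.4, 0.0, 0.9]
target = [0.5, 1.0, 1.0]
mask = [False, True, False]  # the middle pixel belongs to a moving object
score = masked_psnr(pred, target, mask)
```

The large error on the masked pixel does not affect the score, so a model is not penalized for rendering different vehicles or pedestrians than the ground-truth capture.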

A summary of benchmarking results (each cell gives a pair of values, which appear to correspond to the two benchmarks in the order listed above):

| Method | FID ↓ | FVD ↓ | ImgQ ↑ | RotErr ↓ | TransErr ↓ | mPSNR ↑ | mLPIPS ↓ |
|---|---|---|---|---|---|---|---|
| Aether | 141.2/132.8 | 1096.5/1214.8 | 0.55/0.51 | 0.030/0.078 | 0.083/0.192 | 11.10/13.03 | 0.671/0.635 |
| DeepVerse | 130.3/182.9 | 892.6/1524.9 | 0.53/0.46 | 0.062/0.251 | 0.103/0.469 | 12.20/13.43 | 0.679/0.727 |
| Yume1.5 | 54.8/85.6 | 425.2/993.6 | 0.73/0.61 | 0.153/0.326 | 0.104/0.271 | 12.09/14.15 | 0.667/0.623 |
| HY-World1.5 | 49.6/67.0 | 544.0/864.8 | 0.78/0.54 | 0.044/0.193 | 0.079/0.221 | 11.87/14.26 | 0.588/0.575 |
| FantasyWorld | 83.5/67.7 | 783.1/917.6 | 0.63/0.49 | 0.056/0.215 | 0.141/0.302 | 10.01/11.97 | 0.654/0.592 |
| LingBot | 62.1/58.0 | 717.4/1039.5 | 0.75/0.60 | 0.081/0.269 | 0.073/0.239 | 10.48/12.51 | 0.645/0.641 |
| SWM (TF) | 28.4/56.6 | 301.8/640.2 | 0.78/0.66 | 0.020/0.055 | 0.015/0.154 | 14.56/15.18 | 0.392/0.481 |
| SWM (SF) | 32.5/44.0 | 325.9/779.9 | 0.77/0.57 | 0.028/0.217 | 0.033/0.208 | 13.52/14.20 | 0.478/0.573 |

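In both columns of every metric except image quality, an SWM variant is strictly best: lowest FID and FVD, lowest rotational and translational camera error, and best static-region fidelity (mPSNR, mLPIPS). SWM (TF) is best throughout apart from the second FID value, where SWM (SF) leads with 44.0. On VBench image quality, SWM (TF) ties HY-World1.5 at 0.78 in the first column and leads with 0.66 in the second. Notably, text-prompted edits (e.g., "flood," "sunset," "snowfall") preserve underlying geometry, and virtual lookahead anchoring enables stable video generation across trajectories longer than 1 km.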

7. Limitations and Future Prospects

Current limitations include reliance on street-view datasets that are spatially and temporally sparse, which necessitates the synthetic video augmentation and view interpolation pipeline. Although the model is robust to real city structure, the density of reference images limits fine-grained adherence in some situations. Long-horizon consistency, even with Virtual Lookahead Sinks, could degrade as errors in pose estimation or gaps in reference coverage compound.

Future work envisions incorporating denser real capture streams, extension to multiple city geographies, and further scaling of model and retrieval capacity. An implicit suggestion is that generalization across both real and synthetic domains will benefit from continued advances in simulation realism and representational alignment (Seo et al., 16 Mar 2026).
