GeoUNet: Topology-Aware Trajectory Synthesis
- GeoUNet is a UNet architecture that leverages geo-aware attention and RoadMAE embeddings to condition trajectory synthesis on road topology.
- The network combines multi-scale convolution, residual blocks, and cross-attention to ensure generated trajectories obey real-world geographic constraints.
- Empirical results show that GeoUNet outperforms baseline models on density, trip, and length error metrics, demonstrating superior fidelity and controllability.
GeoUNet is a UNet-shaped denoiser network tailored for conditional diffusion-based generation of geographic trajectories. It is the core component of ControlTraj, a controllable trajectory synthesis framework that incorporates road network topology and trip attributes to generate high-fidelity, human-directed trajectory data. GeoUNet integrates multi-scale convolutional features, geo-aware attention mechanisms, and residual connections, conditioned on embeddings of road topology (learned by a masked autoencoder, RoadMAE) and trip attributes, to guide reverse diffusion sampling and ensure that synthesized trajectories obey real-world geographic and topological constraints (Zhu et al., 2024).
1. Architecture of GeoUNet
GeoUNet adopts a symmetric UNet architecture characterized by a down-sampling path (encoder) and an up-sampling path (decoder), each comprising four hierarchical blocks:
- Down-Sampling Path: Consists of 4 Geo-Down blocks. Each block implements:
- Two ResNet sub-blocks (with GroupNorm, convolution, nonlinearity, and skip connection);
- Geo-self-attention and geo-cross-attention layers for feature fusion;
- Max-pooling for resolution reduction.
- Up-Sampling Path: Comprises 4 Geo-Up blocks. Each block performs:
- Nearest-neighbor or linear upsampling;
- Two ResNet sub-blocks;
- Geo-self and geo-cross-attention;
- Skip connections from the corresponding down-sampling block, preserving multi-scale context.
- Channel Progression: The channel dimension is set per encoder block as a multiple of a base width $C$ (with $C$ typically 32 or 64) and is mirrored during decoding.
GeoUNet’s distinctive feature is geo-attention fusion at every block. For each block, the feature map $h$ is updated via combined geo-self-attention (intra-feature) and geo-cross-attention (interacting with the control vector $c$). The control vector concatenates the RoadMAE topological embedding $e_{\text{road}}$ and the Wide-and-Deep trip-attribute embedding $e_{\text{attr}}$:
$$c = [\, e_{\text{road}} \,\|\, e_{\text{attr}} \,].$$
The resulting attention outputs are computed as
$$\mathrm{SelfAttn}(h) = \mathrm{softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d}}\right) V_s, \qquad \mathrm{CrossAttn}(h, c) = \mathrm{softmax}\!\left(\frac{Q_x K_x^{\top}}{\sqrt{d}}\right) V_x,$$
where $Q_s, K_s, V_s$ are the self-attention projections of $h$, and $Q_x$ is projected from $h$ while $K_x, V_x$ are the cross-attention projections of $c$. This hierarchical fusion enables multi-scale contextual reasoning and direct topology injection at every resolution.
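A minimal PyTorch sketch of this geo-attention fusion is given below. It is not the authors' implementation: the class name, GroupNorm group count, head count, and tensor layout are assumptions; only the pattern (self-attention over the trajectory feature sequence, cross-attention against the control tokens, residual connection) follows the description above.

```python
# Illustrative sketch (not the authors' code) of one geo-attention fusion step:
# self-attention over features plus cross-attention to c = [e_road ; e_attr].
import torch
import torch.nn as nn

class GeoAttentionBlock(nn.Module):
    def __init__(self, channels: int, cond_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)
        # Geo-self-attention over the length-wise feature sequence.
        self.self_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Geo-cross-attention: queries from features, keys/values from the control tokens.
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.cond_proj = nn.Linear(cond_dim, channels)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, C, L) convolutional feature map; c: (B, N_cond, cond_dim) control tokens.
        x = self.norm1(h).transpose(1, 2)            # (B, L, C)
        x = x + self.self_attn(x, x, x)[0]           # intra-feature fusion
        ctx = self.cond_proj(c)                      # (B, N_cond, C)
        x = x + self.cross_attn(x, ctx, ctx)[0]      # topology/attribute injection
        x = x.transpose(1, 2)                        # back to (B, C, L)
        return h + self.norm2(x)                     # residual connection

# Example shapes (illustrative): 64 channels, 128-dim condition, length-200 features.
block = GeoAttentionBlock(channels=64, cond_dim=128)
out = block(torch.randn(8, 64, 200), torch.randn(8, 16, 128))   # -> (8, 64, 200)
```

Per the block descriptions above, a Geo-Down block would wrap two ResNet sub-blocks around this fusion and follow it with max-pooling, while a Geo-Up block would precede it with upsampling and a skip connection.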
2. Topological Context Encoding via RoadMAE
GeoUNet leverages road network information through embeddings generated by RoadMAE, a masked transformer autoencoder trained on sequences of raw GPS points representing road segments. The processing pipeline includes:
- Patchifying: Each road segment, given as a sequence of raw GPS points $s = (p_1, \dots, p_L)$, is partitioned into $N = \lceil L / P \rceil$ patches for a fixed patch size $P$.
- Random Masking: A binary mask with ratio $\gamma$ is applied, masking out $\lfloor \gamma N \rfloor$ patches during training for self-supervised reconstruction.
- Transformer Encoder/Decoder: The encoder extracts $e_{\text{road}}$ as the fine-grained topological embedding, while the decoder reconstructs the masked input points to minimize the loss
$$\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{p}_i - p_i \rVert_2^2,$$
where $\mathcal{M}$ indexes the masked points.
The resulting embedding $e_{\text{road}}$ (frozen at generation time) encapsulates road-segment connectivity and geometry, enabling topology-aware trajectory synthesis in GeoUNet without requiring explicit adjacency matrices or Laplacian regularization.
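The patchify-and-mask preprocessing can be illustrated with a short sketch. The patch size, mask ratio, and lon/lat layout below are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of RoadMAE-style preprocessing: patchify a road segment's GPS
# points and randomly mask a fraction of the patches for reconstruction training.
import torch

def patchify(points: torch.Tensor, patch_size: int) -> torch.Tensor:
    """points: (L, 2) lon/lat sequence -> (N, patch_size * 2) patch tokens."""
    L = points.shape[0] - points.shape[0] % patch_size   # drop the ragged tail
    return points[:L].reshape(-1, patch_size * 2)

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.5):
    """Return (visible_patches, mask) where mask[i] is True for masked patches."""
    n = patches.shape[0]
    n_masked = int(mask_ratio * n)
    perm = torch.randperm(n)
    mask = torch.zeros(n, dtype=torch.bool)
    mask[perm[:n_masked]] = True
    return patches[~mask], mask

# Example: a 120-point segment, patch size 8, 50% masking.
segment = torch.randn(120, 2)
patches = patchify(segment, patch_size=8)      # (15, 16)
visible, mask = random_mask(patches, 0.5)      # the encoder sees only `visible`
```

The transformer encoder consumes only the visible patches; the decoder is then asked to reconstruct the masked ones, which is what drives the reconstruction loss above.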
3. Conditional Diffusion Process
GeoUNet is employed as the denoising network in a conditional diffusion model for trajectory generation. The process is formulated as follows:
- Forward Process (Noising): For a real trajectory $x_0$, noise is added over $T$ steps according to
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right).$$
By reparameterization, with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$
- Reverse Process (Denoising, Conditioned on $c$):
$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t, c),\, \sigma_t^2 \mathbf{I}\right),$$
where the mean is parameterized as
$$\mu_\theta(x_t, t, c) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t, c)\right)$$
with $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. The noise $\epsilon_\theta(x_t, t, c)$ is estimated by GeoUNet.
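These updates follow standard DDPM algebra; the hedged sketch below implements both directions. Here `geo_unet` stands in for the GeoUNet noise estimator (its signature is an assumption), and the step count and $\beta$ range are illustrative placeholders rather than the paper's exact schedule.

```python
# Hedged sketch of the conditional diffusion equations above.
import torch

T = 500                                         # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.05, T)           # linear beta schedule (illustrative values)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    ab = alpha_bars[t].view(-1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

@torch.no_grad()
def p_sample_step(geo_unet, x_t, t, c):
    """One reverse step x_t -> x_{t-1}, conditioned on the control vector c."""
    eps_hat = geo_unet(x_t, t, c)                        # predicted noise
    a, ab = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (x_t - betas[t] / (1 - ab).sqrt() * eps_hat) / a.sqrt()
    if t == 0:
        return mean
    var = (1 - ab_prev) / (1 - ab) * betas[t]            # sigma_t^2 from the text
    return mean + var.sqrt() * torch.randn_like(x_t)
```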
4. Training and Inference Procedures
The training of GeoUNet in the ControlTraj framework proceeds as:
- RoadMAE Pretraining: The RoadMAE autoencoder is pretrained with the reconstruction loss $\mathcal{L}_{\text{MAE}}$, and its weights are then frozen for downstream trajectory synthesis.
- Diffusion Model Training: For each sampled real trajectory $x_0$ and random time step $t \sim \mathrm{Uniform}\{1, \dots, T\}$:
- Compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
- Use GeoUNet to predict $\hat{\epsilon} = \epsilon_\theta(x_t, t, c)$.
- Minimize the mean squared error $\mathcal{L}_{\text{diff}} = \mathbb{E}_{x_0, \epsilon, t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\big]$.
- Total Loss: When trained end-to-end, the total loss combines the diffusion and RoadMAE terms, $\mathcal{L} = \mathcal{L}_{\text{diff}} + \mathcal{L}_{\text{MAE}}$, though typically only $\mathcal{L}_{\text{diff}}$ is used during GeoUNet training because RoadMAE is frozen.
- Hyperparameters: Reported settings include a batch size of 1024, a linear noise ($\beta$) schedule over the diffusion steps, and skip steps = 5 for DDIM acceleration; the specific learning rate, step count, and embedding dimension are given in Zhu et al. (2024).
- Sampling: At inference, the embeddings $e_{\text{road}}$ and $e_{\text{attr}}$ are computed from the road segments and trip attributes, concatenated as $c = [\, e_{\text{road}} \,\|\, e_{\text{attr}} \,]$, and provided to GeoUNet, which iteratively denoises sampled white noise $x_T \sim \mathcal{N}(0, \mathbf{I})$. DDIM acceleration is supported by skipping steps.
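Putting the pieces together, the sketch below shows one training step and a DDIM-style accelerated sampling loop. It reuses `T`, `alpha_bars`, and `q_sample` from the previous sketch; `geo_unet` remains a stand-in with an assumed signature, and the skip interval is only an example value.

```python
# Illustrative training step and accelerated sampling loop (not the authors' code).
import torch
import torch.nn.functional as F

def training_step(geo_unet, x0, c, optimizer):
    """One diffusion training step: predict the injected noise and minimize MSE."""
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                       # random time steps
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)                          # forward noising (defined above)
    loss = F.mse_loss(geo_unet(x_t, t, c), eps)         # L_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ddim_sample(geo_unet, c, shape, skip=5):
    """Deterministic DDIM-style sampling from white noise, visiting every `skip`-th step."""
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    steps = list(range(T - 1, -1, -skip))
    for i, t in enumerate(steps):
        eps_hat = geo_unet(x, t, c)
        ab = alpha_bars[t]
        x0_hat = (x - (1 - ab).sqrt() * eps_hat) / ab.sqrt()   # estimate of x_0
        t_prev = steps[i + 1] if i + 1 < len(steps) else -1
        ab_prev = alpha_bars[t_prev] if t_prev >= 0 else torch.tensor(1.0)
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps_hat
    return x
```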
5. Empirical Evaluation and Performance
Experiments were conducted on trajectory data from Chengdu (5.7M trips), Xi’an (3.0M), and Porto (1.7M). Data was preprocessed to standardize trajectory lengths (filtering, interpolation, truncation) and to extract trip attributes.
- Evaluation Metrics: All metrics use Jensen–Shannon divergence (JSD) to compare generated and real data distributions (a minimal JSD sketch follows this list):
- Density error: spatial coverage in gridded space.
- Trip error: distribution of start/end points.
- Length error: trajectory distance distribution.
- Baselines: VAE, TrajGAN, DP-TrajGAN, DiffWave, and DiffTraj, as well as ControlTraj variants without conditioning or with a vanilla autoencoder in place of RoadMAE.
- Results (Chengdu):
- Density error: DiffTraj 0.0066 vs. ControlTraj 0.0039.
- Trip error: DiffTraj 0.0143 vs. ControlTraj 0.0106.
- Length error: DiffTraj 0.0174 vs. ControlTraj 0.0117.
- Downstream Utility: Generated and real data yield less than 5% difference in RMSE/MAE/MAPE for ASTGCN-, GWNet-, MTGNN-, and DCRNN-based traffic-flow forecasting.
- Controllability: When supplied a prescribed route (sequence of road segments), ControlTraj/GeoUNet strictly follows the intended topology, outperforming unconditional diffusion models.
- Generalizability: GeoUNet exhibits strong zero-shot transfer: training on Chengdu and testing on Xi’an yields a density error of 0.0171 (ControlTraj), compared to 0.0806 (DiffTraj) and 0.0544 (ControlTraj-AE).
- Qualitative Outputs: Visualization includes geo-plots of trajectories, rush-hour and trip-volume heatmaps, and assessments of RoadMAE masking impacts (0–75%).
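As a concrete illustration of the JSD-based metrics above, the sketch below computes a density error between gridded spatial distributions of real and generated points. The grid resolution and smoothing constant are assumptions for illustration, not the paper's settings.

```python
# Minimal sketch of a JSD-based density error: grid the space, histogram real
# and generated trajectory points, and compare the two distributions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def density_error(real_pts, gen_pts, bins=64):
    """JSD between gridded spatial densities of real vs. generated points.

    real_pts, gen_pts: arrays of shape (N, 2) with (lon, lat) coordinates.
    """
    lo = np.minimum(real_pts.min(0), gen_pts.min(0))
    hi = np.maximum(real_pts.max(0), gen_pts.max(0))
    edges = [np.linspace(lo[d], hi[d], bins + 1) for d in range(2)]
    h_real, _, _ = np.histogram2d(real_pts[:, 0], real_pts[:, 1], bins=edges)
    h_gen, _, _ = np.histogram2d(gen_pts[:, 0], gen_pts[:, 1], bins=edges)
    p = h_real.ravel() + 1e-12                           # smoothing to avoid empty cells
    q = h_gen.ravel() + 1e-12
    return jensenshannon(p, q) ** 2                      # squared distance = divergence
```

The trip and length errors follow the same pattern, with histograms taken over start/end points and over per-trajectory travel distances, respectively.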
6. Significance, Limitations, and Context
GeoUNet advances trajectory data synthesis by melding deep convolutional denoisers with explicit topology- and attribute-based conditioning, enabled by architectural innovations in geo-attention and by leveraging a robust transformer-based autoencoder for fine-grained topology embedding. Its ability to tightly control generated outcomes with respect to specified routes, attributes, and road network context surpasses prior approaches that lack such integrated conditioning or suffer from degraded fidelity and transferability in novel geographic environments.
A plausible implication is that GeoUNet’s method of indirect topology injection (via cross-attention rather than explicit Laplacian regularization) offers superior scalability and generalization, though it is possible that explicit relational constraints might be preferred for certain graph-structured domains outside urban mobility.
GeoUNet currently relies on frozen RoadMAE embeddings, which constrains adaptability to evolving road networks or dynamic topologies; end-to-end fine-tuning or joint pretraining strategies may be explored in future work. Finally, while the current attention mechanisms encode descriptive context, they may not enforce strict physical feasibility constraints (e.g., preventing illegal trajectory transitions); this is an area for additional research if stricter guarantees are necessary (Zhu et al., 2024).