GeoUNet: Topology-Aware Trajectory Synthesis
- GeoUNet is a UNet architecture that leverages geo-aware attention and RoadMAE embeddings to condition trajectory synthesis on road topology.
- The network combines multi-scale convolution, residual blocks, and cross-attention to ensure generated trajectories obey real-world geographic constraints.
- Empirical results show that GeoUNet outperforms baseline models on density, trip, and length error metrics, demonstrating superior fidelity and controllability.
GeoUNet is a UNet-shaped denoiser network tailored for conditional diffusion-based generation of geographic trajectories. It is the core component of ControlTraj, a controllable trajectory synthesis framework that incorporates road network topology and trip attributes to generate high-fidelity, human-directed trajectory data. GeoUNet integrates multi-scale convolutional features, geo-aware attention mechanisms, and residual connections, conditioned on embeddings of road topology (learned by a masked autoencoder, RoadMAE) and trip attributes, to guide reverse diffusion sampling and ensure that synthesized trajectories obey real-world geographic and topological constraints (Zhu et al., 2024).
1. Architecture of GeoUNet
GeoUNet adopts a symmetric UNet architecture characterized by a down-sampling path (encoder) and an up-sampling path (decoder), each comprising four hierarchical blocks:
- Down-Sampling Path: Consists of 4 Geo-Down blocks. Each block implements:
- Two ResNet sub-blocks (with GroupNorm, convolution, nonlinearity, and skip connection);
- Geo-self-attention and geo-cross-attention layers for feature fusion;
- Max-pooling for resolution reduction.
- Up-Sampling Path: Comprises 4 Geo-Up blocks. Each block performs:
- Nearest-neighbor or linear upsampling;
- Two ResNet sub-blocks;
- Geo-self and geo-cross-attention;
- Skip connections from the corresponding down-sampling block, preserving multi-scale context.
- Channel Progression: The channel dimension is set per encoder block as a multiple of a base width $C$ (with $C$ typically 32 or 64) and is mirrored during decoding.
GeoUNet’s distinctive feature is geo-attention fusion at every block. For each block, the feature map $h$ is updated via combined geo-self-attention (intra-feature) and geo-cross-attention (interacting with the control vector $c$). The control vector concatenates the RoadMAE topological embedding $e_{\text{road}}$ and the Wide-and-Deep trip-attribute embedding $e_{\text{attr}}$:
$$c = [\, e_{\text{road}} \,\|\, e_{\text{attr}} \,].$$
The resulting attention outputs are computed as
$$\mathrm{SelfAttn}(h) = \mathrm{softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d}}\right) V_s, \qquad \mathrm{CrossAttn}(h, c) = \mathrm{softmax}\!\left(\frac{Q_x K_x^{\top}}{\sqrt{d}}\right) V_x,$$
where $Q_s, K_s, V_s$ are the self-attention projections of $h$, and $Q_x$ is projected from $h$ while $K_x, V_x$ are the cross-attention projections of $c$. This hierarchical fusion enables multi-scale contextual reasoning and direct topology injection at every resolution.
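A minimal PyTorch sketch of this geo-attention fusion is given below. It is not the authors' implementation: the class name, GroupNorm group count, head count, and tensor layout are assumptions; only the pattern (self-attention over the trajectory feature sequence, cross-attention against the control tokens, residual connection) follows the description above.

```python
# Illustrative sketch (not the authors' code) of one geo-attention fusion step:
# self-attention over features plus cross-attention to c = [e_road ; e_attr].
import torch
import torch.nn as nn

class GeoAttentionBlock(nn.Module):
    def __init__(self, channels: int, cond_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)
        # Geo-self-attention over the length-wise feature sequence.
        self.self_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Geo-cross-attention: queries from features, keys/values from the control tokens.
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.cond_proj = nn.Linear(cond_dim, channels)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, C, L) convolutional feature map; c: (B, N_cond, cond_dim) control tokens.
        x = self.norm1(h).transpose(1, 2)            # (B, L, C)
        x = x + self.self_attn(x, x, x)[0]           # intra-feature fusion
        ctx = self.cond_proj(c)                      # (B, N_cond, C)
        x = x + self.cross_attn(x, ctx, ctx)[0]      # topology/attribute injection
        x = x.transpose(1, 2)                        # back to (B, C, L)
        return h + self.norm2(x)                     # residual connection

# Example shapes (illustrative): 64 channels, 128-dim condition, length-200 features.
block = GeoAttentionBlock(channels=64, cond_dim=128)
out = block(torch.randn(8, 64, 200), torch.randn(8, 16, 128))   # -> (8, 64, 200)
```

Per the block descriptions above, a Geo-Down block would wrap two ResNet sub-blocks around this fusion and follow it with max-pooling, while a Geo-Up block would precede it with upsampling and a skip connection.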
2. Topological Context Encoding via RoadMAE
GeoUNet leverages road network information through embeddings generated by RoadMAE, a masked transformer autoencoder trained on sequences of raw GPS points representing road segments. The processing pipeline includes:
- Patchifying: Each road segment, given as a sequence of raw GPS points $s = (p_1, \dots, p_L)$, is partitioned into $N = \lceil L / P \rceil$ patches for a fixed patch size $P$.
- Random Masking: A binary mask with ratio $\gamma$ is applied, masking out $\lfloor \gamma N \rfloor$ patches during training for self-supervised reconstruction.
- Transformer Encoder/Decoder: The encoder extracts $e_{\text{road}}$ as the fine-grained topological embedding, while the decoder reconstructs the masked input points to minimize the loss
$$\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{p}_i - p_i \rVert_2^2,$$
where $\mathcal{M}$ indexes the masked points.
The resulting embedding $e_{\text{road}}$ (frozen at generation time) encapsulates road-segment connectivity and geometry, enabling topology-aware trajectory synthesis in GeoUNet without requiring explicit adjacency matrices or Laplacian regularization.
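The patchify-and-mask preprocessing can be illustrated with a short sketch. The patch size, mask ratio, and lon/lat layout below are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of RoadMAE-style preprocessing: patchify a road segment's GPS
# points and randomly mask a fraction of the patches for reconstruction training.
import torch

def patchify(points: torch.Tensor, patch_size: int) -> torch.Tensor:
    """points: (L, 2) lon/lat sequence -> (N, patch_size * 2) patch tokens."""
    L = points.shape[0] - points.shape[0] % patch_size   # drop the ragged tail
    return points[:L].reshape(-1, patch_size * 2)

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.5):
    """Return (visible_patches, mask) where mask[i] is True for masked patches."""
    n = patches.shape[0]
    n_masked = int(mask_ratio * n)
    perm = torch.randperm(n)
    mask = torch.zeros(n, dtype=torch.bool)
    mask[perm[:n_masked]] = True
    return patches[~mask], mask

# Example: a 120-point segment, patch size 8, 50% masking.
segment = torch.randn(120, 2)
patches = patchify(segment, patch_size=8)      # (15, 16)
visible, mask = random_mask(patches, 0.5)      # the encoder sees only `visible`
```

The transformer encoder consumes only the visible patches; the decoder is then asked to reconstruct the masked ones, which is what drives the reconstruction loss above.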
3. Conditional Diffusion Process
GeoUNet is employed as the denoising network in a conditional diffusion model for trajectory generation. The process is formulated as follows:
- Forward Process (Noising): For a real trajectory $x_0$, noise is added over $T$ steps according to
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right).$$
By reparameterization, with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$
- Reverse Process (Denoising, Conditioned on $c$):
$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t, c),\, \sigma_t^2 \mathbf{I}\right),$$
where the mean is parameterized as
$$\mu_\theta(x_t, t, c) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t, c)\right)$$
with $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. The noise $\epsilon_\theta(x_t, t, c)$ is estimated by GeoUNet.
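These updates follow standard DDPM algebra; the hedged sketch below implements both directions. Here `geo_unet` stands in for the GeoUNet noise estimator (its signature is an assumption), and the step count and $\beta$ range are illustrative placeholders rather than the paper's exact schedule.

```python
# Hedged sketch of the conditional diffusion equations above.
import torch

T = 500                                         # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.05, T)           # linear beta schedule (illustrative values)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    ab = alpha_bars[t].view(-1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

@torch.no_grad()
def p_sample_step(geo_unet, x_t, t, c):
    """One reverse step x_t -> x_{t-1}, conditioned on the control vector c."""
    eps_hat = geo_unet(x_t, t, c)                        # predicted noise
    a, ab = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (x_t - betas[t] / (1 - ab).sqrt() * eps_hat) / a.sqrt()
    if t == 0:
        return mean
    var = (1 - ab_prev) / (1 - ab) * betas[t]            # sigma_t^2 from the text
    return mean + var.sqrt() * torch.randn_like(x_t)
```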
4. Training and Inference Procedures
The training of GeoUNet in the ControlTraj framework proceeds as:
- RoadMAE Pretraining: The RoadMAE autoencoder is pretrained with the reconstruction loss $\mathcal{L}_{\text{MAE}}$, and its weights are then frozen for downstream trajectory synthesis.
- Diffusion Model Training: For each sampled real trajectory $x_0$ and random time step $t \sim \mathrm{Uniform}\{1, \dots, T\}$:
- Compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
- Use GeoUNet to predict $\hat{\epsilon} = \epsilon_\theta(x_t, t, c)$.
- Minimize the mean squared error $\mathcal{L}_{\text{diff}} = \mathbb{E}_{x_0, \epsilon, t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\big]$.
- Total Loss: When trained end-to-end, the total loss combines the diffusion and RoadMAE terms, $\mathcal{L} = \mathcal{L}_{\text{diff}} + \mathcal{L}_{\text{MAE}}$, though typically only $\mathcal{L}_{\text{diff}}$ is used during GeoUNet training because RoadMAE is frozen.
- Hyperparameters: Reported settings include a batch size of 1024, a linear noise ($\beta$) schedule over the diffusion steps, and skip steps = 5 for DDIM acceleration; the specific learning rate, step count, and embedding dimension are given in Zhu et al. (2024).
- Sampling: At inference, the embeddings $e_{\text{road}}$ and $e_{\text{attr}}$ are computed from the road segments and trip attributes, concatenated as $c = [\, e_{\text{road}} \,\|\, e_{\text{attr}} \,]$, and provided to GeoUNet, which iteratively denoises sampled white noise $x_T \sim \mathcal{N}(0, \mathbf{I})$. DDIM acceleration is supported by skipping steps.
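Putting the pieces together, the sketch below shows one training step and a DDIM-style accelerated sampling loop. It reuses `T`, `alpha_bars`, and `q_sample` from the previous sketch; `geo_unet` remains a stand-in with an assumed signature, and the skip interval is only an example value.

```python
# Illustrative training step and accelerated sampling loop (not the authors' code).
import torch
import torch.nn.functional as F

def training_step(geo_unet, x0, c, optimizer):
    """One diffusion training step: predict the injected noise and minimize MSE."""
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                       # random time steps
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)                          # forward noising (defined above)
    loss = F.mse_loss(geo_unet(x_t, t, c), eps)         # L_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ddim_sample(geo_unet, c, shape, skip=5):
    """Deterministic DDIM-style sampling from white noise, visiting every `skip`-th step."""
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    steps = list(range(T - 1, -1, -skip))
    for i, t in enumerate(steps):
        eps_hat = geo_unet(x, t, c)
        ab = alpha_bars[t]
        x0_hat = (x - (1 - ab).sqrt() * eps_hat) / ab.sqrt()   # estimate of x_0
        t_prev = steps[i + 1] if i + 1 < len(steps) else -1
        ab_prev = alpha_bars[t_prev] if t_prev >= 0 else torch.tensor(1.0)
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps_hat
    return x
```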
5. Empirical Evaluation and Performance
Experiments were conducted on trajectory data from Chengdu (5.7M trips), Xi’an (3.0M), and Porto (1.7M). Data was preprocessed to standardize trajectory lengths (filtering, interpolation, truncation) and to extract trip attributes.
- Evaluation Metrics: All metrics use Jensen–Shannon divergence (JSD) to compare generated and real data distributions (a minimal JSD sketch follows this list):
- Density error: spatial coverage in gridded space.
- Trip error: distribution of start/end points.
- Length error: trajectory distance distribution.
- Baselines: VAE, TrajGAN, DP-TrajGAN, DiffWave, and DiffTraj, as well as ControlTraj variants without conditioning or with a vanilla autoencoder in place of RoadMAE.
- Results (Chengdu):
- Density error: DiffTraj 0.0066 vs. ControlTraj 0.0039.
- Trip error: DiffTraj 0.0143 vs. ControlTraj 0.0106.
- Length error: DiffTraj 0.0174 vs. ControlTraj 0.0117.
- Downstream Utility: Generated and real data yield less than 5% difference in RMSE/MAE/MAPE for ASTGCN-, GWNet-, MTGNN-, and DCRNN-based traffic-flow forecasting.
- Controllability: When supplied a prescribed route (sequence of road segments), ControlTraj/GeoUNet strictly follows the intended topology, outperforming unconditional diffusion models.
- Generalizability: GeoUNet exhibits strong zero-shot transfer: training on Chengdu and testing on Xi’an yields a density error of 0.0171 (ControlTraj), compared to 0.0806 (DiffTraj) and 0.0544 (ControlTraj-AE).
- Qualitative Outputs: Visualization includes geo-plots of trajectories, rush-hour and trip-volume heatmaps, and assessments of RoadMAE masking impacts (0–75%).
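As a concrete illustration of the JSD-based metrics above, the sketch below computes a density error between gridded spatial distributions of real and generated points. The grid resolution and smoothing constant are assumptions for illustration, not the paper's settings.

```python
# Minimal sketch of a JSD-based density error: grid the space, histogram real
# and generated trajectory points, and compare the two distributions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def density_error(real_pts, gen_pts, bins=64):
    """JSD between gridded spatial densities of real vs. generated points.

    real_pts, gen_pts: arrays of shape (N, 2) with (lon, lat) coordinates.
    """
    lo = np.minimum(real_pts.min(0), gen_pts.min(0))
    hi = np.maximum(real_pts.max(0), gen_pts.max(0))
    edges = [np.linspace(lo[d], hi[d], bins + 1) for d in range(2)]
    h_real, _, _ = np.histogram2d(real_pts[:, 0], real_pts[:, 1], bins=edges)
    h_gen, _, _ = np.histogram2d(gen_pts[:, 0], gen_pts[:, 1], bins=edges)
    p = h_real.ravel() + 1e-12                           # smoothing to avoid empty cells
    q = h_gen.ravel() + 1e-12
    return jensenshannon(p, q) ** 2                      # squared distance = divergence
```

The trip and length errors follow the same pattern, with histograms taken over start/end points and over per-trajectory travel distances, respectively.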
6. Significance, Limitations, and Context
GeoUNet advances trajectory data synthesis by melding deep convolutional denoisers with explicit topology- and attribute-based conditioning, enabled by architectural innovations in geo-attention and by leveraging a robust transformer-based autoencoder for fine-grained topology embedding. Its ability to tightly control generated outcomes with respect to specified routes, attributes, and road network context surpasses prior approaches that lack such integrated conditioning or suffer from degraded fidelity and transferability in novel geographic environments.
A plausible implication is that GeoUNet’s method of indirect topology injection (via cross-attention rather than explicit Laplacian regularization) offers superior scalability and generalization, though it is possible that explicit relational constraints might be preferred for certain graph-structured domains outside urban mobility.
GeoUNet currently relies on frozen RoadMAE embeddings, which constrains adaptability to evolving road networks or dynamic topologies; end-to-end fine-tuning or joint pretraining strategies may be explored in future work. Finally, while the current attention mechanisms encode descriptive context, they may not enforce strict physical feasibility constraints (e.g., preventing illegal trajectory transitions); this is an area for additional research if stricter guarantees are necessary (Zhu et al., 2024).