diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories

Published 11 May 2026 in cs.AI and cs.CR | (2605.10647v1)

Abstract: Trajectories are now valuable information for a wide range of applications. However, they are also inherently sensitive, as they contain highly personal information about individuals. To address this challenge, synthesizing mobility trajectories has emerged as a promising way to leverage mobility information while preserving privacy. State-of-the-art models often rely on the false assumption that generative models are implicitly private, and fail to provide privacy guarantees while preserving trajectory utility. Here, we introduce diffGHOST, a conditional diffusion model based on latent space segmentation, designed to answer this challenge. This paper proposes a methodology that identifies and mitigates memorization of critical samples using condition segments of a learned latent space.


Authors (5)

Summary

The paper introduces a novel framework combining VAE-based latent segmentation with conditional diffusion to generate synthetic trajectories that balance utility and privacy.
It demonstrates significant improvements over existing models by achieving lower errors in preserving complex spatiotemporal dynamics and realistic movement patterns.
It employs segment-level privacy audits with localized Laplacian noise, reducing flagged memorization to zero under the adopted test while maintaining high data fidelity.

diffGHOST: A Diffusion-Based Conditional Generation Framework for Privacy-Preserving Mobility Trajectories

Introduction and Motivation

The paper "diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories" (2605.10647) introduces a method for synthetic trajectory generation that explicitly targets the dual challenges of utility and privacy. Mobility traces are invaluable for many applications, including urban planning and epidemiology, but are inherently sensitive, with re-identification and misuse risks even after dataset anonymization. Existing generative models, including diffusion-based approaches such as DiffTraj, have demonstrated utility but are susceptible to memorization and privacy attacks, with no formal guarantees against information leakage. diffGHOST proposes an architecture and framework to synthesize trajectories with segment-level non-memorization guarantees while retaining fine-grained utility of the generated data.

Model Architecture and Methodology

The framework is structured in four principal stages: (1) VAE-based latent encoding, (2) latent space segmentation, (3) conditional diffusion-based generation, and (4) post-hoc memorization auditing and mitigation via local noise addition.

The first component ($E_1$) is a standard variational autoencoder (VAE), implemented as a 1D CNN encoder-decoder, trained to minimize reconstruction error and enforce a prior on the latent representations through a KL divergence regularization. Upon convergence, the learned latent $\boldsymbol{\mu}(T_u)$ serves as a compact structural embedding of each trajectory.
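As a rough illustration of the objective described above, the sketch below computes a per-sample VAE loss: a reconstruction term plus the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The choice of mean squared error for reconstruction, the `beta` weight, and the function name `vae_loss` are assumptions made for illustration; the paper's exact loss may differ.

```python
import math

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus beta * KL(q(z|x) || N(0, I)).

    x, x_hat     : flattened input and reconstruction (same length)
    mu, log_var  : mean and log-variance of the diagonal Gaussian posterior
    beta         : hypothetical weight on the KL term (assumption)
    """
    # Mean squared reconstruction error (assumed form of the reconstruction term).
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # Closed-form KL divergence for a diagonal Gaussian against N(0, I).
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl
```

With a perfect reconstruction and a posterior equal to the prior (`mu = 0`, `log_var = 0`), both terms vanish and the loss is zero.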

Following encoding, the model clusters the latent representations using a KDTree ($E_2$), enabling semantic segmentation of trajectories into regions of geometric and behavioral homogeneity.
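One way to picture KD-tree segmentation is to split the latent vectors recursively at the median of one coordinate and treat each leaf as a segment. The sketch below does this in plain Python. Treating leaves as segments, cycling through axes by depth, and the `leaf_size` parameter are all assumptions; the summary only states that a KDTree is used.

```python
def kd_segments(points, leaf_size=2):
    """Assign each latent vector a segment id via recursive median splits.

    points    : list of equal-length coordinate tuples (latent means)
    leaf_size : maximum number of points per segment (hypothetical parameter)
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    labels = [None] * len(points)
    next_id = 0

    def split(indices, depth):
        nonlocal next_id
        if len(indices) <= leaf_size:
            # This leaf becomes one segment.
            for i in indices:
                labels[i] = next_id
            next_id += 1
            return
        axis = depth % len(points[0])  # cycle through latent dimensions
        ordered = sorted(indices, key=lambda i: points[i][axis])
        mid = len(ordered) // 2
        split(ordered[:mid], depth + 1)
        split(ordered[mid:], depth + 1)

    split(list(range(len(points))), 0)
    return labels
```

For example, `kd_segments([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)])` places the two points near the origin in one segment and the two points near `(5, 5)` in another.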

Figure 1: Trajectory projection workflow through VAE latent encoding and KDTree-based segmentation in diffGHOST.

The core generative module ($E_3$) is a conditional diffusion model. Unlike prior work that conditions on explicit attributes (e.g., trip start/stop), diffGHOST conditions generation on the segment identifier assigned in the latent space. The architecture is a 1D UNet with self-attention and residual connections, using FiLM layers for hierarchical injection of both timestep and latent segment conditioning.
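FiLM (feature-wise linear modulation) injects conditioning by scaling and shifting each feature channel. The minimal sketch below applies a per-channel scale `gamma` and shift `beta` to a channels-by-length feature map. In a real network these two vectors would be produced by a small learned projection of the combined timestep and segment embedding; that projection is omitted here, and the function name `film` is hypothetical.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation to a (channels x length) feature map.

    features : list of channels, each a list of activations along the sequence
    gamma    : per-channel scale (assumed to come from the conditioning embedding)
    beta     : per-channel shift (assumed to come from the conditioning embedding)
    """
    return [
        [g * v + b for v in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]
```

Setting `gamma` to all ones and `beta` to all zeros leaves the features unchanged, which is a useful sanity check when wiring the layer into a block.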

Figure 2: Overview of the $E_3$ trajectory diffusion model architecture with segment-level conditioning and guidance.

Classifier-Free Guidance (CFG) is integrated, controlling the trade-off between fidelity to the segment condition and sample diversity; high guidance scale enforces strict adherence to segment properties.
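The guidance trade-off can be made concrete with the usual classifier-free guidance combination of two noise predictions, one made with the segment condition and one without. The sketch below uses the form `eps_uncond + scale * (eps_cond - eps_uncond)`; other parameterizations of the guidance scale exist, and the one used in the paper is not stated here, so this form is an assumption.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 0 ignores the segment condition, scale = 1 reproduces the
    conditional prediction, and scale > 1 extrapolates past it, which
    pushes samples to follow the condition more strictly.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A scale above 1 amplifies the difference between the two predictions, which matches the description of higher guidance trading diversity for adherence to the segment.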

The critical privacy guarantee ($E_4$) is based on segment-wise post-generation analysis for memorization risk. For each segment, synthetic traces are compared to real traces in latent proximity using Fréchet distance. If a trajectory fails a nearest-neighbor distance ratio criterion, which signals potential memorization of training data, it is flagged. For segments with flagged trajectories, Laplacian noise is added post-hoc within a theoretically justified radius, ensuring $k$-anonymity and indistinguishability within the segment, while minimally impacting utility for others.
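The audit step can be sketched with a discrete Fréchet distance and a nearest-neighbor distance ratio test. The version below flags a synthetic trajectory when its closest real trajectory in the segment is much closer than the second closest. This particular ratio definition, the `ratio_threshold` value, and the function names are assumptions for illustration, since the exact criterion is not specified in this summary.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as point lists."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]

def flag_memorized(synthetic, real_segment, ratio_threshold=0.5):
    """Flag a synthetic trajectory that sits unusually close to one real trajectory.

    ratio_threshold is a hypothetical cut-off on d_nearest / d_second_nearest.
    """
    dists = sorted(discrete_frechet(synthetic, r) for r in real_segment)
    if len(dists) < 2:
        raise ValueError("need at least two real trajectories in the segment")
    nearest, second = dists[0], dists[1]
    if second == 0.0:
        return True  # coincides with at least two real trajectories
    return nearest / second < ratio_threshold
```

A synthetic trajectory that lies halfway between two real ones yields a ratio near 1 and is not flagged, whereas a near copy of a single real trajectory yields a ratio near 0 and is flagged.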

Empirical Evaluation and Results

Utility Analysis

Experiments are conducted on both a controlled procedural dataset and the real-world GeoLife GPS corpus. Visual and quantitative evaluations demonstrate that diffGHOST consistently generates realistic spatial and temporal patterns, outperforming DiffTraj, VAE, and standard Gaussian perturbation baselines.

Figure 3: Ground-truth examples for comparative visual evaluation between synthetic and real trajectory distributions.

Across metrics such as density error, pattern (OD flow) score, average speed preservation, map adherence, G-rank for point popularity, transition probabilities, and traffic flow prediction, diffGHOST achieves either the best or the second-best score. Improvements are especially pronounced in the preservation of complex movement dynamics (e.g., pointwise transition probabilities and traffic flow), where diffGHOST yields order-of-magnitude lower errors than baselines. The integration of CFG provides a tunable lever: increasing guidance enforces stricter adherence to the segment distribution at the expense of diversity, a property consistently observed across both datasets.
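To give a sense of how a density-style metric can be computed, the sketch below bins trajectory points into a regular grid and compares the real and synthetic histograms with the Jensen-Shannon divergence (base 2, so values lie in [0, 1]). Whether the paper's density error is defined exactly this way is an assumption; the grid cell size and the function names are hypothetical.

```python
import math
from collections import Counter

def grid_histogram(trajectories, cell):
    """Normalized visit counts of trajectory points over a square grid."""
    counts = Counter()
    for traj in trajectories:
        for x, y in traj:
            counts[(math.floor(x / cell), math.floor(y / cell))] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no points to bin")
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Identical point sets give a divergence of 0, and point sets that never share a grid cell give 1, so lower values indicate closer spatial density.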

Privacy Evaluation

A salient contribution is the empirical characterization and mitigation of memorization. By varying the generative upsampling factor per segment, the study shows that as many as 18% of sampled segments on the procedural dataset and 51% on GeoLife can exhibit memorization risk, as detected by the nearest-neighbor test.

Figure 4: Segment-wise count of synthetic samples flagged as memorized, with privacy thresholds and risk flagging across upsampling values.

For all flagged segments, after localized Laplacian noise is applied within the prescribed radius (derived from local Lipschitz constants and segment geometry), the memorization rate drops to zero across all tested conditions: after mitigation, no condition remains at risk of memorization under the adopted test. The noise is kept minimal and tightly bounded, which limits degradation in utility. Utility metrics show little or no significant drop even after repeated post-hoc noise application, provided the noise radius is chosen judiciously.
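The mitigation step can be illustrated by adding Laplace noise to each point of a flagged trajectory and keeping the displacement inside a given radius. In the sketch below, the Laplace sample is drawn as the difference of two exponential variates, and displacements larger than `radius` are rescaled onto the radius. The rescaling rule, the `scale` parameter, and the function names are assumptions; how the paper bounds the noise and derives the radius from Lipschitz constants is not reproduced here, and clipping of this kind alters the noise distribution.

```python
import math
import random

def laplace(rng, scale):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_within_radius(trajectory, scale, radius, seed=None):
    """Add per-point Laplace noise to 2D points, bounded by `radius`.

    scale  : Laplace scale parameter, must be positive (hypothetical)
    radius : maximum allowed displacement of any point (hypothetical)
    """
    rng = random.Random(seed)
    noisy = []
    for x, y in trajectory:
        dx, dy = laplace(rng, scale), laplace(rng, scale)
        norm = math.hypot(dx, dy)
        if norm > radius:
            # Rescale the displacement so it lands on the radius (assumed rule).
            dx, dy = dx * radius / norm, dy * radius / norm
        noisy.append((x + dx, y + dy))
    return noisy
```

After perturbation, a flagged trajectory would be re-audited with the same memorization test to check that it no longer fails the criterion.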

Implications, Theoretical Considerations, and Future Directions

diffGHOST offers an operational, segment-level approach to balancing privacy and utility that goes beyond conventional differentially private data synthesis (which often sacrifices crucial micro-structure for theoretical guarantees). Its per-segment privacy audit and mitigation framework is more flexible and precise than global approaches, enabling local risk assessment and targeted intervention. The method is applicable to high-dimensional, complex data domains like spatiotemporal trajectories where global DP is infeasible.

The explicit invocation of latent space segmentation and Lipschitz-constrained post-processing is theoretically sound, and the empirical results indicate that memorization vulnerability is highly segment-dependent—suggesting that future evaluations of generative model privacy should be conducted locally, not only globally.

The approach has several practical implications for privacy-preserving mobility data publishing, model auditing, and adaptive privacy control. Potential extensions include integration with formal differential privacy mechanisms, application to multimodal or sequence-plus-categorical trajectory data, and adaptation for other high-dimensional, structured synthetic data domains.

Integration of more advanced risk analyses (e.g., membership inference, attribute inference) and adaptation of the noise calibration protocol following the latest $k$-anonymity or individual-DP methodologies are promising next steps.

Conclusion

diffGHOST presents a principled, multi-stage synthetic trajectory generation framework that combines VAE-based latent segmentation, conditional diffusion modeling, and segment-localized privacy mitigation. Empirical findings attest to robust performance in both utility and privacy, substantially extending the state of the art in privacy-aware mobility data synthesis. The framework sets a new methodological precedent for fine-grained, distribution-aware generative modeling with precise privacy guarantees.
