Controlled Spatiotemporal Infilling Methods

Updated 9 May 2026

Controlled spatiotemporal infilling is the task of synthesizing missing data in time-varying, spatially structured datasets while preserving observed context and enforcing domain-specific constraints.
Recent advances leverage conditional diffusion models, VAEs, transformers, and partial convolutional architectures to guide and control the infilling process with high fidelity.
Empirical benchmarks in areas like urban traffic, air quality, video generation, and motion infilling demonstrate significant improvements in accuracy and perceptual quality using tailored training protocols.

Controlled spatiotemporal infilling is the task of completing or synthesizing missing or masked segments in time-varying, spatially structured datasets or representations, such that the filled regions are consistent with contextual constraints in both space and time and, where applicable, can be targeted or guided by user-supplied conditions. Applications span sensor network imputation, dynamic system forecasting, video generation, NeRF editing, human motion completion, and mobile trajectory synthesis. Recent work reports state-of-the-art results by adopting conditional generative models, structured neural architectures, and explicit constraint mechanisms to deliver controllable, high-fidelity infilling across diverse modalities.

1. Theoretical Foundations and Problem Definition

Controlled spatiotemporal infilling generalizes classical imputation and trajectory interpolation by incorporating user-specified, algorithmically enforced constraints on the infill region. Formally, let $\mathbf X \in \mathbb{R}^{N \times L}$ denote the spatiotemporal data matrix (e.g., $N$ sensors over $L$ time steps), together with a mask $\mathbf M \in \{0,1\}^{N \times L}$ with $\mathbf M_{i,\ell} = 1$ if entry $(i,\ell)$ is observed. The task is to generate $\hat{\mathbf X}$ such that $\hat{\mathbf X}_{i,\ell} = \mathbf X_{i,\ell}$ for all $(i,\ell)$ with $\mathbf M_{i,\ell}=1$, and $\hat{\mathbf X}_{i,\ell}$ is synthesized for $\mathbf M_{i,\ell}=0$, subject to the following:

Consistency with observed spatial and temporal context.
Adherence to domain-specific constraints (e.g., global motion boundary, geometry, physics).
Optional conditioning on auxiliary information or prompts for content control.

Stochastic approaches aim to sample from the conditional distribution $p(\hat{\mathbf X} \mid \mathbf M \odot \mathbf X, \mathbf M, \mathbf c)$, where $\mathbf c$ encodes context or control variables. For discrete event sequences, the task extends to infilling both the location and temporal (or other) attributes across missing sequence spans with joint spatial-temporal dependencies and external controls.
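The hard requirement in the definition above (observed entries are copied, masked entries are synthesized) can be sketched in a few lines of NumPy. This is a minimal illustration, not any cited model: `compose_infill` is a hypothetical helper, and `X_gen` stands in for one sample drawn from whatever conditional generator is used.

```python
import numpy as np

def compose_infill(x_obs, mask, x_gen):
    """Return X_hat: observed entries where mask == 1, generated values elsewhere."""
    return np.where(mask.astype(bool), x_obs, x_gen)

rng = np.random.default_rng(0)
N, L = 4, 6                                   # 4 sensors, 6 time steps
X = rng.normal(size=(N, L))                   # full data matrix (toy values)
M = (rng.random((N, L)) > 0.3).astype(int)    # 1 = observed, 0 = missing
X_gen = rng.normal(size=(N, L))               # stand-in for one generative sample

X_hat = compose_infill(X, M, X_gen)
```

Using `np.where` rather than `M * X + (1 - M) * X_gen` keeps the result well defined even when missing entries of `X` are stored as NaN.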

2. Algorithmic Architectures and Conditioning Mechanisms

Recent developments have adopted deep generative paradigms, leveraging diffusion models, VAEs, transformer architectures, partial convolutions, and spatiotemporal implicit neural representations (INRs), often incorporating problem- or domain-specific conditioning strategies.

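Conditional Diffusion Framework: The CoFILL model (He et al., 8 Jun 2025) formulates infilling as a conditional denoising diffusion process. Each reverse (denoising) step is conditioned on features $\mathbf h$ derived from both observed data and graph structure. In standard DDPM notation with noise schedule $\beta_t$:

Forward noising:

$q(\mathbf x_t \mid \mathbf x_{t-1}) = \mathcal N\big(\mathbf x_t;\ \sqrt{1-\beta_t}\,\mathbf x_{t-1},\ \beta_t \mathbf I\big)$

Reverse denoising:

$p_\theta(\mathbf x_{t-1} \mid \mathbf x_t, \mathbf h) = \mathcal N\big(\mathbf x_{t-1};\ \boldsymbol\mu_\theta(\mathbf x_t, t, \mathbf h),\ \sigma_t^2 \mathbf I\big)$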

Condition embedding: Fused via TCN (temporal), GCN (spatial), and DCT (frequency) streams, then cross-attention, yielding a context tensor supplied to all reverse steps.
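As a rough sketch of the forward and reverse steps named above, the following uses the standard DDPM parameterization (noise prediction, variance $\beta_t$ in the reverse step). The schedule values, the function names, and `toy_denoiser` are all assumptions for illustration; CoFILL's actual denoiser is a trained network that consumes the fused TCN/GCN/DCT context.

```python
import numpy as np

def make_schedule(T=50, beta_min=1e-4, beta_max=0.2):
    """Linear variance schedule (illustrative values, not taken from the paper)."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def q_sample(x0, t, alpha_bars, rng):
    """Closed-form forward noising: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def p_step(x_t, t, eps_hat, betas, alphas, alpha_bars, rng):
    """One reverse step given a noise estimate eps_hat from a conditional denoiser."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                                  # no noise is added at the last step
    return mean + np.sqrt(betas[t]) * rng.normal(size=x_t.shape)

def toy_denoiser(x_t, t, cond):
    """Hypothetical placeholder for eps_theta(x_t, t, h); a real model is trained."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.normal(size=(4, 6))
cond = np.zeros_like(x0)                             # stand-in for fused condition features

x = rng.normal(size=x0.shape)                        # start from Gaussian noise
for t in reversed(range(len(betas))):
    x = p_step(x, t, toy_denoiser(x, t, cond), betas, alphas, alpha_bars, rng)
```

With the placeholder denoiser the loop only demonstrates control flow. A sanity check of the algebra is that noising `x0` to step 0 and then calling `p_step` with the true noise returns `x0`.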

Conditional VAEs for 2D Region Infilling: In the CVAE model (Ribeiro et al., 2023), conditioning is implemented by stacking boundary frames and query frames with optional signed distance fields, enabling the decoder to control output at arbitrary time interpolants $t \in [0, 1]$.

Stochastic Dynamics for Video Infilling: SDVI (Xu et al., 2018) employs a bi-directional ConvLSTM for constraint propagation and a spatially-aware stochastic generator, enforcing path and endpoint consistency at every generation step.

Imprecisely Timed Keyframes: In the motion infilling model of Goel et al. (2 Mar 2025), a transformer-based diffusion network predicts both a global time-warp $\phi(t)$, parameterized as a cumulative sum of per-frame warp slopes, and local spatial residuals, guaranteeing that keyframe constraints are respected up to minor temporal alignment corrections.
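The cumulative-sum parameterization of the time-warp can be illustrated as follows. This is a sketch under stated assumptions: the softplus used to keep slopes positive and the rescaling that pins the final frame are choices made here for the example, and `time_warp` is a hypothetical name, not the authors' code.

```python
import numpy as np

def time_warp(raw_slopes):
    """Monotone warp over T = len(raw_slopes) + 1 frames, with phi[0] = 0 and phi[-1] = T - 1."""
    slopes = np.logaddexp(0.0, raw_slopes)             # softplus keeps every slope positive
    phi = np.concatenate(([0.0], np.cumsum(slopes)))   # cumulative sum of per-frame slopes
    return phi * (len(slopes) / phi[-1])               # rescale so clip length is preserved

rng = np.random.default_rng(0)
raw = rng.normal(scale=0.5, size=9)    # stand-in for network-predicted slopes (10 frames)
phi = time_warp(raw)

keyframe_frames = np.array([0, 3, 9])  # nominal (imprecise) keyframe indices
retimed = phi[keyframe_frames]         # where those keyframes land after warping
```

Because every slope is positive, the warp is strictly increasing, so keyframes can shift in time but never swap order. Equal slopes recover the identity warp.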

Spatiotemporal Partial Convolutional U-Nets: The 3D partial convolution model (Han et al., 2023) extends image inpainting to space-time histograms, updating binary masks after each layer and employing biased masking strategies during training to focus network capacity on high-density (dynamic) urban areas.
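The mask-update rule of a partial convolution can be sketched for a single-channel space-time volume. The code below follows the generic partial-convolution formulation (renormalize by the fraction of observed cells in each window, then mark a cell valid if its window saw any observed cell). It is not the cited network, and `partial_conv3d` is a hypothetical helper that assumes odd kernel sizes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def partial_conv3d(x, mask, kernel, bias=0.0):
    """Single-channel 3D partial convolution with zero padding (odd kernel sizes)."""
    k = kernel.shape
    pad = [(s // 2, s // 2) for s in k]
    xw = sliding_window_view(np.pad(x * mask, pad), k)   # (T, H, W, kt, kh, kw)
    mw = sliding_window_view(np.pad(mask, pad), k)
    valid = mw.sum(axis=(-3, -2, -1))                    # observed cells per window
    raw = (xw * kernel).sum(axis=(-3, -2, -1))           # response on observed cells only
    scale = kernel.size / np.maximum(valid, 1.0)         # renormalize for missing cells
    out = np.where(valid > 0, raw * scale + bias, 0.0)
    new_mask = (valid > 0).astype(mask.dtype)            # hole shrinks after each layer
    return out, new_mask

x = np.full((9, 9, 9), 2.0)            # constant toy volume (time, height, width)
mask = np.ones_like(x)
mask[2:7, 2:7, 2:7] = 0.0              # a 5x5x5 hole
kernel = np.full((3, 3, 3), 1.0 / 27)  # averaging kernel

out, new_mask = partial_conv3d(x, mask, kernel)
```

On this toy input the 5x5x5 hole shrinks to 3x3x3 after one layer, and every cell whose window touched observed data recovers the constant value, which is the effect of the renormalization.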

Implicit Neural Representations (INRs): Temporal and spatial continuity are encoded via a single MLP $f_\theta(\mathbf x, t)$ parameterizing the attenuation field in dynamic CT (Boulanger et al., 9 Oct 2025), with conditioning through learned high-frequency embeddings (INCODE) for task adaptation.
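To make the idea of a coordinate-based field concrete, here is a minimal, untrained coordinate MLP queried at continuous space-time points. It uses generic sinusoidal features and ReLU layers as an assumption; it does not reproduce INCODE or the cited dynamic CT pipeline, and all names are hypothetical.

```python
import numpy as np

def fourier_features(coords, n_freqs=4):
    """Map (P, D) coordinates to sin/cos features at octave-spaced frequencies."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi          # (F,)
    ang = coords[:, :, None] * freqs                     # (P, D, F)
    feats = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feats.reshape(len(coords), -1)                # (P, D * 2F)

def init_mlp(sizes, rng):
    """Random weights for a fully connected network with the given layer sizes."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, h):
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)                   # ReLU hidden layers
    W, b = params[-1]
    return h @ W + b                                     # linear output (field value)

rng = np.random.default_rng(0)
coords = rng.random((5, 3))                              # five (x, y, t) queries in [0, 1]^3
feats = fourier_features(coords)                         # 3 dims * 2 * 4 freqs = 24 features
params = init_mlp([feats.shape[1], 32, 32, 1], rng)
values = mlp_forward(params, feats)                      # one scalar per query point
```

The point of the sketch is the interface: the field can be evaluated at any continuous $(x, y, t)$, so infilling a missing time reduces to querying the fitted network there.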

Transformer-based Controlled Trajectory Generation: TrajGPT (Hsu et al., 2024) treats controlled spatiotemporal trajectory infilling as a sequence-to-sequence text infilling task, combining explicit geographic embedding (Space2Vec), temporal encoding (Time2Vec), and a multi-head transformer with GMM output heads for region and timestamp prediction, driven by autoregressive sampling under spatial or temporal constraints.
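A Gaussian-mixture output head is typically trained with the mixture negative log-likelihood. The sketch below computes that quantity for a scalar timestamp; it is a generic formulation with hypothetical names and toy parameter values, and how TrajGPT structures its heads in detail is not assumed here.

```python
import numpy as np

def gmm_nll(t, logits, means, log_stds):
    """Negative log-likelihood of scalar t under a 1D Gaussian mixture."""
    log_w = logits - np.logaddexp.reduce(logits)       # log-softmax mixture weights
    z = (t - means) / np.exp(log_stds)
    log_comp = -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    return -np.logaddexp.reduce(log_w + log_comp)      # stable log-sum-exp over components

# Toy head output: three components over "hour of day" (values are illustrative)
logits = np.array([0.2, -0.1, 0.5])
means = np.array([8.0, 12.5, 18.0])
log_stds = np.log(np.array([1.0, 0.5, 2.0]))

nll = gmm_nll(12.0, logits, means, log_stds)
```

Working in log space with `np.logaddexp.reduce` avoids underflow when a target lies far from every component.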

3. Training and Inference Protocols

Standard protocol involves removing (masking) blocks, intervals, or arbitrary patterns from the data and jointly optimizing for accurate imputation, often with added noise for stochastic tasks or for de-noising objectives.
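The masking step of this protocol can be sketched as follows: hide part of the observed data with pointwise, blockwise, or combined patterns, feed the masked matrix to the model, and score it only on the hidden entries. Function names, rates, and block sizes are assumptions chosen for illustration.

```python
import numpy as np

def point_mask(shape, missing_rate, rng):
    """1 = keep, 0 = masked; each entry is dropped independently."""
    return (rng.random(shape) >= missing_rate).astype(np.int8)

def block_mask(shape, n_blocks, block_len, rng):
    """Mask n_blocks contiguous time intervals, each on a randomly chosen sensor."""
    n, length = shape
    mask = np.ones(shape, dtype=np.int8)
    for _ in range(n_blocks):
        i = rng.integers(n)
        start = rng.integers(length - block_len + 1)
        mask[i, start:start + block_len] = 0
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 48))                   # 8 sensors, 48 time steps (toy data)
observed = np.ones_like(X, dtype=np.int8)      # assume everything was observed

pm = point_mask(X.shape, 0.1, rng)
bm = block_mask(X.shape, n_blocks=3, block_len=6, rng=rng)
train_mask = pm * bm                           # hybrid pattern: masked if either says so

model_input = X * train_mask                   # masked entries zeroed out for the model
target_idx = (observed == 1) & (train_mask == 0)   # entries the loss is evaluated on
```

Blocks may overlap, so the number of block-masked entries is at most `n_blocks * block_len`.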

CoFILL:
- Random masking regime samples pointwise, blockwise, or hybrid missing patterns.
- Dual preliminary imputations (forward interpolation, Gaussian-noise fill) provide initial context.
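- Training loss is simplified to the squared $\ell_2$ error between the actual diffusion noise and the output of the conditional denoiser:
$\mathcal L(\theta) = \mathbb E_{\mathbf x_0, \boldsymbol\epsilon, t}\,\big\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf x_t, t, \mathbf h)\big\|_2^2$, where $\mathbf h$ denotes the condition features.
- Inference proceeds by reverse diffusion, starting from Gaussian noise and iteratively computing the conditional mean and variance.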
CVAE for Spatiotemporal Masks: Uses standard evidence lower bound (ELBO), with the β-weighted KL term and cross-entropy reconstruction matched to mask probability outputs.
Video Infilling (SDVI): Employs a symmetric variational loss with paired (posterior, prior) KL divergences and pixel-level losses for both prediction and reconstruction; boundary constraints are imposed by initial ConvLSTM states and explicit cell state updates.
Trajectory Generation (TrajGPT): Losses comprise negative log-likelihoods of the GMM region and timestamp predictions; at generation time, beam or top-$k$ sampling allows user-imposed constraints on region or temporal span.
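Spatiotemporal Inpainting (3D Partial Conv): The network is trained under an $\ell_1$ loss split into masked (hole) and unmasked (valid) terms, with term weights $\lambda_{\text{hole}}$ and $\lambda_{\text{valid}}$,

$\mathcal L = \lambda_{\text{hole}}\,\big\|(1-\mathbf M)\odot(\hat{\mathbf X}-\mathbf X)\big\|_1 + \lambda_{\text{valid}}\,\big\|\mathbf M\odot(\hat{\mathbf X}-\mathbf X)\big\|_1$

a split that is critical for numerical fidelity in downstream aggregation.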
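The split into hole and valid terms can be sketched as below, assuming an L1 penalty and per-term averaging. The weights, the normalization, and the name `hole_valid_l1` are illustrative assumptions rather than the published configuration.

```python
import numpy as np

def hole_valid_l1(pred, target, mask, w_hole=1.0, w_valid=1.0):
    """Weighted L1 loss; mask == 1 marks valid (observed) cells, 0 marks holes."""
    err = np.abs(pred - target)
    hole = 1.0 - mask
    l_hole = (err * hole).sum() / max(hole.sum(), 1.0)    # mean error inside holes
    l_valid = (err * mask).sum() / max(mask.sum(), 1.0)   # mean error on observed cells
    return w_hole * l_hole + w_valid * l_valid, l_hole, l_valid

target = np.zeros((4, 4, 4))            # toy space-time histogram
mask = np.ones_like(target)
mask[1:3, 1:3, 1:3] = 0.0               # a 2x2x2 hole
pred = target.copy()
pred[mask == 0] += 1.0                  # prediction is off by 1 only inside the hole

total, l_hole, l_valid = hole_valid_l1(pred, target, mask, w_hole=2.0)
```

Reporting the two terms separately makes it easy to see whether the network is degrading observed cells while filling holes.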

4. Empirical Performance and Benchmarking

Controlled spatiotemporal infilling models are evaluated on diverse datasets and metrics, using experimental setups designed to stress both global accuracy and local/physical plausibility in the infilled regions.

CoFILL (Air Quality, Urban Traffic):

On AQI-36 (simulated failure): MAE = 8.70 (vs. PriSTI 9.03), MSE = 296.5 (vs. 310.4), CRPS improved by up to 9.8% (He et al., 8 Jun 2025).
On METR-LA: MAE = 1.67 (↓10.2%), MSE = 9.42 (↓11.96%).
Qualitative results capture both high-frequency fluctuations and slow trends.
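For reference, the three metric families reported here can be computed on held-out entries as sketched below. The CRPS uses the common sample-based estimator $\mathbb E|S - y| - \tfrac12 \mathbb E|S - S'|$; whether the cited benchmarks use this estimator or a quantile-based approximation is not stated in this article, so treat the code as one reasonable variant with hypothetical function names.

```python
import numpy as np

def masked_mae_mse(pred, target, eval_mask):
    """MAE and MSE restricted to entries where eval_mask == 1 (held-out cells)."""
    diff = (pred - target)[eval_mask.astype(bool)]
    return np.abs(diff).mean(), (diff ** 2).mean()

def crps_from_samples(samples, y):
    """Per-entry sample-based CRPS. samples: (S, ...) ensemble, y: (...) ground truth."""
    term1 = np.abs(samples - y).mean(axis=0)
    term2 = np.abs(samples[:, None] - samples[None, :]).mean(axis=(0, 1))
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 6))
eval_mask = np.zeros_like(target)
eval_mask[:, 2:4] = 1                                          # held-out columns
samples = target + rng.normal(scale=0.3, size=(20, 4, 6))      # toy 20-member ensemble

mae, mse = masked_mae_mse(np.median(samples, axis=0), target, eval_mask)
crps = crps_from_samples(samples, target)[eval_mask.astype(bool)].mean()
```

A degenerate ensemble that always predicts the same value has CRPS equal to its absolute error, which is why CRPS is often read as a probabilistic generalization of MAE.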

C-VAE (Burn Area Segmentation):

Intersection-over-Union (IoU): 0.82 (vs. U-Net 0.79), symmetric Hausdorff distance 5.4 (vs. 6.1), temporal smoothness score of 0.09 (vs. 0.15 for U-Net) (Ribeiro et al., 2023).

Video Infilling (SDVI):

LMS and PSNR/SSIM scores superior or equal to interpolation/prediction baselines; stochastic sampling avoids mode collapse and enables long-range, physically plausible infill (Xu et al., 2018).

3D Partial Conv (NYC Taxi/Bikeshare):

MAE on masked voxels: 0.72 (vs. global mean 1.26, 3D-NN 1.36), with consistent performance gains in dense/dynamic urban regions (Han et al., 2023).

Implicit Neural Representations (Dynamic XCT):

PSNR improvements of 7–9 dB over TIMBIR, SSIM gains up to 0.35, robust to undersampling and noise (Boulanger et al., 9 Oct 2025).

Trajectory Generation (TrajGPT):

Temporal accuracy improved over competing models on GeoLife, while spatial accuracy stayed at ≥98% of the best model (Hsu et al., 2024).

5. Extensions: Modality-Specific and User-Controlled Infilling

Scene and Motion Editing (NeRF/4DNeRF): In Inpaint4DNeRF (Jiang et al., 2023), controlled infilling operates at the 3D or 4D (dynamic) level via mask propagation, seed-view latent inpainting with Stable Diffusion models (e.g., ControlNet), planar depth proxies, and iterative NeRF finetuning. Promptable semantic control and geometric proxies ensure that both spatial and temporal consistency are maintained, and that user-specified edits propagate coherently throughout the scene in space and time. Representative quantitative results include multiview PSNR/SSIM of 24–26 dB/0.85, with a user study confirming 87% multiview-consistency.

Keyframe-Based Human Motion Infilling: Generative infilling with retiming (Goel et al., 2 Mar 2025) handles noisy keyframe times via a flexible time-warp function and spatial residuals, outperforming pure hard-constraint and imputation baselines on L2-Acc (e.g., 0.60 vs. 1.52) and key-pose error (0.019 vs. 0.120).

Trajectory Infilling with Constraints: TrajGPT (Hsu et al., 2024) permits infilling arbitrary sequence gaps under additional constraints (e.g., only visits in a given spatial area or time window), achieved by training over masked and answer-annotated sequences and decoding under context-aware transformer queries.

6. Constraint Enforcement, Limitations, and Domain Adaptation

Controlled infilling models encode constraints through architecture, loss, and conditioning mechanisms:

Hard fidelity: In all cases, observed values are preserved and serve as boundary or anchor points.
User-guidance: Via prompt tokens, normalized time variables, auxiliary masks, or explicit latent variables, the user can specify content, timing, or target region.
Regularization: Models incorporate residual connections, dropout, layer normalization, total variation, and in some cases, explicit artifact correction (e.g., detector/undersampling biases in XCT).

Limitations identified include susceptibility to poor performance in context-free or extremely sparse regions, increased computational burden with high-resolution or long sequences, and limited extrapolative ability for truly out-of-distribution motions, shapes, or trajectories.

7. Outlook and Future Research Directions

Controlled spatiotemporal infilling continues to evolve with advances in deep generative modeling. Current research is oriented toward:

Scalability: Efficient parallelized optimization and batched 4D inference on high-dimensional data (e.g., large-scale dynamic tomography (Boulanger et al., 9 Oct 2025)).
Interactivity: More flexible user interfaces for specifying constraints, e.g., natural-language, sketches, geometric markers.
Multimodal fusion: Integration of external information (weather, events, social context).
Continuous representations: Relaxing grid discretization in favor of continuous, coordinate-based representations or neural fields.
Improved uncertainty quantification: Leveraging the stochastic nature of diffusion, VAEs, and transformer sampling to provide credible intervals or ensemble-based assessments for scientific and policy applications.

The convergence of architectural innovations (diffusion, transformer, INR, partial convolution), advanced conditioning methods, and explicit constraint enforcement positions controlled spatiotemporal infilling as a central enabling technology for robust, user-aware completion and synthesis in complex spatiotemporal domains.