Spatiotemporal Cube Tokenization

Updated 28 May 2026

Spatiotemporal cube tokenization is a technique that converts high-dimensional, multi-axis data into discrete tokens, retaining both spatial and temporal information for neural processing.
It employs fixed-size 3D patches, adaptive refinement, and hierarchical quantization to balance compression fidelity with computational efficiency.
This approach drives advancements in video modeling, physical system forecasting, and Earth observation by enabling scalable, interpretable representations.

Spatiotemporal cube tokenization refers to the transformation of high-dimensional, multi-axis data—where two or more axes are spatial and at least one axis is temporal—into discrete, compact representations ("tokens") suitable for consumption by neural models. This approach underpins efficient learning, analysis, compression, and synthesis in domains such as video modeling, physical system forecasting, Earth observation, and dynamic 3D scene prediction. While canonical implementations partition inputs into regular, fixed-size 3D patches ("cubes"), state-of-the-art methods extend this by hierarchically or adaptively allocating representational budget, with trade-offs in compression fidelity, computational cost, and downstream expressiveness.

1. Mathematical Formulation of Spatiotemporal Cube Tokenization

At its core, spatiotemporal cube tokenization restructures an input tensor—such as $X \in \mathbb{R}^{T\times H\times W\times C}$ for video, $O_t \in \mathbb{R}^{H\times W\times Z}$ for 3D occupancy, or $D = \{(x, y, t) \mapsto \mathbf{v}(x, y, t) \in \mathbb{R}^C\}$ for sensor records—into a set of non-overlapping (or adaptive) cubes. For regular tokenization, each cube is defined as:

$\mathrm{cube}_{i,j,k} = X[\,iT_s:(i+1)T_s,\, jH_s:(j+1)H_s,\, kW_s:(k+1)W_s,\, :]$

where $(T_s, H_s, W_s)$ are the temporal and spatial cube sizes. Each cube is flattened and becomes a "token" after projection through a learned embedding or quantization. This procedure yields a grid of $N = N_t N_h N_w$ tokens, with $N_t = \lfloor T/T_s \rfloor$, $N_h = \lfloor H/H_s \rfloor$, and $N_w = \lfloor W/W_s \rfloor$ (Atanov et al., 14 Apr 2026, Maldonado et al., 23 Sep 2025, Liao et al., 12 Jul 2025).
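The regular partition can be sketched in NumPy as a reshape followed by an axis permutation. This is a minimal illustration rather than any cited paper's implementation; the function name `cube_tokenize` and the toy input sizes are assumptions, and the learned projection or quantization step is omitted.

```python
import numpy as np

def cube_tokenize(x, cube_size):
    """Split x of shape (T, H, W, C) into flattened, non-overlapping cubes."""
    T, H, W, C = x.shape
    Ts, Hs, Ws = cube_size
    Nt, Nh, Nw = T // Ts, H // Hs, W // Ws
    # Drop trailing samples that do not fill a whole cube (floor division).
    x = x[:Nt * Ts, :Nh * Hs, :Nw * Ws]
    # Expose the within-cube axes, then group the grid axes first.
    x = x.reshape(Nt, Ts, Nh, Hs, Nw, Ws, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per cube (i, j, k), each flattened to Ts*Hs*Ws*C values.
    return x.reshape(Nt * Nh * Nw, Ts * Hs * Ws * C)

# Toy video: 8 frames of 16x16 pixels with 3 channels (hypothetical sizes).
video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens = cube_tokenize(video, (2, 4, 4))
print(tokens.shape)  # (64, 96)
```

Here $N = 4 \cdot 4 \cdot 4 = 64$ tokens, each holding $2 \cdot 4 \cdot 4 \cdot 3 = 96$ raw values before any embedding is applied.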

Advanced variants assign tokens adaptively or hierarchically. In adaptive schemes, cubes are refined selectively based on local data complexity, e.g., variance, producing a variable set of tokens per frame (Zhang et al., 2024). In hierarchical schemes, successive quantizations account for both coarse and fine residuals, and temporal aggregation may be performed either by explicit temporal cubes or by aggregating residuals over time (Liao et al., 12 Jul 2025).

2. Canonical Architectures and Vector Quantization

Standard spatiotemporal tokenizers typically employ a stack of 3D convolutions to project input cubes into latent space, followed by vector quantization or learned embeddings:

Encoder (Generic):

Input: $X \in \mathbb{R}^{T\times H\times W\times C}$
Partition into non-overlapping cubes of size $(p_T, p_H, p_W)$
Apply 3D CNNs to obtain latent grid $Z_e \in \mathbb{R}^{T'\times H'\times W'\times d}$
(If quantized) For each cube latent $z_e$, select the nearest codebook index $k = \arg\min_j \|z_e - e_j\|_2$ and emit the token $z_q = e_k$ (Maldonado et al., 23 Sep 2025).

Vector Quantization Loss:

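$\mathcal{L}_{\mathrm{VQ}} = \|\mathrm{sg}[z_e] - e_k\|_2^2 + \beta\,\|z_e - \mathrm{sg}[e_k]\|_2^2$

Here, in the standard VQ-VAE form, $z_e$ is the encoder output for a cube, $e_k$ is its nearest codebook vector, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\beta$ is the commitment weight.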
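As a minimal NumPy sketch of nearest-codebook quantization, assuming the standard VQ-VAE objective (a codebook term plus a $\beta$-weighted commitment term): the function name `vector_quantize`, the value `beta=0.25`, and the toy codebook are illustrative assumptions, not values taken from the cited work. Only the forward value of the loss is computed; the stop-gradient operator affects gradients, not this value.

```python
import numpy as np

def vector_quantize(z_e, codebook, beta=0.25):
    """Map each row of z_e (N, d) to its nearest row of codebook (K, d)."""
    # Pairwise squared Euclidean distances, shape (N, K).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)   # discrete token ids
    z_q = codebook[idx]       # quantized latents
    # Forward value: both loss terms share the same squared error and
    # differ only in which side receives gradients during training.
    sq_err = np.mean(np.sum((z_e - z_q) ** 2, axis=-1))
    loss = sq_err + beta * sq_err
    return idx, z_q, loss

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
z_e = np.array([[0.1, -0.1], [0.9, 1.2], [3.0, 3.5]])
idx, z_q, loss = vector_quantize(z_e, codebook)
print(idx)  # [0 1 2]
```

In an autodiff framework the two terms would be written separately so that the codebook and the encoder each receive their own gradient.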

GAN-based adversarial refinement modules (e.g., 3D CNN discriminators with hinge losses) are commonly employed to mitigate artifacts such as temporal misalignment or motion smearing, with composite objectives that add perceptual and adversarial losses to the tokenizer's base objective (Maldonado et al., 23 Sep 2025). Codebook size selection is data-dependent: compact codebooks suffice for dense 2D motion, but genuinely volumetric (3D) data may require larger codebooks for faithful reconstruction (Maldonado et al., 23 Sep 2025).

Multi-scale residual quantization approaches extend this framework by successive quantization and upsampling/downsampling, compressing both spatial detail and long-range scene changes (Liao et al., 12 Jul 2025).
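The core of residual quantization can be sketched as repeatedly quantizing whatever the previous stage left unexplained. The example below is a simplified, single-resolution illustration: it omits the upsampling/downsampling between scales, and the helper names and toy codebooks are assumptions rather than details from the cited method.

```python
import numpy as np

def nearest(z, codebook):
    """Index of the closest codebook row for each row of z."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def residual_quantize(z, codebooks):
    """Quantize z stage by stage; each stage encodes the remaining residual."""
    residual = z.astype(float)      # astype returns a copy, z is untouched
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        idx = nearest(residual, cb)
        recon += cb[idx]            # accumulate the coarse-to-fine estimate
        residual -= cb[idx]         # what is still unexplained
        indices.append(idx)
    return indices, recon

coarse = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
fine = np.array([[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, -0.25]])
z = np.array([[1.3, 0.9]])
indices, recon = residual_quantize(z, [coarse, fine])
print([i.tolist() for i in indices])  # [[1], [1]]
print(recon)                          # [[1.25 1.  ]]
```

The coarse stage alone would reconstruct `[1.0, 1.0]`; the fine stage moves the estimate closer to the input using one extra index per token.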

3. Adaptive and Hierarchical Tokenization Approaches

The primary bottleneck in feeding large-volume spatiotemporal data to attention-based neural architectures is the quadratic complexity of self-attention with respect to token count. Adaptive tokenization, as in MATEY (Zhang et al., 2024), employs dynamically chosen patch sizes: coarse in homogeneous regions and fine in complex, high-variance ones:

Adaptive Cube Tokenization:
- Coarse partitioning into large patches
- Compute local variance per patch
- Mark patches exceeding a variance threshold for further subdivision into sub-token-scale patches
- Fusion schemes: multi-resolution (disjoint sequences) or mixed-resolution (single fused sequence)

This results in token sequences whose average length is a function of marked patch count and refinement factor, efficiently balancing expressiveness and computational cost. Empirically, adaptive tokenization can halve the number of tokens required while surpassing uniform fine grids in accuracy-versus-cost trade-offs (Zhang et al., 2024).
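The variance-driven refinement described above can be sketched as follows. This is an illustrative mixed-resolution variant in which each marked coarse cube is replaced by its sub-cubes. The function name `adaptive_cubes`, the cubic patch shape, the threshold, and the toy field are all assumptions, not MATEY's actual implementation.

```python
import numpy as np

def adaptive_cubes(x, coarse, refine, threshold):
    """Return (origin, size) cubes for a scalar field x of shape (T, H, W)."""
    assert coarse % refine == 0, "refine must divide the coarse cube size"
    fine = coarse // refine
    T, H, W = x.shape
    cubes = []
    for t in range(0, T - coarse + 1, coarse):
        for h in range(0, H - coarse + 1, coarse):
            for w in range(0, W - coarse + 1, coarse):
                patch = x[t:t + coarse, h:h + coarse, w:w + coarse]
                if patch.var() <= threshold:
                    # Homogeneous region: one coarse token is enough.
                    cubes.append(((t, h, w), coarse))
                    continue
                # High-variance region: subdivide into refine**3 sub-cubes.
                for dt in range(0, coarse, fine):
                    for dh in range(0, coarse, fine):
                        for dw in range(0, coarse, fine):
                            cubes.append(((t + dt, h + dh, w + dw), fine))
    return cubes

# Toy field: flat everywhere except one structured corner.
x = np.zeros((8, 8, 8))
x[:4, :4, :4] = np.arange(64.0).reshape(4, 4, 4)
cubes = adaptive_cubes(x, coarse=4, refine=2, threshold=1e-3)
n_coarse = sum(size == 4 for _, size in cubes)
n_fine = sum(size == 2 for _, size in cubes)
print(n_coarse, n_fine, len(cubes))  # 7 8 15
```

With $N_c$ coarse cubes, $m$ of them marked, and refinement factor $r$ per axis, this variant produces $(N_c - m) + m r^3$ tokens. In the toy case that is 15 tokens, against 64 for a uniform grid of size-2 cubes, which matters because attention cost grows quadratically with token count.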

Hierarchical tokenization, as realized in VideoFlexTok (Atanov et al., 14 Apr 2026) and $I^{2}$-World (Liao et al., 12 Jul 2025), underpins coarse-to-fine information allocation. In VideoFlexTok, "register" tokens are organized such that early tokens encode global semantic/motion structure, and later tokens incrementally add fine detail; training employs nested dropout to enforce this hierarchy. $I^{2}$-World combines intra-scene (spatial, multi-scale) residual quantization with inter-scene (temporal, residual aggregate) tokenization for efficient dynamic 4D forecasting.
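The nested-dropout idea can be illustrated with a short sketch: during training, a random prefix length is drawn and every token after it is masked, so the decoder must rely on the earliest tokens for the most important information. The uniform sampling of the prefix length and the zero-masking (rather than truncation) are assumptions made for illustration; the source does not specify how VideoFlexTok implements either.

```python
import numpy as np

def nested_dropout(tokens, rng):
    """Keep a random-length prefix of tokens (n, d) and zero out the rest."""
    n = tokens.shape[0]
    k = int(rng.integers(1, n + 1))   # keep between 1 and n tokens
    keep = np.arange(n) < k           # True for the first k positions
    return tokens * keep[:, None], k

rng = np.random.default_rng(0)
registers = rng.standard_normal((8, 4))   # 8 hypothetical register tokens
masked, k = nested_dropout(registers, rng)
print(k, np.count_nonzero(masked.any(axis=1)))
```

At inference time, choosing a smaller prefix gives a cheaper, coarser representation, and a longer prefix adds detail.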

4. Continuous and Implicit Spatiotemporal Cube Representations

Continuous implicit neural fields represent a conceptually distinct approach, encoding the entire spatiotemporal cube as a learned function. The GeoNDC architecture (Qi et al., 26 Mar 2026) parameterizes the data cube $D$ with an MLP $f_\theta$ using multi-resolution hash embeddings for both spatial and spatiotemporal coordinates. This achieves several critical properties:

Tokenization as Function Parameterization: The weights $\theta$ and any (sparse) quantized residuals together "tokenize" the entire archive.
Continuous Query and Interpolation: Arbitrary $(x, y, t)$ queries are mapped to the desired value without explicit cube tiling or finite token sequences; continuous $t$ enables sub-frame temporal interpolation.
Compression and Fidelity: GeoNDC attains a 95:1 compression ratio (vs. an Int16 baseline) with high mean per-band fidelity, and maintains efficient query mechanisms ($81\times$ faster than file I/O on regional queries).

This implicitly learned tokenization collapses the need for fixed or adaptive cube partitioning, instead representing the data in the neural parameter space, and supporting differentiable, high-fidelity reconstructions (Qi et al., 26 Mar 2026).
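To make the "cube as a function" idea concrete, the sketch below defines a tiny coordinate MLP that can be queried at arbitrary continuous $(x, y, t)$. It is a simplified stand-in under stated assumptions: it uses a sinusoidal coordinate encoding instead of GeoNDC's multi-resolution hash embeddings, the weights are random rather than trained, and all names and layer sizes are hypothetical.

```python
import numpy as np

def encode(coords, n_freqs=4):
    """Sinusoidal encoding of coords (N, 3) in [0, 1] -> (N, 3 * 2 * n_freqs)."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi
    angles = coords[:, :, None] * freqs             # (N, 3, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(len(coords), -1)

def init_mlp(rng, sizes):
    """Random (untrained) weights and zero biases for each layer."""
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def query_field(params, coords):
    """Evaluate the field at continuous (x, y, t) coordinates."""
    h = encode(coords)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)              # ReLU hidden layers
    W, b = params[-1]
    return h @ W + b                                # one value per band

rng = np.random.default_rng(0)
params = init_mlp(rng, [24, 64, 64, 4])             # 4 output bands (assumed)
n_params = sum(W.size + b.size for W, b in params)

# The second query uses a time value that need not lie on a stored frame.
coords = np.array([[0.5, 0.5, 0.25], [0.5, 0.5, 0.2734]])
values = query_field(params, coords)
print(values.shape, n_params)  # (2, 4) 6020
```

The stored representation is the parameter vector itself (6020 numbers here), independent of how finely the cube is later sampled.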

5. Quantitative Impacts, Benchmarks, and Trade-Offs

Systematic analysis across domains reveals that spatiotemporal cube tokenization provides substantial compression and computational gains, but efficacy depends on data complexity, application, and tokenizer architecture.

Method	Domain	Typical Token Count	Baseline	Fidelity Metrics	Compression
3D-grid (VQ-GAN) (Maldonado et al., 23 Sep 2025)	2D/3D motion heatmaps	Patch-factor reduced grid	dVAE baseline	2D SSIM 0.975 (+5.4%), 3D SSIM 0.934 (+9.3%), a third metric reduced by 37.1%	F8–F16
VideoFlexTok (Atanov et al., 14 Apr 2026)	Generative video	Variable tokens/frame	3D-grid (uses more tokens)	gFVD 80.0 (Kinetics-600, 160 tokens/clip)	Fewer tokens than 3D grid
GeoNDC (Qi et al., 26 Mar 2026)	Earth obs. cubes	Voxel cube stored as a parameter vector	Int16 and float64 raster	RMSE 0.021	95:1 vs. Int16
MATEY (Zhang et al., 2024)	PDE/Physics	Variable, adaptive	Uniform fine/cube	NRMSE reduced by 20-30% for 10-20% token increase	Halved tokens at same error
$I^{2}$-World (Liao et al., 12 Jul 2025)	4D occupancy	S=3, G=history	Baseline VQ-VAE	mIoU +25.1%, IoU +36.9% over SOTA	Real-time (37 FPS)

Performance metrics are dataset- and architecture-specific, but the cited results point in a consistent direction: (1) dense or hierarchical cube tokenization outperforms the baselines each work compares against, (2) adaptive/hierarchical schemes reduce computational cost while improving or preserving accuracy, and (3) codebook size must be matched to signal complexity (Maldonado et al., 23 Sep 2025, Atanov et al., 14 Apr 2026, Zhang et al., 2024, Liao et al., 12 Jul 2025, Qi et al., 26 Mar 2026).

6. Applications and Future Directions

Spatiotemporal cube tokenization underlies several advanced modeling paradigms:

Generative Video: 3D grid tokenization enables efficient Transformer-modeling of video; hierarchical methods (VideoFlexTok) allow cost-fidelity trade-offs for long-range content (Atanov et al., 14 Apr 2026).
Human Motion and Activity Recognition: Dense cube tokenization with adversarial refinement achieves state-of-the-art in dynamic motion analysis and compression (Maldonado et al., 23 Sep 2025).
Multiscale Physical System Modeling: Adaptive tokenization as in MATEY efficiently models multiresolution physical processes (e.g., PDEs, turbulence) with improved NRMSE and lower token counts (Zhang et al., 2024).
Earth Observation: Implicit cube tokenization (GeoNDC) supports planetary-scale data analysis, on-demand queries, and extreme compression (Qi et al., 26 Mar 2026).
Occupancy and World Models: Hierarchical spatiotemporal cube tokenization, decoupling intra/inter-scene, enables real-time 4D scene forecasting for autonomous systems (Liao et al., 12 Jul 2025).

A plausible future direction is tighter integration of adaptive, neural, and hierarchical tokenization strategies with domain-specific architectures, hybridizing the compactness and query efficiency of neural fields with the flexibility and compositionality of discrete token sequences.

7. Guidelines and Considerations

Empirical findings across the literature indicate the following best practices:

Codebook Tuning: Optimal codebook sizes depend on modality; start with a compact codebook for dense 2D data and a larger one for 3D volumetric data (Maldonado et al., 23 Sep 2025).
Compression Scaling: Trade-offs between factor (e.g., F8, F16) and reconstruction fidelity are best determined empirically per domain (Maldonado et al., 23 Sep 2025, Qi et al., 26 Mar 2026).
Adversarial Training: GAN-based refinement is reported to be important for temporally coherent reconstructions with fewer artifacts in motion-intensive settings (Maldonado et al., 23 Sep 2025).
Adaptive/Hierarchical Design: Employ variance-based patch refinement or residual quantization to balance computational cost against modeling expressiveness (Zhang et al., 2024, Liao et al., 12 Jul 2025).
Continuous/Implicit Methods: For applications demanding high compressibility, continuous neural fields offer a maximally compact, queryable encoding, though at the cost of explicit token interpretability (Qi et al., 26 Mar 2026).
Trajectory and ROI Support: Explicit and implicit tokenization schemes both support efficient region-of-interest or trajectory queries; implicit fields further enable continuous interpolation and on-demand gradients (Qi et al., 26 Mar 2026).

Spatiotemporal cube tokenization continues to be a central primitive not only for compression, but also for scalable deep learning on real-world spatiotemporal systems, enabling major advances in Earth observation, video understanding, and dynamic scene modeling.