Tensorized Spatiotemporal Convolution
- Tensorized spatiotemporal convolution is a method that decomposes high-dimensional operations into efficient, low-rank tensor factorizations across spatial and temporal dimensions.
- It reduces memory and computation by employing schemes like Tucker, CP, or TT decompositions, making it scalable for video, graph, and multi-dimensional time-series applications.
- Practical models such as FT-STGCN and CT-Net demonstrate its ability to achieve state-of-the-art performance with significant parameter and FLOP reductions.
Tensorized spatiotemporal convolution refers to a class of convolutional operators, architectures, and computational techniques that exploit tensor factorizations, high-order algebraic structures, or mode-wise decompositions to achieve more efficient, expressive, or scalable modeling of data with coupled spatial and temporal dimensions. Unlike naive spatiotemporal convolutions—which directly parameterize full 3D (or generally N-dimensional) kernels and often incur prohibitive memory or compute cost—tensorized approaches leverage the multi-way structure of data and parameters, decomposing convolutions into sequences of smaller or separable operations, or expressing data and filters as low-rank tensors. This framework encompasses not only video models and 3D ConvNets but also graph-based and sequential architectures handling dynamic, graph-structured, or more general spatiotemporal phenomena.
1. Tensorized Spatiotemporal Convolution: Core Mathematical Operators
A variety of tensorized convolutional operators have been introduced to perform simultaneous filtering along spatial and temporal axes:
- Factorized Convolution via Tensor Decomposition (Tucker, CP, TT): The input (e.g., video clip, graph signals, or multi-channel time series) and/or convolutional kernels are reshaped into high-order tensors and approximated by low-rank decompositions. For example, in traffic forecasting, the data tensor $\mathcal{X} \in \mathbb{R}^{N \times F \times T}$ is factorized via the Tucker formulation $\mathcal{X} \approx \mathcal{G} \times_1 U_S \times_2 U_F \times_3 U_T$, with core tensor $\mathcal{G}$ and factor matrices $U_S$, $U_F$, $U_T$ along the spatial, feature, and temporal modes, respectively (Xu et al., 2021). CP and tensor-train decompositions achieve analogous efficiency by expressing kernels as sums or products of lower-dimensional factors (Kossaifi et al., 2019, Su et al., 2020); a minimal factorized-convolution sketch appears after this list.
- Channel Tensorization and Multimodal Factorization: In video recognition, the channel dimension itself may be recursively factorized (e.g., $C = C_1 \times C_2 \times \cdots \times C_K$), enabling a series of separable convolutions along sub-channel axes and thus facilitating higher-dimensional interactions with reduced overhead (Li et al., 2021).
- Spatiotemporal Product and M-Product on Graphs: Dynamic and temporal graphs are represented by stacking slices into third-order tensors, and tensor products like the M-product enable joint spatial and temporal propagation. Given a node–feature tensor $\mathcal{X} \in \mathbb{R}^{N \times F \times T}$ and an adjacency tensor $\mathcal{A} \in \mathbb{R}^{N \times N \times T}$, update rules such as $\mathcal{Z} = \mathcal{A} \star_M \mathcal{X} = \big((\mathcal{A} \times_3 M)\,\triangle\,(\mathcal{X} \times_3 M)\big) \times_3 M^{-1}$ (with $\star_M$ denoting the M-product and $\triangle$ face-wise multiplication) perform a temporal linear transform (mode-3), face-wise multiplication (spatial), and an inverse temporal transform, capturing both contemporaneous and historical neighborhood structure (Han, 22 Apr 2025); see the M-product sketch after this list.
- Spatiotemporal Graph Convolution with Product Graphs: Convolutions constructed on the Cartesian, Kronecker, or parametric product of spatial and temporal graphs support shift-and-sum filtering and polynomial kernels mixing spatial and temporal operators (Isufi et al., 2021).
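The following is a minimal PyTorch sketch of the factorized-convolution idea above: a full $C_{\mathrm{out}} \times C_{\mathrm{in}} \times T \times H \times W$ kernel is replaced by a rank-$R$ chain of pointwise, spatial, and temporal factors. The class name `CPFactorizedConv3d`, the rank, and the layer ordering are illustrative assumptions, not the exact layer of any cited paper.

```python
import torch
import torch.nn as nn

class CPFactorizedConv3d(nn.Module):
    """CP-style factorization of a full C_out x C_in x T x H x W kernel into a
    rank-R chain: input-channel factor, per-rank spatial and temporal factors,
    output-channel factor. Illustrative sketch, not the layer of any cited paper."""
    def __init__(self, c_in, c_out, k_t=3, k_s=3, rank=8):
        super().__init__()
        self.in_proj = nn.Conv3d(c_in, rank, kernel_size=1)              # input-channel factor
        self.spatial = nn.Conv3d(rank, rank, kernel_size=(1, k_s, k_s),
                                 padding=(0, k_s // 2, k_s // 2),
                                 groups=rank)                            # per-rank spatial factor
        self.temporal = nn.Conv3d(rank, rank, kernel_size=(k_t, 1, 1),
                                  padding=(k_t // 2, 0, 0),
                                  groups=rank)                           # per-rank temporal factor
        self.out_proj = nn.Conv3d(rank, c_out, kernel_size=1)            # output-channel factor

    def forward(self, x):  # x: (B, C_in, T, H, W)
        return self.out_proj(self.temporal(self.spatial(self.in_proj(x))))

x = torch.randn(2, 16, 8, 32, 32)
print(CPFactorizedConv3d(16, 32, rank=8)(x).shape)  # torch.Size([2, 32, 8, 32, 32])
```

Because every factor is a pointwise, 2D, or 1D convolution, the operator maps directly onto well-optimized low-dimensional kernels.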
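And here is a NumPy sketch of the M-product update described above: a mode-3 temporal transform, face-wise (frontal-slice) multiplication, and an inverse transform. The tensor shapes, the lower-triangular choice of $M$, and the function names are assumptions for illustration only.

```python
import numpy as np

def mode3_transform(X, M):
    """Apply M along the temporal (third) mode: X x_3 M."""
    # X: (N, F, T) or (N, N, T); M: (T, T)
    return np.einsum('ijt,st->ijs', X, M)

def m_product(A, X, M):
    """Tensor M-product A *_M X: temporal transform, face-wise
    (frontal-slice) matrix products, inverse temporal transform."""
    A_hat = mode3_transform(A, M)                        # adjacency in the transformed domain
    X_hat = mode3_transform(X, M)                        # features in the transformed domain
    Z_hat = np.einsum('nmt,mft->nft', A_hat, X_hat)      # face-wise multiplication (spatial)
    return mode3_transform(Z_hat, np.linalg.inv(M))      # back to the time domain

N, F, T = 5, 4, 6
A = np.random.rand(N, N, T)                              # temporal adjacency slices
X = np.random.rand(N, F, T)                              # node-feature tensor
M = np.tril(np.ones((T, T)))                             # causal lower-triangular mixing (invertible)
Z = m_product(A, X, M)
print(Z.shape)                                           # (5, 4, 6)
```

With this lower-triangular choice of M, the temporal mixing is causal, so each output slice aggregates only current and past neighborhood information.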
2. Decomposition Strategies and Computational Advantages
Tensorized convolutional techniques are motivated largely by their substantial reductions in memory and computation:
- Parameter and FLOP Reduction: Classical N-dimensional convolution has a parameter count scaling as $O\!\big(C_{\mathrm{in}} C_{\mathrm{out}} \prod_{d=1}^{N} K_d\big)$ for kernel sizes $K_d$. Tensor factorizations reduce this drastically, e.g., to $O\!\big(R\,(C_{\mathrm{in}} + C_{\mathrm{out}} + \sum_{d=1}^{N} K_d)\big)$ for CP decompositions of rank $R$, or to the sum of core plus mode-factor costs in Tucker/TT (Kossaifi et al., 2019, Xu et al., 2021, Su et al., 2020); a worked count appears after this list. For Tucker-based spatiotemporal graph convolution on an $N \times F \times T$ data tensor, the memory required drops from $O(NFT)$ (dense) to $O(R_1 R_2 R_3 + N R_1 + F R_2 + T R_3)$ for the factorized components (with $R_1 \ll N$, $R_2 \ll F$, $R_3 \ll T$) (Xu et al., 2021).
- Computation and Parallelism: Once decomposed, the full spatiotemporal convolution can be constructed from three or more small matrix multiplications—e.g., mode-specific spatial, temporal, or channel projections—followed by a small-scale core reconstruction, enabling highly parallel and efficient implementations (Xu et al., 2021). The decomposition also enables the use of highly optimized 2D kernels rather than general-purpose 5D tensor convolutions, a critical advantage for hardware that lacks native support for high-dimensional operations (Hajimolahoseini et al., 23 Jul 2024).
- Dynamic Graph Lifting: For time-evolving graphs, tensorized propagation via the M-product provides a single operation to mix history and aggregate neighbors, whereas traditional pipelines require separate GCN and sequence models, which disrupts joint dependencies and increases parameterization (Han, 22 Apr 2025).
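As a concrete illustration of the scaling above, here is a back-of-the-envelope parameter count under assumed sizes ($C_{\mathrm{in}} = C_{\mathrm{out}} = 64$, a $3 \times 3 \times 3$ spatiotemporal kernel, CP rank $R = 8$); the numbers are illustrative only.

```python
# Back-of-the-envelope parameter counts under assumed sizes:
# C_in = C_out = 64, a 3x3x3 spatiotemporal kernel, CP rank R = 8.
c_in, c_out, k_t, k_h, k_w, rank = 64, 64, 3, 3, 3, 8

full_3d = c_in * c_out * k_t * k_h * k_w              # dense kernel entries
cp      = rank * (c_in + c_out + k_t + k_h + k_w)     # CP factor entries only

print(full_3d)                 # 110592
print(cp)                      # 1096
print(round(full_3d / cp, 1))  # 100.9 -> roughly 100x fewer parameters
```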
3. Practical Implementations and Model Architectures
Diverse architectures have implemented tensorized spatiotemporal convolution, adapting the paradigm to specific domains:
- Factorized Spatial-Temporal Tensor Graph Convolutional Networks (FT-STGCN): For traffic speed prediction, FT-STGCN models the entire road network as a spatiotemporal tensor, applies mode-wise projections via graph adjacency matrices and temporal adjacency slices, and reconstructs outputs from a compact Tucker core, yielding low overhead and multi-way denoising (Xu et al., 2021).
- Channel-Tensorized and Tensor-Separable Video Models (CT-Net): Video classification architectures such as CT-Net apply channel tensorization, taking the channel axis as a product of sub-dimensions and employing tensor-separable spatial and temporal convolutions, along with tensor-excitation mechanisms (attention gates per modality) to enhance joint spatiotemporal-context modeling (Li et al., 2021); a minimal channel-tensorization sketch follows this list.
- CP-Higher-Order Convolutions with Transduction: High-order CNNs (e.g., for affect analysis) exploit CP decompositions for convolutional kernels, supporting training on 2D data with later “transduction” to higher-order (temporal) modes via addition of extra CP factors, enabling knowledge transfer and rapid adaptation to spatiotemporal tasks (Kossaifi et al., 2019).
- Dynamic Graph Models using M-Product: Tensorized lightweight GCNNs embed both feature and adjacency evolution in the time mode, using invertible temporal mixing matrices to perform global temporal aggregation and graph convolution simultaneously, achieving true joint spatiotemporal receptive fields with low complexity (Han, 22 Apr 2025).
- Tensor-Train Factorized Convolutional LSTM: The Conv-TT-LSTM model factorizes multi-lag spatiotemporal kernels into a tensor-train structure, thus supporting higher-order memory at an efficient parameter cost, which is critical for multi-frame video prediction and activity recognition (Su et al., 2020); a small tensor-train reconstruction sketch also follows this list.
- Winograd-Class Multicore Tensor Convolution: On general CPU hardware, N-dimensional convolution operations can be reformulated as chains of mode-wise Winograd transforms and matrix multiplications, tuned for cache and vectorization, enabling high-throughput spatiotemporal convolution previously only feasible on GPUs (Budden et al., 2016).
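To make the channel-tensorization idea concrete, here is a minimal PyTorch sketch in which the channel axis is viewed as $C = C_1 \times C_2$ and mixed along one sub-axis at a time via grouped pointwise convolutions, interleaved with depthwise spatial and temporal filtering. The block structure, names, and sub-dimension sizes are illustrative assumptions and do not reproduce CT-Net's exact design.

```python
import torch
import torch.nn as nn

class ChannelTensorizedBlock(nn.Module):
    """Sketch of channel tensorization: view C = C1 * C2 and mix channels along
    one sub-axis at a time with grouped 1x1x1 convs, then apply depthwise spatial
    and temporal filtering. Illustrative only; not CT-Net's exact block."""
    def __init__(self, c1=4, c2=8, k_s=3, k_t=3):
        super().__init__()
        c = c1 * c2
        self.c1, self.c2 = c1, c2
        self.mix_c2 = nn.Conv3d(c, c, kernel_size=1, groups=c1)   # interactions along the C2 axis
        self.mix_c1 = nn.Conv3d(c, c, kernel_size=1, groups=c2)   # interactions along the C1 axis
        self.spatial = nn.Conv3d(c, c, (1, k_s, k_s), padding=(0, k_s // 2, k_s // 2), groups=c)
        self.temporal = nn.Conv3d(c, c, (k_t, 1, 1), padding=(k_t // 2, 0, 0), groups=c)

    @staticmethod
    def _regroup(x, d1, d2):
        # reinterpret the channel axis as (d1, d2) and swap to (d2, d1)
        b, _, t, h, w = x.shape
        return x.reshape(b, d1, d2, t, h, w).transpose(1, 2).reshape(b, -1, t, h, w)

    def forward(self, x):                               # x: (B, C1*C2, T, H, W)
        x = self.mix_c2(x)                              # mix within each C2 block
        x = self._regroup(x, self.c1, self.c2)          # bring C1 blocks together
        x = self.mix_c1(x)                              # mix within each C1 block
        x = self._regroup(x, self.c2, self.c1)          # restore the original channel order
        return self.temporal(self.spatial(x))

x = torch.randn(2, 32, 8, 16, 16)                       # C = 4 * 8 = 32
print(ChannelTensorizedBlock()(x).shape)                # torch.Size([2, 32, 8, 16, 16])
```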
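Similarly, a small NumPy sketch shows how a tensor-train (TT) format stores a kernel as a chain of low-rank cores and reconstructs the full tensor on demand. The core shapes and ranks are assumed for illustration and are not Conv-TT-LSTM's exact parameterization.

```python
import numpy as np

def tt_reconstruct(cores):
    """Contract tensor-train cores G_k of shape (r_{k-1}, n_k, r_k) into the full tensor."""
    full = cores[0]                                          # (1, n1, r1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))    # chain along the TT ranks
    return full.squeeze(0).squeeze(-1)

# A 5x5x5 kernel stored as three small TT cores (rank 2, sizes assumed).
n, r = 5, 2
cores = [np.random.randn(1, n, r), np.random.randn(r, n, r), np.random.randn(r, n, 1)]
K = tt_reconstruct(cores)
print(K.shape)                              # (5, 5, 5)
print(sum(c.size for c in cores), K.size)   # 40 stored parameters vs 125 dense entries
```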
4. Empirical Results and Benchmarks
Experimental evaluation across multiple domains demonstrates consistent advantages for tensorized spatiotemporal convolution:
| Model/Paper | Domain/Dataset | Key Metrics & Results |
|---|---|---|
| FT-STGCN (Xu et al., 2021) | Traffic (SZ-taxi etc.) | RMSE reduction (SZ-taxi: 3.994→3.108), R² increase (0.852→0.911); lower MAE and parameter count, higher explained variance |
| CT-Net (Li et al., 2021) | Video classification | Kinetics-400 top-1: 77.3% (R50, 16 frames); Something-Something V1: 52.5%; 40% FLOP reduction vs. vanilla 3D ResBlock |
| HO-CPConv (Kossaifi et al., 2019) | Emotion estimation | AFEW-VA: Valence RMSE 0.20, CCC 0.57 vs. 3D-ResNet 0.26/0.17; 3–10× parameter reduction |
| Conv-TT-LSTM (Su et al., 2020) | Video prediction | Moving-MNIST: MSE 12.96, 500× fewer kernel params; KTH: SSIM 0.907, PSNR 28.36 dB, outperforming E3D-LSTM |
| TLGCN (Han, 22 Apr 2025) | Dynamic graphs | Outperforms decoupled GCN + sequence-model baselines in weight estimation (4 datasets); reduced memory, no feature transforms |
| 4D-tensor reshape (Hajimolahoseini et al., 23 Jul 2024) | Video | ECO-Lite: ≈51% parameter/compute reduction, ≈12% speedup, accuracy above the Conv3D baseline, 91.2% UCF-101 top-1 after fine-tuning |
These results indicate that, in both graph-structured and Euclidean domains, tensorization achieves state-of-the-art or competitive accuracy while offering dramatic parameter, complexity, and memory advantages.
5. Mode Decomposition, Receptive Field, and Expressivity
Tensorized operators are designed to preserve and efficiently enlarge the spatiotemporal receptive field while retaining multi-mode dependencies:
- Multi-Scale Aggregation: Double sums over spatial and temporal hops (e.g., in FT-STGCN) build hierarchical receptive fields in both domains (Xu et al., 2021).
- Progressive Channel Mixing and Attention: Channel tensorization allows for progressive mixing (increasing channel interactions at each subdimension), and tensor-excitation gates provide modality-specific feature re-weighting (Li et al., 2021).
- Spatiotemporal Denoising and Rank Truncation: Tucker and CP decompositions allow for principled truncation of minor principal components, providing implicit noise suppression analogous to multi-way PCA (Xu et al., 2021, Kossaifi et al., 2019); a truncation sketch follows this list.
- Temporal Generalization via Transduction: By adding or fine-tuning additional modes in the CP factorization, previously trained lower-dimensional models can be extended to the spatiotemporal domain with minimal parameter growth and no loss of previously learned representations (Kossaifi et al., 2019).
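A NumPy HOSVD-style sketch of the rank-truncation idea above: keep only the leading singular vectors of each mode unfolding, project, and reconstruct, which acts as multi-way denoising. The ranks, shapes, and helper names are illustrative assumptions, not the exact estimator used in the cited works.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding of a 3-way tensor."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold for a 3-way tensor."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def mode_product(X, U, mode):
    """Compute X x_mode U for a 3-way tensor X and a matrix U."""
    new_shape = tuple(U.shape[0] if i == mode else s for i, s in enumerate(X.shape))
    return fold(U @ unfold(X, mode), mode, new_shape)

def tucker_truncate(X, ranks):
    """HOSVD-style truncated Tucker: keep the leading singular vectors of each
    unfolding, project onto them, and reconstruct (multi-way denoising)."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = X
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)          # project onto leading subspaces
    X_hat = core
    for mode, U in enumerate(factors):
        X_hat = mode_product(X_hat, U, mode)          # reconstruct the low-rank approximation
    return core, factors, X_hat

X = np.random.rand(20, 6, 30) + 0.01 * np.random.randn(20, 6, 30)   # space x feature x time
core, factors, X_hat = tucker_truncate(X, ranks=(5, 3, 8))
print(core.shape, X_hat.shape)   # (5, 3, 8) (20, 6, 30)
```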
6. Algorithmic and Hardware Considerations
Tensorized spatiotemporal convolutions are often critical for practical deployment on resource-constrained or high-throughput environments:
- Efficient Operator Fusion and Tiling: On CPUs, transforming the computation into chain-mode product operations allows cache-aware tiling, sparsity exploitation, fused multiply-accumulate pipelines, and parallel scheduling (Budden et al., 2016).
- Reduced Tensor Dimensionality: By expressing the full 3D convolution as a composition of 2D and 1D convolutions on 4D or 3D tensors (with appropriate reshaping and merging), accelerators without native support for 5D tensors can achieve high throughput and lower memory consumption (Hajimolahoseini et al., 23 Jul 2024); a reshaping sketch follows this list.
- Model Simplicity for Dynamic Graphs: By omitting nonlinear activations and feature transforms (as in LightGCN), tensorized dynamic graph models achieve much lower memory consumption while retaining task performance (Han, 22 Apr 2025).
- Parallel Computation and Layerwise Construction: Splitting high-dimensional convolutions into mode-specific operators enables coalesced parallel computation and easy interleaving of batch normalization and activation, supporting scalable model width and depth (Xu et al., 2021).
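A generic PyTorch sketch of the dimensionality-reduction trick noted above: the temporal axis is folded into the batch for a 2D spatial pass and the spatial axes are folded into the batch for a 1D temporal pass, so only 2D/1D primitives are needed. The module name and channel sizes are assumptions; this is the general reshaping pattern, not the exact scheme of the cited work.

```python
import torch
import torch.nn as nn

class Reshaped2Plus1D(nn.Module):
    """Emulate a 3D spatiotemporal conv with 2D + 1D primitives by folding the
    temporal axis into the batch for the spatial pass and the spatial axes into
    the batch for the temporal pass. Generic sketch, not the exact cited scheme."""
    def __init__(self, c_in, c_mid, c_out, k_s=3, k_t=3):
        super().__init__()
        self.spatial = nn.Conv2d(c_in, c_mid, k_s, padding=k_s // 2)
        self.temporal = nn.Conv1d(c_mid, c_out, k_t, padding=k_t // 2)

    def forward(self, x):                               # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # spatial 2D pass over a (B*T)-sized batch
        y = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        y = self.spatial(y)
        cm = y.shape[1]
        # temporal 1D pass over a (B*H*W)-sized batch
        y = y.reshape(b, t, cm, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, cm, t)
        y = self.temporal(y)
        co = y.shape[1]
        # restore (B, C_out, T, H, W)
        return y.reshape(b, h, w, co, t).permute(0, 3, 4, 1, 2)

x = torch.randn(2, 16, 8, 14, 14)
print(Reshaped2Plus1D(16, 24, 32)(x).shape)             # torch.Size([2, 32, 8, 14, 14])
```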
7. Limitations, Open Challenges, and Future Directions
While tensorized spatiotemporal convolution provides substantial efficiency and expressivity advantages, several limitations and frontiers persist:
- Rank Selection and Decomposition Overheads: Choosing decomposition ranks (for Tucker, CP, or TT) requires empirical tuning or data-driven strategies; overly aggressive truncation can degrade performance, and some decompositions incur nontrivial computational cost during training or initialization.
- Expressivity–Efficiency Trade-offs: Although parameter savings are often large, some models (e.g., HO-CPConv, Conv-TT-LSTM) suggest that larger ranks or hybrid decompositions may be needed to match the full accuracy of uncompressed models on certain tasks, even though reported results generally show little or no accuracy loss (Kossaifi et al., 2019, Su et al., 2020).
- Integration with Non-Euclidean Domains: While substantial progress has been made in graph and relational domains, harmonizing tensorized convolution with arbitrary non-Euclidean structures, edge dynamics, and adaptive graph coupling remains a rich area for exploration (Isufi et al., 2021, Han, 22 Apr 2025).
- Hardware Portability and Edge Deployment: The movement from high-dimensional convolution operators to 2D/1D-only primitives is crucial for edge deployment, but demands further work on automated graph-to-tensor conversions and operator fusion for real-world embedded scenarios (Hajimolahoseini et al., 23 Jul 2024).
Tensorized spatiotemporal convolution remains a central paradigm for efficient and expressive modeling in high-dimensional dynamic, video, and graph-structured data, with ongoing developments continually extending its applicability and impact across domains (Xu et al., 2021, Li et al., 2021, Kossaifi et al., 2019, Budden et al., 2016, Han, 22 Apr 2025, Su et al., 2020, Hajimolahoseini et al., 23 Jul 2024, Isufi et al., 2021).