Spatial Temporal Patch Transformer
- The paper introduces a novel architecture that partitions video sequences into spatial-temporal patches for efficient hierarchical modeling.
- It employs a pyramid of attention mechanisms, combining local windowed and global shifted self-attention for improved feature extraction.
- Empirical results demonstrate superior action detection and classification performance with reduced computational complexity compared to full-attention models.
A Spatial Temporal Patchwise Transformer (STPT) is a class of deep neural architectures designed to model both local and global spatio-temporal dependencies in video or sequential visual data efficiently and accurately. The principal innovation of STPT lies in partitioning the input into small spatial-temporal patches (“patchwise”) and processing these with a pyramid of attention mechanisms: local windowed attention in shallow, high-resolution layers and global attention in deeper, low-resolution layers. This paradigm supports high-resolution temporal tasks such as action detection and fine-grained classification while maintaining computational efficiency compared to conventional full-attention Transformer designs (Zha et al., 2021; Weng et al., 2022).
1. Foundational Concepts
STPT builds on the premise that pure Transformer-based architectures offer flexible context modeling but scale quadratically in cost with sequence length and spatial resolution. Early spatio-temporal learning frameworks relied on ConvNets or sequential models (e.g., LSTM) to learn intra- and inter-frame structure but struggled to model complex, long-range dependencies. STPT leverages the tokenization of video streams into spatial-temporal “patches” or “cubes,” enabling both fine-grained local interaction and efficient aggregation over longer scopes by hierarchically mixing self-attention windows.
Concrete implementations, such as the Shifted Chunk Transformer (SCT) (Zha et al., 2021) and the Spatio-Temporal Pyramid Transformer (Weng et al., 2022), instantiate this general concept, demonstrating strong empirical performance on standard video understanding benchmarks.
2. Patchwise Partitioning and Embedding
The STPT framework operates on video clips formalized as tensors $X \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$ is the number of frames and each frame has spatial dimension $H \times W$.
- Patch (or Chunk) Partitioning: Each frame is divided into non-overlapping “tiny patches” of size $p \times p$, which are flattened into vectors $x \in \mathbb{R}^{3p^2}$. Patches are then grouped into larger “chunks” or windows, each typically comprising a small grid of adjoining patches.
- 3D Patch Embedding: As in (Weng et al., 2022), the video is embedded via a 3D depth-wise convolution $Z = \mathrm{DWConv3D}(X)$ with a $k_t \times k_h \times k_w$ kernel and stride $s_t \times s_h \times s_w$, reducing the temporal and spatial dimensions and projecting features into channel space ($\mathbb{R}^{C}$).
- Token Grid: After embedding, the video patch grid forms a sequence or set of tokens to be processed with subsequent attention modules.
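The partitioning step above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from either paper; the function name `partition_patches` and the $p = 4$ patch size are illustrative choices.

```python
import numpy as np

def partition_patches(video, p=4):
    """Split each frame of a (T, H, W, C) clip into non-overlapping
    p x p patches and flatten each patch into a token vector."""
    T, H, W, C = video.shape
    assert H % p == 0 and W % p == 0, "frame size must be divisible by p"
    # (T, H/p, p, W/p, p, C) -> (T, H/p, W/p, p*p*C)
    tokens = (video.reshape(T, H // p, p, W // p, p, C)
                   .transpose(0, 1, 3, 2, 4, 5)
                   .reshape(T, H // p, W // p, p * p * C))
    return tokens

clip = np.random.rand(8, 32, 32, 3)   # 8 frames of 32x32 RGB
tokens = partition_patches(clip, p=4)
print(tokens.shape)                   # (8, 8, 8, 48): an 8x8 patch grid per frame
```

Grouping these tokens into chunks/windows is then just a further reshape over the patch-grid axes.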
3. Local and Shifted Attention Mechanisms
STPT alternates between two primary attention paradigms within its backbone:
3.1 Local Spatial-Temporal Attention
In the shallow, high-resolution stages, attention is confined to non-overlapping 3D windows (“patches” or “chunks”). For a given stage with input tensor $X \in \mathbb{R}^{T' \times H' \times W' \times C}$, STPT applies Local Spatio-Temporal Attention (LSTA) as follows:
- Query, Key, Value Computation: $Q = XW_Q$, $K = XW_K$, $V = XW_V$, with learned projections $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$.
- Window Partitioning: The $Q$, $K$, $V$ tensors are partitioned into non-overlapping 3D windows, each of size $w_t \times w_h \times w_w$.
- Downsampling (Optional): Within each window, the key and value tensors can be locally pooled to a smaller size $K'$, $V'$ by small strided 3D convolutions to reduce computation.
- Local Attention: Standard dot-product attention is computed per window: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d}\right)V$.
- Complexity: This design scales linearly with the number of local windows for fixed window sizes.
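The windowed-attention steps above can be sketched as follows. This is a simplified 1-D NumPy illustration (identity $Q/K/V$ projections, single head, no optional pooling), not the papers' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_window_attention(x, window=4):
    """Dot-product self-attention restricted to non-overlapping
    windows of `window` tokens; x has shape (N, d)."""
    N, d = x.shape
    assert N % window == 0, "token count must be divisible by window size"
    out = np.empty_like(x)
    for s in range(0, N, window):
        q = k = v = x[s:s + window]          # identity projections for brevity
        attn = softmax(q @ k.T / np.sqrt(d)) # (window, window) scores
        out[s:s + window] = attn @ v
    return out

x = np.random.rand(16, 8)
y = local_window_attention(x, window=4)
```

Because each token only attends within its own window, cost grows as $O(N \cdot w \cdot d)$ rather than $O(N^2 d)$ for full attention, which is the linear scaling noted above.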
3.2 Shifted Temporal Attention and Global Attention
Deeper in the hierarchy, after significant spatial-temporal downsampling, STPT switches to global self-attention or to explicitly shifted temporal attention.
- Shifted MSA (Temporal Alignment): In (Zha et al., 2021), the SCT alternates with a shifted MSA which, for each patch position $i$ in frame $t$, queries features from the previous frame $t-1$ via the key, thereby aligning spatial information across time: $\hat{z}_t^{\,i} = \mathrm{Attn}\!\left(Q_t^{\,i}, K_{t-1}, V_{t-1}\right)$.
This structure forces explicit patchwise temporal alignment and models frame-to-frame motion at fine granularity.
- Global Spatio-Temporal Attention: When resolution is sufficiently reduced (stages 3–4 in (Weng et al., 2022)), the entire spatio-temporal token grid of $T' \cdot H' \cdot W'$ tokens participates in standard Transformer attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d}\right)V$.
- Hierarchical Pooling: After attention, pooling modules reduce spatial resolution and expand channel width, forming a pyramid and supporting progressively coarser but more semantically abstract representations.
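The shifted temporal attention described above can be sketched in NumPy as follows. This is a schematic reading of the mechanism (each frame's patch queries attend to the previous frame's keys/values; frame 0 falls back to itself), not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shifted_temporal_attention(tokens):
    """tokens: (T, N, d) patch features. Frame t's queries attend to
    frame t-1's keys/values, aligning patches across adjacent frames."""
    T, N, d = tokens.shape
    out = np.empty_like(tokens)
    for t in range(T):
        q = tokens[t]
        kv = tokens[max(t - 1, 0)]           # shifted: previous frame
        attn = softmax(q @ kv.T / np.sqrt(d))
        out[t] = attn @ kv
    return out

feats = np.random.rand(4, 6, 8)              # 4 frames, 6 patches, dim 8
aligned = shifted_temporal_attention(feats)
```

Because the cross-frame attention is computed per patch position over a full frame of keys, it models frame-to-frame motion at patch granularity, matching the description above.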
4. Hierarchical Feature Learning and Temporal Aggregation
The pyramid structure is central to STPT efficiency:
- Layer Organization: Stages alternate LSTA and GSTA, e.g., “LLGG” schedule (LSTA in shallow, GSTA in deep layers), as found effective by ablation (Weng et al., 2022).
- Layer Pipeline (SCT (Zha et al., 2021)):
- $N$ layers of local self-attention + MLP per chunk.
- One global within-frame LSHAtt (locality-sensitive hashing attention) layer for efficient token mixing.
- Linear pooling that halves spatial grid size.
- Repeat for four stages, producing hierarchical spatial resolutions.
- Temporal Aggregation: After spatial feature extraction, STPT aggregates temporal information. In SCT, a dedicated “clip encoder” is applied to the sequence of per-frame [CLS] tokens: $Z_{\mathrm{clip}} = \left[z_{\mathrm{CLS}}^{1}, z_{\mathrm{CLS}}^{2}, \ldots, z_{\mathrm{CLS}}^{T}\right]$.
Standard multi-layer Transformer blocks then produce a clip-level logit for classification.
- Action Detection Head: In detection, outputs from late stages are fed to a Temporal Feature Pyramid Network (TFPN), producing multi-scale 1D feature maps for anchor-free boundary regression, classification, and refinement (Weng et al., 2022).
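The clip-level aggregation step can be sketched as below. This is a hypothetical single-layer stand-in for the multi-layer clip encoder (one self-attention mixing step over the per-frame [CLS] tokens, mean pooling, and a linear classifier), intended only to show the data flow:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clip_head(cls_tokens, W_cls):
    """cls_tokens: (T, d) per-frame [CLS] embeddings.
    One self-attention step mixes information across frames, then
    mean-pooling and a linear map (W_cls: (d, num_classes)) yield
    a single clip-level logit vector."""
    T, d = cls_tokens.shape
    attn = softmax(cls_tokens @ cls_tokens.T / np.sqrt(d))  # (T, T)
    mixed = attn @ cls_tokens                               # temporal mixing
    return mixed.mean(axis=0) @ W_cls                       # (num_classes,)

rng = np.random.default_rng(0)
cls_tokens = rng.standard_normal((8, 16))   # 8 frames, dim 16
W_cls = rng.standard_normal((16, 400))      # e.g. 400 Kinetics classes
logits = clip_head(cls_tokens, W_cls)
```

A real clip encoder stacks several such Transformer blocks with learned projections and positional information; the sketch keeps only the aggregation structure.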
5. Efficiency–Accuracy Trade-offs and Ablation
STPT architectures are designed to maximize descriptive power while minimizing redundancy and computation:
- Locality-Redundancy Balance: Early layers handle local features; global interactions are deferred to later stages where the spatial-temporal grid is smaller. Pure local or pure global stacking is sub-optimal; hybrid pyramidal design delivers the best trade-off (Weng et al., 2022).
- Empirical Effects:
- Tiny spatial patches combined with local windowed processing yield significantly better performance than large ViT-style patches (Zha et al., 2021).
- Explicit temporal shift in attention modules outperforms standard (unshifted) spatial-only attention.
- Ablating local/global mixture, shifted attention, or temporal hierarchy leads to measurable performance degradation in classification and detection tasks.
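The locality-redundancy argument above can be made concrete with a back-of-envelope cost comparison. The token-grid and window sizes here are illustrative assumptions, not figures from either paper:

```python
# Rough attention cost (multiply-accumulates in QK^T and AV),
# illustrating why local windows are used at high resolution.
def attn_cost(n_tokens, d):
    return 2 * n_tokens**2 * d   # QK^T and AV each cost ~ n^2 * d

T, H, W, d = 8, 56, 56, 96       # illustrative early-stage token grid
N = T * H * W                    # 25088 tokens
window = 2 * 7 * 7               # tokens per 3D window (2 x 7 x 7)

full = attn_cost(N, d)                            # one full-attention layer
local = (N // window) * attn_cost(window, d)      # same tokens, windowed
print(f"full/local cost ratio: {full / local:.0f}x")   # prints 256x
```

The ratio is simply $N / w$: at early high-resolution stages the saving is large, while at late stages (small $N$) global attention becomes affordable, which is exactly the hybrid schedule the ablations favor.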
6. Experimental Performance and Benchmarks
STPT-based models achieve state-of-the-art results on video understanding tasks with notable computational efficiency:
| Model | Task | Top-1 (Kinetics-400) | mAP (THUMOS14) | GFLOPs |
|---|---|---|---|---|
| SCT-L (Zha et al., 2021) | Action Recognition | 83.0 | – | 342.6 |
| STPT (LLGG) (Weng et al., 2022) | Action Detection | – | 53.6 | 111.2 |
| I3D+AFSD (RGB) | Action Detection | – | 43.5 | 163.0 |
- SCT outperforms ViViT-L and TimeSformer at similar or lower computational budget in action recognition.
- STPT surpasses I3D+AFSD by 10.1 mAP points on THUMOS14 while using ~31% fewer GFLOPs.
- On ActivityNet 1.3, STPT attains mAP of 33.4% with 46% fewer GFLOPs than AFSD.
Performance gains are attributed to hierarchical patchwise modeling, locality-sensitive attention, explicit motion alignment, and pyramidal pooling (Zha et al., 2021, Weng et al., 2022).
7. Relation to Canonical Architectures and Extensions
STPT embodies a generalized approach to hierarchical, patchwise spatio-temporal modeling within the Transformer ecosystem:
- Spatial Patchwise Transformer: Each frame is decomposed into small patches, facilitating local representation before broader aggregation.
- Temporal Patchwise Attention: Modules such as shifted MSA align and track patchwise features across adjacent frames, supporting explicit motion modeling.
- Hierarchical Processing: Progressive pooling and attention layers mirror ConvNet pyramids, adapting spatial resolution and semantic level with depth.
A plausible implication is that STPT’s pyramid structure, windowed self-attention design, and explicit temporal cross-patch alignment form a template for scalable, efficient spatio-temporal representation learning in diverse video analysis applications.
References:
(Zha et al., 2021): Shifted Chunk Transformer for Spatio-Temporal Representational Learning
(Weng et al., 2022): An Efficient Spatio-Temporal Pyramid Transformer for Action Detection