
Temporal Shift Module (TSM)

Updated 15 December 2025
  • TSM is a parameter-free module that injects temporal context via channel-wise feature shifts, facilitating efficient video and sequential data processing.
  • It achieves computational efficiency by remapping indices without extra parameters or FLOPs, outperforming traditional 3D convolutions in speed.
  • TSM finds applications in action recognition, video synthesis, and speech modeling, with implementations in both CNNs and Transformer-based architectures.

A Temporal Shift Module (TSM) is a parameter-free, nearly zero-computation architectural primitive designed to inject temporal context into standard feed-forward neural networks. The innovation is a channel-wise shift of intermediate feature maps along the temporal dimension, enabling information flow from adjacent frames or timesteps without resorting to explicit temporal convolutions or recurrent connections. TSM has driven advancements in efficient video understanding, action recognition, latent diffusion for video, speech sequence modeling, and even video inpainting, all while retaining the computational characteristics of 2D backbones. Its key property is leveraging time-adjacency via simple re-indexing, allowing highly efficient scaling and deployment, particularly in resource-constrained or real-time settings (Lin et al., 2019, Duong et al., 29 Jan 2025, An et al., 2023, Shen et al., 2023, Hashiguchi et al., 2022, Chang et al., 2019).

1. Mathematical Formulation and Core Mechanism

A TSM operates on a feature tensor $X \in \mathbb{R}^{N \times C \times T \times H \times W}$ (batch, channels, time, height, width). For each frame $t$, the channels are partitioned into three contiguous groups: a forward-shift set ($C_f$), a backward-shift set ($C_b$), and a static set ($C_s$). Typically $C_f = C_b = \alpha C$ (e.g., $\alpha = 1/8$) and $C_s = C - 2\alpha C$.

For fixed $\alpha$, the shift is realized as follows:
$$\begin{aligned} Y_f[n,i,t,h,w] &= X_f[n,i,t+1,h,w], \\ Y_b[n,i,t,h,w] &= X_b[n,i,t-1,h,w], \\ Y_s[n,i,t,h,w] &= X_s[n,i,t,h,w]. \end{aligned}$$
Boundary indices are handled via zero padding or replication. The output $Y$ is given by

$$Y = \operatorname{concat}(Y_f,\, Y_b,\, Y_s) \in \mathbb{R}^{N \times C \times T \times H \times W}.$$

This operation is a pure index permutation—no learnable parameters, no floating point arithmetic, and no effect on the spatial convolution kernel.

Minimal PyTorch implementation:

```python
import torch

def temporal_shift(x, fold):
    # x: [N, C, T, H, W]. The first `fold` channels shift forward
    # (Y[t] = X[t+1]), the next `fold` shift backward (Y[t] = X[t-1]),
    # and the rest stay static. Boundaries are zero-padded, matching the
    # formulation above (torch.roll would wrap around the clip instead).
    out = torch.zeros_like(x)
    out[:, :fold, :-1] = x[:, :fold, 1:]                  # forward shift
    out[:, fold:2*fold, 1:] = x[:, fold:2*fold, :-1]      # backward shift
    out[:, 2*fold:] = x[:, 2*fold:]                       # static channels
    return out
```
(Lin et al., 2019, Duong et al., 29 Jan 2025)

2. Integration into Neural Network Architectures

TSM is typically inserted immediately before a spatial convolutional operator (e.g., the $3 \times 3$ conv in a ResNet bottleneck block). For a residual block, the shift can be performed either in-place on the main path or as a "residual shift" on the branch only. Both variants are empirically effective, though residual shift is sometimes preferred in CNNs with heavy projection heads (Shen et al., 2023, Lin et al., 2019).
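A minimal sketch of the residual-shift variant, built on the `temporal_shift` function above; the wrapper class and the assumption that `branch` is a 2D module preserving [C, H, W] are illustrative, not the papers' exact code:

```python
import torch
import torch.nn as nn

class ResidualShiftBlock(nn.Module):
    # "Residual shift": the temporal shift feeds only the convolutional
    # branch, while the identity path stays unshifted.
    def __init__(self, branch: nn.Module, fold: int):
        super().__init__()
        self.branch = branch          # any 2D module preserving [C, H, W]
        self.fold = fold

    def forward(self, x):             # x: [N, C, T, H, W]
        n, c, t, h, w = x.shape
        y = temporal_shift(x, self.fold)
        # fold time into batch so the plain 2D branch processes each frame
        y = y.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        y = self.branch(y)
        y = y.reshape(n, t, c, h, w).permute(0, 2, 1, 3, 4)
        return x + y
```

Keeping the identity path unshifted is what distinguishes this from the in-place variant, where the shifted tensor replaces `x` on the main path as well.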

The TSM operation is equally applicable to 1D features (e.g., speech or sequential data, $X \in \mathbb{R}^{T \times C}$), to the latent space of diffusion models ($X \in \mathbb{R}^{B \times F \times C \times H \times W}$), and even within Transformer-based models by shifting head-wise or patch-wise tokens for temporal interaction (An et al., 2023, Hashiguchi et al., 2022).

In practice, for vision backbones, the TSM module is applied at every residual block before the main spatial convolution. For speech or sequential models (“ShiftCNN,” “ShiftLSTM,” or “Shiftformer”), it is inserted as dictated by the base layer structure (Shen et al., 2023).
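The same re-indexing carries over to 1D sequences; a minimal sketch for features shaped [N, T, C] (layout assumed), using the zero-padded boundaries from above:

```python
import torch

def temporal_shift_1d(x, fold):
    # x: [N, T, C] — e.g., frame-level speech features
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # forward shift
    out[:, 1:, fold:2*fold] = x[:, :-1, fold:2*fold]      # backward shift
    out[:, :, 2*fold:] = x[:, :, 2*fold:]                 # static channels
    return out
```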

3. Computational Properties

TSM adds no parameters and requires no additional floating-point operations. All computation is reduced to index remapping and memory copy operations, so the FLOPs remain identical to the original backbone. In contrast, 3D convolutions or temporal self-attention introduce $O(T \cdot K_t \cdot K_h \cdot K_w \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}})$ or $O(F^2 \cdot HW \cdot d)$ FLOPs, respectively, and significant parameter overhead (Lin et al., 2019, An et al., 2023).
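As a back-of-envelope illustration of this gap (the layer dimensions below are assumptions for illustration, not figures from the cited papers):

```python
# One 3x3x3 temporal convolution layer vs. the zero-FLOP shift
T, H, W = 8, 56, 56                       # frames and spatial resolution
C_in = C_out = 64                         # channel widths
macs = T * H * W * C_out * (3 * 3 * 3 * C_in)
print(f"3D conv: {2 * macs / 1e9:.1f} GFLOPs per clip; TSM: 0 extra FLOPs")
# -> 3D conv: 5.5 GFLOPs per clip; TSM: 0 extra FLOPs
```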

Key computational advantages:

  • Throughput of TSM-augmented networks is nearly double that of 3D CNNs on the same hardware due to optimized 2D cuDNN kernels.
  • Zero additional parameters, so models scale better in multi-GPU distributed training, with parameter synchronization costs limited to the original 2D backbone (Lin et al., 2019).
  • On Kinetics-400, TSM-ResNet50 achieves 74.0% top-1 accuracy after 14 minutes of training on 1,536 GPUs, maintaining ≈98% parallel efficiency (Lin et al., 2019).

4. Empirical Performance and Ablation Insights

TSM has demonstrated efficacy in large-scale video recognition, action recognition, and sequence modeling tasks.

Selected results:

| Backbone | Modality | Top-1 | Top-5 | Dataset | Reference |
|---|---|---|---|---|---|
| TSM-ResNet50 | RGB | 0.560 | 0.926 | MMVPR | (Duong et al., 29 Jan 2025) |
| TSM-ResNeSt269 | RGB | 0.962 | 1.000 | MMVPR | (Duong et al., 29 Jan 2025) |
| TSM-ResNeSt269 | IR | 0.986 | 1.000 | MMVPR | (Duong et al., 29 Jan 2025) |
| Ensemble (3×TSM) | – | 1.000 | 1.000 | MMVPR | (Duong et al., 29 Jan 2025) |
| TSM-ResNet50 (8 frames) | – | 0.741 | – | Kinetics-400 | (Lin et al., 2019) |
| ShiftCNN | Speech | 0.748 | – | IEMOCAP-UA | (Shen et al., 2023) |

Ablation findings across domains:

  • The optimal shift ratio is generally $\alpha = 1/8$; smaller ratios under-utilize temporal cues, while larger ones (e.g., $\alpha > 1/4$) degrade spatial capacity.
  • Fewer shifted channels (e.g., 16-30%) suffice; excessive shift reduces representation power for framewise features (Lin et al., 2019, Hashiguchi et al., 2022).
  • TSM blocks are best inserted evenly across all residual groups for video backbones. For sequence models, later blocks benefit more from mingling (Shen et al., 2023).
  • “Residual shift” is preferable in CNNs; “in-place shift” is sometimes necessary for RNNs due to projection path semantics (Shen et al., 2023).

5. Generalizations and Adaptations

In diffusion models for text-to-video, a TSM can upgrade a pretrained 2D image U-Net to a video denoiser by applying the shift on the residual branch of each ResNet block, with a split factor $g = 3$ (An et al., 2023).
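In terms of the `temporal_shift` sketch above, the $g = 3$ split corresponds to `fold = C // 3`: one third of the channels shifts forward, one third backward, and one third stays static. A toy usage example (the latent shapes are illustrative only):

```python
import torch

z = torch.randn(2, 16, 320, 32, 32)   # [B, F, C, H, W] U-Net latent, shapes illustrative
z_cfhw = z.permute(0, 2, 1, 3, 4)     # to [B, C, F, H, W] for the sketch above
z_shifted = temporal_shift(z_cfhw, fold=z_cfhw.shape[1] // 3)   # g = 3 split
assert z_shifted.shape == z_cfhw.shape
```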

For Transformers, TokenShift and MSCA introduce feature shifting at the token or head level. The MSCA-KV variant shifts both keys and values from neighboring frames in a subset of attention heads, achieving higher accuracy than TokenShift or vanilla ViT on Kinetics-400 (e.g., MSCA-KV top-1: 76.47% vs. ViT's 75.65%) at no extra cost (Hashiguchi et al., 2022).
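A minimal sketch of head-wise key/value shifting in the spirit of MSCA-KV; the tensor layout, shift directions, and zero padding here are assumptions, not the paper's exact recipe:

```python
import torch

def shift_kv(k, v, n_fwd, n_bwd):
    # k, v: [N, heads, T, P, d] — T frames, P patch tokens per frame
    # (layout assumed). Heads [0, n_fwd) see keys/values from the next
    # frame, heads [n_fwd, n_fwd + n_bwd) from the previous frame; the
    # remaining heads are untouched. Clip edges are zero-padded.
    def shift(x):
        out = x.clone()
        out[:, :n_fwd, :-1] = x[:, :n_fwd, 1:]
        out[:, :n_fwd, -1] = 0
        out[:, n_fwd:n_fwd + n_bwd, 1:] = x[:, n_fwd:n_fwd + n_bwd, :-1]
        out[:, n_fwd:n_fwd + n_bwd, 0] = 0
        return out
    return shift(k), shift(v)
```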

The Learnable Gated Temporal Shift Module (LGTSM) extends TSM with a learnable 1D temporal convolution and a spatial gating mechanism, producing SOTA video inpainting at one-third the parameter count of 3D convolutions (Chang et al., 2019). This suggests that learnable and gated extensions of TSM further improve temporal consistency in generative tasks.
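A hedged sketch of LGTSM's two ingredients: a learnable depthwise temporal convolution that generalizes the fixed shift, plus a spatial sigmoid gate. The kernel sizes and exact gating form below are assumptions:

```python
import torch
import torch.nn as nn

class LearnableGatedShift(nn.Module):
    # Sketch only: learnable per-channel temporal mixing + spatial gating.
    def __init__(self, channels, k=3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, k,
                                  padding=k // 2, groups=channels)
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                        # x: [N, C, T, H, W]
        n, c, t, h, w = x.shape
        # learnable temporal conv, applied independently at every pixel
        y = x.permute(0, 3, 4, 1, 2).reshape(n * h * w, c, t)
        y = self.temporal(y)
        y = y.reshape(n, h, w, c, t).permute(0, 3, 4, 1, 2)
        # spatial gate, computed framewise
        g = self.gate(x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w))
        g = torch.sigmoid(g).reshape(n, t, c, h, w).permute(0, 2, 1, 3, 4)
        return y * g
```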

6. Applications and Domain Coverage

TSM’s zero-cost design has led to adoption in:

  • Efficient video understanding and action recognition (Lin et al., 2019, Duong et al., 29 Jan 2025).
  • Text-to-video synthesis with latent diffusion models (An et al., 2023).
  • Speech and sequential data modeling (Shen et al., 2023).
  • Video inpainting via LGTSM (Chang et al., 2019).
  • Transformer-based video classification (Hashiguchi et al., 2022).

Its modal-agnostic structure also enables multi-modal fusion (RGB, IR, depth), as demonstrated in multi-modal competition settings (Duong et al., 29 Jan 2025).

7. Practical Considerations, Limitations, and Research Directions

Best practices distilled from ablations:

  • Use residual TSM for CNNs and MLPs, in-place for RNNs.
  • Shift 1/8 of channels forward and 1/8 backward; do not exceed 1/4 total shifted channels (see the quick check after this list).
  • For ViTs, shift 2 of 12 attention heads per block (≈16.7%) (Hashiguchi et al., 2022).
  • Downstream models benefit from shift placement in later blocks to maximize mingling while minimizing misalignment (Shen et al., 2023).
  • In distributed training, keep the frame count per clip small ($T = 8$) for optimal I/O and hardware scaling (Lin et al., 2019).
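A quick sanity check of the recommended shift budget (a toy calculation, with the block width assumed for illustration):

```python
C = 256                      # channels in a hypothetical residual block
fold = C // 8                # 32 channels forward, 32 backward
print(2 * fold / C)          # 0.25 — exactly at the advised 1/4 ceiling
```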

Potential extensions include multi-scale temporal shifts, learned shift parameters (as in LGTSM), combinations of TSM with attention or convolutional mixers, and applications beyond vision to speech, time series, and generative modeling (Shen et al., 2023, Chang et al., 2019). TSM's limitations include an inability to capture long-range dependencies beyond the local shift window and potential misalignment when too many channels are shifted. Nevertheless, its computational efficiency and flexibility have made it a standard component in efficient temporal modeling pipelines.
