
Selective Spatiotemporal Vision Transformer (SSViT)

Updated 16 November 2025
  • SSViT is a vision architecture that leverages selective spatiotemporal attention to process dynamic visual data with high computational efficiency.
  • Its instantiations span biologically inspired spiking neural networks and deep-learning pipelines, targeting tasks such as image classification and modulo video recovery.
  • The design reduces complexity by employing innovative token selection and self-attention modules, achieving superior accuracy with lower memory and FLOPs.

The Selective Spatiotemporal Vision Transformer (SSViT) refers to a class of architectures designed for efficient, high-accuracy spatiotemporal modeling in vision tasks. Two principal and distinct instantiations have been developed: one tailored for spiking neural networks with biological inspiration for edge computing, and another for deep learning-driven modulo video recovery using token selection strategies. Both directions are unified by their core strategy of selectively attending to crucial spatiotemporal regions, either via spike-driven mechanisms or through data-driven token selection, thereby achieving strong performance with significant computational and memory gains.

1. Architectural Foundations

Spiking SSViT (SNN-ViT with SSSA)

The spiking variant of SSViT is built upon two components:

  • Global–Local Spiking Patch Splitting (GL-SPS): Transforms raw images into multi-scale spiking feature maps, partitioning input into patches suitable for spiking processing.
  • Stacked Spiking Transformer Blocks: Each block includes a Saccadic Spike Self-Attention (SSSA) module, followed by a channel-wise MLP layer.

The architecture processes image data hierarchically in a 4-stage pyramid. At each subsequent stage, the spatial resolution halves and the channel dimension typically doubles. Data flows through the structure as: Input Image → GL-SPS → SSSA-Block → MLP → next stage.
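
As a rough illustration of the pyramid (the input size and base channel width below are hypothetical, since the exact stage widths are not reproduced here), the following Python sketch prints how resolution and channels evolve across the four stages:

```python
# Hypothetical 4-stage pyramid: spatial resolution halves and channels double per stage.
H, W, C = 224, 224, 64  # assumed input resolution and base channel width

for stage in range(1, 5):
    print(f"stage {stage}: feature map {H}x{W}, {C} channels")
    H, W, C = H // 2, W // 2, C * 2  # downsample spatially, widen channels for the next stage
```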

Within each stage, tokens representing spatial patches across $T$ timesteps and $D$ channels are processed by SSSA for spatiotemporal mixing, then passed through the MLP. The spiking neuron follows the Leaky Integrate-and-Fire (LIF) model:

$$U[t+1] = \tau U[t] + X[t+1] - S[t]\, V_\text{reset}$$

$$S[t+1] = \Theta(U[t+1] - V_\text{th})$$

where $U$ is the membrane potential, $X$ the synaptic input, $\tau$ the decay constant, $V_\text{th}$ the firing threshold, $S$ the spike output, and $\Theta$ the Heaviside step function.
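
A minimal sketch of these LIF dynamics (array shapes, the decay constant, and the threshold values are illustrative assumptions):

```python
import numpy as np

def lif_step(U, S_prev, X, tau=0.5, V_th=1.0, V_reset=1.0):
    """One LIF update: U[t+1] = tau*U[t] + X[t+1] - S[t]*V_reset, S[t+1] = Theta(U[t+1] - V_th)."""
    U_next = tau * U + X - S_prev * V_reset    # leak, integrate input, subtract reset where a spike fired
    S_next = (U_next >= V_th).astype(U.dtype)  # Heaviside thresholding
    return U_next, S_next

# Illustrative rollout: N tokens with D channels over T timesteps.
T, N, D = 4, 16, 64
U = np.zeros((N, D))
S = np.zeros((N, D))
for t in range(T):
    X = 0.6 * np.random.rand(N, D)             # stand-in synaptic input
    U, S = lif_step(U, S, X)
```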

Deep Learning SSViT for Modulo Video

In modulo video recovery, SSViT takes a window of low-bit observations produced by modulo cameras, with the goal of reconstructing the underlying high-dynamic-range (HDR) content. The pipeline consists of:

  • Preprocessing: Sliding window of $n_c + 1$ consecutive $A$-bit frames, extraction of folding masks $\mathcal{M}^{(k)}_{[x,y,c]}$ and fold counts $L[x,y,c]$.
  • Encoder: Shared CNN-style encoder produces a 4D feature map, further split into spatiotemporal tubes and projected into tokens.
  • Token Selection: Intricate regions are located using a 3D Neighboring Similarity Matrix (NSM), and only the tokens with the highest NSM scores are processed by the Transformer backbone.
  • Transformer: Joint space-time attention is performed over the selected tokens (from the target frame) and all tokens from supporting frames (for context).
  • Decoder: Patchwise binary mask prediction for folding recovery.

Notably, all positional embeddings are omitted, justified empirically by the data-driven nature of the mask classification task.
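
For intuition about the preprocessing targets, the sketch below shows the basic modulo-camera wrapping model and how fold counts $L[x,y,c]$ relate a folded observation back to HDR intensities; the bit depth, helper names, and the direct inversion are illustrative assumptions, not the paper's mask-based pipeline:

```python
import numpy as np

A = 8                      # assumed bit depth of the modulo sensor
MOD = 2 ** A

def fold(hdr):
    """Modulo-camera forward model: intensities wrap around at 2^A."""
    return hdr % MOD

def unfold(folded, fold_counts):
    """Invert the wrapping given per-pixel fold counts L[x, y, c]."""
    return folded + fold_counts * MOD

hdr = np.random.randint(0, 4 * MOD, size=(720, 1280, 3))   # synthetic HDR frame
obs = fold(hdr)                                             # low-bit observation
L = hdr // MOD                                              # ground-truth fold counts
assert np.array_equal(unfold(obs, L), hdr)                  # recovery is exact when L is known
```

The network itself predicts binary folding masks rather than being handed the fold counts; the sketch only illustrates the wrap/unwrap relationship the masks encode.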

2. Selective Attention Mechanisms

Saccadic Spike Self-Attention (SSSA)

Vanilla dot-product self-attention performs poorly on spike-based, binary, and sparse representations due to magnitude fluctuations. SSSA replaces it with a spike-distribution-based measure of spatial relevance:

  • Each $D$-dimensional spike vector is modeled as a Bernoulli process ($p_q$ for the Query, $p_k$ for the Key).
  • Cross-entropy between $q$ and $k$:

$$H(q, k) = -\left[\, p_q \log p_k + (1 - p_q) \log(1 - p_k) \,\right]$$

The silent-period term is dropped, and $\log$ is approximated linearly for spike rates near 0.1–0.2.

  • For all tokens:

$$Q'(t, i) = \sum_{d=1}^D Q[t, i, d], \qquad K'(t, j) = \sum_{d=1}^D K[t, j, d]$$

The cross-attention becomes:

$$\text{CroAtt}(Q, K)[t, i, j] \approx Q'(t, i) \cdot K'(t, j)$$
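
A minimal sketch of this spike-count attention, following the formulas above (tensor shapes and spike rates are illustrative assumptions):

```python
import numpy as np

T, N, D = 4, 64, 128                                       # timesteps, tokens, channels (illustrative)
Q = (np.random.rand(T, N, D) < 0.15).astype(np.float32)    # binary spike tensors
K = (np.random.rand(T, N, D) < 0.15).astype(np.float32)

Q_count = Q.sum(axis=-1)                                   # Q'(t, i) = sum_d Q[t, i, d], shape (T, N)
K_count = K.sum(axis=-1)                                   # K'(t, j) = sum_d K[t, j, d], shape (T, N)

# CroAtt(Q, K)[t, i, j] ≈ Q'(t, i) * K'(t, j): an outer product per timestep.
cro_att = Q_count[:, :, None] * K_count[:, None, :]        # shape (T, N, N)
```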

Saccadic Interaction Module

Inspired by biological saccades, the attention mechanism dynamically selects spatial locations at each timestep using:

  • Salient-patch scoring:

$$\text{PatchSalience}[t, i] = \sum_{j=1}^{N} \text{CroAtt}(Q, K)[t, i, j]$$

  • Temporal salience accumulation via a learnable lower-triangular matrix $M_w \in \mathbb{R}^{T \times T}$:
    • Training: $H = M_w \cdot \text{PatchSalience}$, $S = \Theta(H - V_\text{th})$
    • Inference: Gating depends only on the current $\text{PatchSalience}$ and a dynamically adjusted threshold.
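
A minimal sketch of the saccadic interaction during training, following the formulas above (shapes, spike rates, and the initialization of $M_w$ are illustrative assumptions; the inference-time dynamic threshold is omitted):

```python
import numpy as np

T, N, D = 4, 64, 128
Q = (np.random.rand(T, N, D) < 0.15).astype(np.float32)
K = (np.random.rand(T, N, D) < 0.15).astype(np.float32)
cro_att = Q.sum(-1)[:, :, None] * K.sum(-1)[:, None, :]    # CroAtt, as in the previous sketch

# Patch salience: for each (t, i), sum CroAtt over all keys j.
patch_salience = cro_att.sum(axis=-1)                      # shape (T, N)

# Training-time temporal accumulation with a learnable lower-triangular M_w.
V_th = 1.0
M_w = np.tril(np.random.rand(T, T))                        # illustrative initialization
H = M_w @ patch_salience                                   # H = M_w · PatchSalience (mixes only past timesteps)
S = (H >= V_th).astype(np.float32)                         # S = Θ(H - V_th): saccadic gating spikes
```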

Token Selection via 3D NSM

For deep learning-based SSViT, regions likely to require non-trivial recovery are identified by heterogeneity in local features:

  • NSM combines the Kullback–Leibler divergence between the softmax of local features and a uniform distribution with the average cosine dissimilarity of those features to their neighbors:

$$\text{NSM} = D_\text{KL} + D_{\cos}$$

  • Only the top-$\tilde{N}$ tokens by average NSM score are chosen for full attention, focusing resources where folding ambiguity is highest.
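
A hedged sketch of this selection rule follows; the neighborhood definition (here, all other tokens), the feature shapes, and the token budget are assumptions, since the text only specifies $\text{NSM} = D_\text{KL} + D_{\cos}$ and top-$\tilde{N}$ selection:

```python
import numpy as np

def nsm_scores(tokens):
    """tokens: (n, d) feature vectors for one frame's spatiotemporal tubes."""
    n, d = tokens.shape
    # D_KL: KL divergence between the softmax of each token's features and a uniform distribution.
    p = np.exp(tokens - tokens.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    d_kl = (p * np.log(p * d + 1e-12)).sum(axis=1)          # KL(p || uniform)
    # D_cos: average cosine dissimilarity to other tokens (a stand-in for "neighboring").
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    cos_sim = normed @ normed.T
    d_cos = 1.0 - (cos_sim.sum(axis=1) - 1.0) / (n - 1)     # exclude self-similarity
    return d_kl + d_cos                                      # NSM = D_KL + D_cos

tokens = np.random.randn(256, 64).astype(np.float32)
scores = nsm_scores(tokens)
n_keep = 64                                                  # Ñ, the token budget (assumed)
selected = np.argsort(scores)[-n_keep:]                      # top-Ñ tokens receive full attention
```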

3. Computational Efficiency and Complexity Analysis

Spiking SSViT (SNN-ViT)

  • Baseline self-attention has $O(N^2 D)$ or higher complexity.
  • SSSA exploits distribution-kernel factorizations, reducing this to $O(ND + TN) \approx O(D)$ when $T \ll D$.
  • SSSA-V2 further linearizes computation by compressing kernel operations and thresholding, maintaining full spatiotemporal selectivity without quadratic overhead.
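
The key step behind this linear scaling can be illustrated numerically: because each CroAtt entry factorizes as $Q'(t,i) \cdot K'(t,j)$, a sum over keys collapses to a single precomputed per-timestep scalar, so the $N \times N$ attention map never has to be materialized. The sketch below illustrates that factorization (shapes are arbitrary); it is not a reproduction of the SSSA-V2 kernel:

```python
import numpy as np

T, N = 4, 196
Qc = np.random.rand(T, N)                    # Q'(t, i): per-token spike counts
Kc = np.random.rand(T, N)                    # K'(t, j)

# Quadratic form: build the full (T, N, N) map, then reduce over keys.
quad = (Qc[:, :, None] * Kc[:, None, :]).sum(axis=-1)

# Linear form: reduce the keys first, then scale each query count — O(TN) instead of O(TN^2).
lin = Qc * Kc.sum(axis=1, keepdims=True)

assert np.allclose(quad, lin)
```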

Modulo Video SSViT

  • Token selection avoids processing all $n$ tokens, operating on $\tilde{N} \ll n$ tokens at full Transformer cost $O(\tilde{N}^2)$.
  • Unselected tokens are handled by warping predicted masks with FlowNet2-inferred optical flow, circumventing repeated attention computations.
  • Feature encodings are cached, as frames are encoded once per iteration.
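
As a rough illustration of the warping step for unselected regions, the sketch below propagates a previously predicted binary mask with a dense flow field via grid sampling; the flow itself would come from FlowNet2, and the flow direction, normalization, and nearest-neighbor sampling choices are assumptions rather than details taken from the paper:

```python
import torch
import torch.nn.functional as F

def warp_mask(mask, flow):
    """mask: (1, 1, H, W) binary mask from a previous frame;
    flow: (1, 2, H, W) optical flow in pixels (e.g. from FlowNet2)."""
    _, _, H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = xs.float() + flow[:, 0]                 # displaced sampling locations
    grid_y = ys.float() + flow[:, 1]
    # Normalize to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0), dim=-1
    )
    return F.grid_sample(mask, grid, mode="nearest", align_corners=True)

mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
flow = torch.zeros(1, 2, 64, 64)                     # zero flow: warped mask equals the input
assert torch.allclose(warp_mask(mask, flow), mask)
```
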
| Method/Architecture | Complexity | Memory/FLOPs Saving |
|---|---|---|
| Spiking SSViT (SSSA-V2) | $O(D)$ | Full spike-driven linear scaling |
| Modulo Video SSViT (token selection) | $O(\tilde{N}^2)$ | Attention over $\tilde{N} \ll n$ tokens vs. full attention |

4. Experimental Performance and Benchmarking

Spiking SSViT

  • Image Classification (CIFAR100, T=4):

SNN-ViT achieves 80.1% accuracy with 5.6M parameters and $O(D)$ complexity, outperforming Spikformer (78.2%, 9.3M, $O(N^2D)$) and Spike-driven ViTs (78.4%, 10.3M, $O(ND)$).

  • ImageNet-1K (T=4):

SNN-ViT-8-512 achieves 80.2% Top-1 accuracy ($O(D)$ complexity) at 35.8 mJ energy per sample, competitive with a standard ViT-12-768 (77.9%, $O(N^2D)$, 80.9 mJ).

  • Remote-sensing Detection:

With SNN-ViT as the backbone in YOLO-v3, mAP@0.5 on SSDD improves from 94.8% to 96.7% at $T=4$.

Modulo Video SSViT

  • Datasets:

Tested on LiU (12 sequences, 1280×720) and HdM (up to 1920×1080).

  • Metrics:
    • PSNR in dB (LiU/HdM): SSViT 28.85 / 29.38, besting UnModNet (27.71 / 28.03), Uformer (11.38 / 14.36), and MRF (12.43 / 15.84).
    • SSIM: SSViT 0.871/0.850, UnModNet 0.811/0.824, Uformer 0.482/0.535.
  • Qualitative Assessment:

In high-dynamic-range scenes, SSViT recovers fold edges and fine detail that prior methods tend to lose.

5. Training and Optimization Methodologies

  • Spiking SSViT:
    • Supervised cross-entropy loss on final membrane potentials.
    • Surrogate gradients such as $\Theta'(x) \simeq \exp(-x^2/\beta)$ are applied for backpropagation through the Heaviside spiking function (a sketch follows this list).
    • Regularization via weight decay on $M_w$ and scheduling of $V_\text{th}$ or $\tau$.
  • Modulo Video SSViT:
    • Standard cross-entropy loss is applied to binary folding mask prediction, accumulated over selected tokens at each iteration.
    • Training is end-to-end with the Adam optimizer (lr $= 1\times10^{-4}$), 200k iterations, and a clip length of $n_c = 4$.
    • Outputs are temporally tone-mapped with a smoothed Reinhard operator.
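
Referring back to the Spiking SSViT bullet above, the following is a minimal PyTorch sketch of the surrogate-gradient trick for the Heaviside spike function, using the Gaussian-shaped surrogate $\Theta'(x) \simeq \exp(-x^2/\beta)$; the value of $\beta$, the threshold convention, and the class name are illustrative assumptions:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Forward: Heaviside step Θ(x). Backward: smooth surrogate exp(-x^2 / beta)."""
    beta = 1.0  # assumed surrogate width

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        surrogate = torch.exp(-x.pow(2) / SurrogateSpike.beta)
        return grad_output * surrogate

spike = SurrogateSpike.apply

# Usage: membrane potential minus threshold passes through the spike nonlinearity.
u = torch.randn(8, 16, requires_grad=True)
s = spike(u - 1.0)           # V_th = 1.0 (illustrative)
s.sum().backward()           # gradients flow through the surrogate, not the true (zero) derivative
print(u.grad.shape)
```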

6. Ablation Studies and Analytical Insights

  • Incorporation of SSSA (spatial attention) improves CIFAR100 accuracy by +1.39%; GL-SPS (patch split) alone yields +0.67%.
  • SSSA and GL-SPS combined (with $O(D)$ complexity) yield +1.89% over the baseline.
  • Micro-ablation of SSSA: replacing the distribution kernel with dot-product or the saccadic neuron with LIF both reduce performance (–0.48% and –0.76%, respectively).
  • Linearized SSSA-V2 delivers comparable accuracy to quadratic SSSA-V1.
| Variant | Params | Complexity | CIFAR100 Acc. |
|---|---|---|---|
| Spikformer (baseline) | 9.32M | $O(N^2D)$ | 78.21% |
| + SSSA (spatial only) | 5.52M | $O(N^2D)$ | 79.60% |
| + GL-SPS (patch only) | 5.81M | $O(N^2D)$ | 77.88% |
| + Both | 5.57M | $O(D)$ | 80.10% |

7. Extensions, Limitations, and Future Work

SSViT’s selective spatiotemporal approaches are amenable to other modalities, such as event-based audio and radar, and to scenarios requiring efficient high-dimensional attention. Biological inspiration (the saccade) can generalize to sequence data beyond vision. Practical hardware acceleration is an active direction, especially given the $O(D)$ complexity and event-driven flow of the spiking architecture.

Current limitations include:

  • Reliance on multi-timestep inference ($T \approx 4$–8).
  • Fixed patch grids; dynamic saccade regions are yet to be realized.
  • $M_w$’s lower-triangular constraint may not capture more general temporal relations.
  • For modulo recovery, explicit positional encoding is omitted because the task does not require absolute coordinates, but this choice could be re-examined for scene-dependent masks.

Future directions include:

  • Learnable dynamic patch grids for adaptive ROI selection in saccadic attention.
  • Hardware-software co-design to exploit event-driven, linear complexity models.
  • Hybrid SNN-ANN stacks, where shallow SNN front-ends produce sparse, saccade-driven masks for downstream ANN modules.

A plausible implication is that selective, biologically inspired spatiotemporal attention offers a systematic path towards both energy and computational efficiency in domains where attention to sparse, informative regions is critical. The demonstrated state-of-the-art results in both spiking vision tasks and in modulo video recovery indicate broad applicability for SSViT architectures under resource constraints (Wang et al., 18 Feb 2025, Geng et al., 9 Nov 2025).
