
Sparse Temporal Token Fusion (STTF)

Updated 30 November 2025
  • Sparse Temporal Token Fusion is an adaptive compression method that reuses token embeddings and only re-encodes regions with significant changes in video data.
  • It exploits high temporal redundancy to reduce computational load, optimize memory usage, and accelerate processing on resource-constrained edge devices.
  • Empirical evaluations show up to 84% token reduction and 13× speedup with less than a 5% loss in accuracy compared to dense transformer models.

Sparse Temporal Token Fusion (STTF) is an adaptive compression technique designed for real-time deployment of vision-language models (VLMs) on resource-constrained edge devices. STTF leverages the high temporal redundancy present in video and event-based data by conditionally reusing existing token embeddings and re-encoding only those representing regions of significant change. This conditional token update strategy reduces computational overhead, optimizes memory usage, and lowers latency without substantial loss in task accuracy (Tanvir et al., 23 Nov 2025).

1. Motivation: Temporal Redundancy and Edge Constraints

Edge VLMs for scenarios such as drones or wearables must operate under strict constraints in power, memory, and compute. Classical per-frame transformer encoding is inefficient for streaming visual data due to pronounced temporal redundancy; spatial patches across consecutive frames often remain static, resulting in wasteful recomputation and excessive FLOPs. STTF addresses this by incrementally updating the token set, fusing "stale" tokens with re-encoded ones at each time step.

At any time $t \in \{1,\dots,T\}$, the visual input can be:

  • An RGB frame $x_t \in \mathbb{R}^{3\times H\times W}$, or
  • A neuromorphic event tensor $e_t \in \mathbb{R}^{2\times H\times W}$ (with polarity and count channels).

Each $x_t$ is partitioned into $N$ non-overlapping patches (e.g., $16\times16$), each embedded into a $D$-dimensional vector, yielding $\{x_t^i\}_{i=1}^N,\ x_t^i \in \mathbb{R}^{D}$.
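
For concreteness, the patchify-and-embed step might be implemented as in the following PyTorch sketch; the strided-convolution projection, patch size 16, and embedding dimension $D=192$ are illustrative assumptions, not details fixed by the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a frame into non-overlapping patches and project each to D dimensions.

    Patch size 16 and D = 192 are illustrative choices, not values fixed by the paper.
    """
    def __init__(self, in_ch: int = 3, patch: int = 16, dim: int = 192):
        super().__init__()
        # A strided convolution is a standard way to fuse patchify + linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> tokens: (B, N, D) with N = (H / patch) * (W / patch)
        feat = self.proj(x)                     # (B, D, H/patch, W/patch)
        return feat.flatten(2).transpose(1, 2)  # (B, N, D)

x_t = torch.randn(1, 3, 224, 224)   # one 224x224 RGB frame
tokens = PatchEmbed()(x_t)          # shape (1, 196, 192): N = 196 tokens per frame
```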

2. Mathematical Formulation and Fusion Algorithm

Let $x_t = [x_t^1,\dots,x_t^N] \in \mathbb{R}^{N\times D}$ denote the current patch embeddings. Fused token embeddings from the previous timestep are denoted $\hat{x}_{t-1}$.

Event-driven change detection is performed via the function:

$$\phi(x_{t-1}^i, x_t^i) = \Vert x_t^i - x_{t-1}^i \Vert_2 > \tau,$$

where $\tau > 0$ is a tunable threshold. This establishes a binary mask $M_t \in \{0,1\}^N$:

$$M_t^i = \begin{cases} 1, & \text{if } \phi(x_{t-1}^i, x_t^i) \\ 0, & \text{otherwise} \end{cases}$$

A value $M_t^i = 1$ indicates the patch must be re-encoded; $M_t^i = 0$ signals reuse.

The sparse fusion update is computed as:

$$\hat{x}_t = M_t \odot \mathrm{Encoder}(x_t) + (1 - M_t) \odot \hat{x}_{t-1}$$

where $\odot$ denotes broadcasted element-wise multiplication.

Pseudocode summary:

Inputs:
    x_t (patch embeddings of frame t), optional e_t (event map)
    Previous state: x̂_{t-1} ∈ ℝ^{N×D}
    Threshold τ
Output:
    x̂_t ∈ ℝ^{N×D}
Algorithm:
    if t == 1:
        x̂_1 = Encoder(x_1)
        return x̂_1
    else:
        [Optional] coarse change mask m_t spatially upsampled from e_t
        For each token i in 1..N:
            Δ^i = ‖x_t^i − x̂_{t-1}^i‖₂
            M_t^i = 1 if Δ^i > τ else 0
        E_t = Encoder(x_t), computed only for tokens with M_t^i = 1
        For i in 1..N:
            x̂_t^i = M_t^i · E_t^i + (1 − M_t^i) · x̂_{t-1}^i
        return x̂_t
The output token set $\hat{x}_t$ is suitable for downstream multi-modal attention and language decoding.
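
A minimal PyTorch sketch of one fusion step, mirroring the pseudocode above, is given below. The function name `sttf_step`, the stand-in linear encoder, and the choice to measure change against the cached fused tokens (as the pseudocode does) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def sttf_step(x_t: torch.Tensor,
              x_hat_prev: torch.Tensor,
              encoder: nn.Module,
              tau: float = 0.2) -> torch.Tensor:
    """One STTF update: re-encode only tokens whose embedding changed by more than tau.

    x_t        : (N, D) current patch embeddings
    x_hat_prev : (N, D) fused token embeddings cached from the previous step
    encoder    : per-token encoder (a stand-in for the paper's vision encoder)
    """
    # Change detection: per-token L2 distance against the cached state,
    # vectorized over all N tokens as in the pseudocode.
    delta = torch.linalg.vector_norm(x_t - x_hat_prev, dim=-1)  # (N,)
    mask = delta > tau                                          # binary mask M_t

    # Sparse fusion: re-encode only the K_t active tokens, reuse the rest as-is.
    x_hat_t = x_hat_prev.clone()
    if mask.any():
        x_hat_t[mask] = encoder(x_t[mask])
    return x_hat_t

# Toy usage (inference) with a stand-in linear encoder; a real deployment would
# restrict the model's vision encoder to the active token subset.
torch.manual_seed(0)
N, D = 196, 192
encoder = nn.Linear(D, D)
with torch.no_grad():
    x_hat = encoder(torch.randn(N, D))        # t = 1: dense encoding of the first frame
    x_t = torch.randn(N, D)                   # t = 2: next frame's patch embeddings
    x_hat = sttf_step(x_t, x_hat, encoder)    # sparse fusion of stale and refreshed tokens
```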

3. Hardware-Aware Implementation and Computational Savings

STTF architecture is tailored for edge hardware:

  • Memory: Fused token sets $\hat{x}_{t-1}$ are cached in on-chip SRAM/scratchpad for low-latency updates.
  • Parallelism: Vectorized threshold comparisons ($\Delta^i > \tau$) allow simultaneous mask computation across all tokens. Only the minimal active token list is encoded, maximizing the benefits of token sparsity.
  • FLOPs reduction: For $N$ tokens and $K_t = \sum_i M_t^i$ refreshed tokens at time $t$, the relative per-frame FLOPs savings is:

$$\mathrm{Savings}_t \approx 1 - \frac{K_t}{N}$$

With $K = \mathbb{E}[K_t]$, the total computational load over the sequence is proportional to $(K/N) \cdot \mathrm{FLOPs}_{\mathrm{dense}}$.
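
As a quick sanity check of this formula, the short helper below (hypothetical, not from the paper) reproduces the reported 84% savings for $K \approx 31$ and $N = 196$:

```python
def flops_savings(k_refreshed: float, n_tokens: int) -> float:
    """Relative per-frame encoder FLOPs savings: Savings_t ≈ 1 - K_t / N."""
    return 1.0 - k_refreshed / n_tokens

# DVS128 Gesture numbers reported below: K ≈ 31 refreshed tokens out of N = 196.
print(f"{flops_savings(31, 196):.0%}")   # -> 84%
```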

4. Empirical Characterization

Extensive evaluation demonstrates substantial gains in token efficiency, accuracy retention, and latency:

  • DVS128 Gesture (event video):
    • Baseline: $N = 196$ tokens/frame.
    • STTF ($\tau = 0.2$): $K \approx 31$ (84% token reduction), with recognition accuracy of 95.6% versus 98.4% for the dense Vision Transformer (ViT) baseline.
  • Encoder speedup: $N/K \approx 6.3\times$ for the fusion stage.
  • End-to-end latency: Up to 13× improvement versus dense ViT+GPT on Jetson Nano hardware.

| Metric | Dense ViT | STTF ($\tau = 0.2$) | Relative |
|---|---|---|---|
| Avg tokens per frame | 196 | 31 | −84% |
| Recognition accuracy | 98.4% | 95.6% | −2.8 pp |
| Encoder FLOPs per frame | 1.0× | 0.16× | −84% |
| End-to-end latency (ms) | 120 | 9 | 13× faster |

Increasing $\tau$ (i.e., more aggressive token reuse) yields reduced computation at a moderate cost to accuracy; $\tau$ in $[0.1, 0.3]$ typically achieves 80–90% token reduction with less than 5% accuracy decrease.

5. Threshold Selection and Stability Techniques

Optimal operation of STTF depends on careful hyperparameter tuning:

  • Threshold $\tau$: It is recommended to select $\tau$ as the $90^{\mathrm{th}}$ percentile of patch embedding changes $\{\Vert x_t^i - x_{t-1}^i \Vert_2\}$ measured on a validation set. Sweeping $\tau$ across $[0.05, 0.5]$ and plotting the resulting token count vs. accuracy trade-off curve identifies the inflection point for practical deployment; a calibration sketch follows this list.
  • Stabilization: Applying a momentum update to cached tokens ($\hat{x}_{t-1}^i \leftarrow \alpha\,\hat{x}_{t-1}^i + (1-\alpha)\,\hat{x}_{t-2}^i$ with $\alpha \approx 0.9$) suppresses re-encoding jitter. Early stopping during training prevents overfitting to sparse updates, and $L_2$ regularization on the sparsity penalty $\|M_t\|_0$ encourages smoother token masks.
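
A minimal calibration sketch along these lines is shown below; the helper `calibrate_tau`, the synthetic validation clip, and the specific sweep grid are illustrative assumptions rather than the paper's code.

```python
import torch

def calibrate_tau(val_frames: list, percentile: float = 90.0) -> float:
    """Set tau to a percentile of per-token embedding changes on a validation clip.

    val_frames: list of (N, D) patch-embedding tensors for consecutive frames.
    The 90th percentile follows the heuristic described above.
    """
    deltas = []
    for prev, curr in zip(val_frames[:-1], val_frames[1:]):
        # Per-token L2 change between consecutive frames, pooled over the whole clip.
        deltas.append(torch.linalg.vector_norm(curr - prev, dim=-1))
    return torch.quantile(torch.cat(deltas), percentile / 100.0).item()

# Example: calibrate on a short (synthetic) clip, then sweep over the suggested
# range [0.05, 0.5] to trace the token-count vs. accuracy trade-off curve.
clip = [torch.randn(196, 192) for _ in range(20)]
tau = calibrate_tau(clip)
sweep = torch.linspace(0.05, 0.5, steps=10)   # candidate thresholds for the sweep
```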

6. Summary and Implications

STTF reconceptualizes transformer encoding for video/event vision as an incremental state update, efficiently combining static token reuse and sparse re-encoding according to data-driven change. This framework enables up to 84% FLOPs reduction and $6$–$13\times$ real-time speedup on edge hardware with minimal accuracy penalty ($<5\%$). The approach is compatible with event-driven computer vision, hardware-friendly due to explicit mask logic and on-chip state buffering, and lends itself to further research in adaptive attention and incremental representation (Tanvir et al., 23 Nov 2025). A plausible implication is that STTF can generalize to broader classes of sequential transformer tasks where high temporal redundancy is present, provided stateful token caching and rapid change detection are feasible.
