Sparse Temporal Token Fusion (STTF)
- Sparse Temporal Token Fusion is an adaptive compression method that reuses token embeddings and only re-encodes regions with significant changes in video data.
- It exploits high temporal redundancy to reduce computational load, optimize memory usage, and accelerate processing on resource-constrained edge devices.
- Empirical evaluations show up to 84% token reduction and 13× speedup with less than a 5% loss in accuracy compared to dense transformer models.
Sparse Temporal Token Fusion (STTF) is an adaptive compression technique designed for real-time deployment of vision-language models (VLMs) on resource-constrained edge devices. STTF leverages the high temporal redundancy present in video and event-based data by conditionally reusing existing token embeddings and re-encoding only those representing regions of significant change. This conditional token update methodology reduces computational overhead, optimizes memory usage, and lowers latency without substantial loss in task accuracy (Tanvir et al., 23 Nov 2025).
1. Motivation: Temporal Redundancy and Edge Constraints
Edge VLMs for scenarios such as drones or wearables must operate under strict constraints in power, memory, and compute. Classical per-frame transformer encoding is inefficient for streaming visual data due to pronounced temporal redundancy; spatial patches across consecutive frames often remain static, resulting in wasteful recomputation and excessive FLOPs. STTF addresses this by incrementally updating the token set, fusing "stale" tokens with re-encoded ones at each time step.
At any time step $t$, the visual input can be:
- An RGB frame $x_t \in \mathbb{R}^{H \times W \times 3}$, or
- A neuromorphic event tensor $e_t$ (with polarity and count).
Each input is partitioned into $N$ non-overlapping patches (e.g., $N = 196$), each embedded into a $D$-dimensional vector, yielding patch embeddings $E_t \in \mathbb{R}^{N \times D}$.
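For concreteness, a minimal NumPy sketch of the patchification step is shown below; the 224×224 input size, 16×16 patch size, and the `patchify` helper are illustrative assumptions (chosen so that $N = 196$, matching the token counts reported later), not details taken from the source.

```python
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C frame into N flattened, non-overlapping patches.

    Returns shape (N, patch*patch*C) with N = (H // patch) * (W // patch);
    any remainder rows/columns are cropped.
    """
    H, W, C = frame.shape
    gh, gw = H // patch, W // patch
    x = frame[:gh * patch, :gw * patch]                    # crop to a whole number of patches
    x = x.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

frame = np.random.rand(224, 224, 3).astype(np.float32)    # illustrative 224x224 RGB frame
patches = patchify(frame)                                  # shape (196, 768): N = 196 patches
```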
2. Mathematical Formulation and Fusion Algorithm
Let $E_t = \mathrm{Encoder}(x_t) \in \mathbb{R}^{N \times D}$ denote the current patch embeddings, with rows $E_t^i$. Fused token embeddings from the previous timestep are denoted $\hat{y}_{t-1} \in \mathbb{R}^{N \times D}$.
Event-driven change detection is performed via the per-patch change function

$$\Delta_t^i = \left\lVert x_t^i - \hat{y}_{t-1}^i \right\rVert_2,$$

where $\tau$ is a tunable threshold against which $\Delta_t^i$ is compared. This establishes a binary mask $M_t \in \{0,1\}^N$:

$$M_t^i = \begin{cases} 1, & \Delta_t^i > \tau, \\ 0, & \text{otherwise}. \end{cases}$$

A value $M_t^i = 1$ indicates the patch must be re-encoded; $M_t^i = 0$ signals reuse.
The sparse fusion update is computed as

$$\hat{y}_t = M_t \odot E_t + (1 - M_t) \odot \hat{y}_{t-1},$$

where $\odot$ denotes broadcasted element-wise multiplication.
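As a toy illustration (scalar tokens, $D = 1$, values chosen arbitrarily): with patches $x_t = (0.9,\, 0.4,\, 0.7,\, 0.1)$, cached tokens $\hat{y}_{t-1} = (0.2,\, 0.4,\, 0.7,\, 0.1)$, and $\tau = 0.3$, the changes are $\Delta_t = (0.7,\, 0,\, 0,\, 0)$, so $M_t = (1, 0, 0, 0)$: only the first patch is re-encoded, giving $\hat{y}_t = (E_t^1,\, \hat{y}_{t-1}^2,\, \hat{y}_{t-1}^3,\, \hat{y}_{t-1}^4)$.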
Pseudocode summary:
Inputs:
    x_t (frame at time t), optional e_t (event map)
    Previous state: ŷ_{t-1} ∈ ℝ^{N×D}
    Threshold: τ
Output:
    ŷ_t ∈ ℝ^{N×D}

Algorithm:
    if t == 1:
        ŷ_1 = Encoder(x_1)
        return ŷ_1
    else:
        [Optional] m_t spatially upsampled from the event map e_t
        for each token i in 1..N:
            Δ_t^i = ‖x_t^i − ŷ_{t-1}^i‖₂
            M_t^i = 1 if Δ_t^i > τ else 0
        compute E_t^i = Encoder(x_t)^i only for tokens with M_t^i = 1
        for i in 1..N:
            ŷ_t^i = M_t^i · E_t^i + (1 − M_t^i) · ŷ_{t-1}^i
        return ŷ_t
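The following runnable NumPy sketch implements one STTF update following the pseudocode above. The linear `encode_patches` stand-in and the assumption that the flattened patch dimension equals the token dimension $D$ (so the change norm $\lVert x_t^i - \hat{y}_{t-1}^i \rVert_2$ is directly computable) are illustrative simplifications, not details from the source.

```python
from typing import Optional

import numpy as np

D = 768                                            # token dim; assumed equal to flattened patch dim
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=D ** -0.5, size=(D, D)).astype(np.float32)

def encode_patches(patches: np.ndarray) -> np.ndarray:
    """Stand-in per-patch encoder: a single linear projection (illustrative only)."""
    return patches @ W_enc

def sttf_step(x_t: np.ndarray, y_prev: Optional[np.ndarray], tau: float) -> np.ndarray:
    """One STTF update. x_t: (N, D) flattened patches; y_prev: (N, D) cached tokens or None."""
    if y_prev is None:                             # t == 1: encode the full token set
        return encode_patches(x_t)
    delta = np.linalg.norm(x_t - y_prev, axis=1)   # per-patch change magnitude Δ_t^i
    mask = delta > tau                             # binary refresh mask M_t
    y_t = y_prev.copy()                            # reuse stale tokens by default
    if mask.any():
        y_t[mask] = encode_patches(x_t[mask])      # re-encode only the changed patches
    return y_t

# Usage on a short synthetic stream of flattened-patch frames (e.g. from patchify() above).
y = None
for _ in range(5):
    frame_patches = rng.random((196, D), dtype=np.float32)
    y = sttf_step(frame_patches, y, tau=5.0)       # tau here is illustrative; calibrate as in Section 5
```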
3. Hardware-Aware Implementation and Computational Savings
The STTF architecture is tailored for edge hardware:
- Memory: Fused token sets are cached in on-chip SRAM/scratchpad for low-latency updates.
- Parallelism: Vectorized threshold comparisons ($\Delta_t^i > \tau$) allow simultaneous mask computation across all tokens. Only the minimal active token list is encoded, maximizing the benefits of token sparsity.
- FLOPs reduction: For $N$ tokens and $k_t$ refreshed tokens at time $t$, the relative per-frame FLOPs savings is approximately $1 - k_t / N$.
With $k_t \ll N$, the total encoder load over the sequence is proportional to $\sum_t k_t$ (see the short numerical check below).
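As a quick numerical check of this relation, using the DVS128 Gesture averages reported in the next section ($N = 196$, $k_t \approx 31$):

```python
N, k_t = 196, 31                  # total tokens vs. average re-encoded tokens (DVS128 Gesture)
retained = k_t / N                # fraction of dense encoder FLOPs still spent per frame
savings = 1 - retained            # relative per-frame FLOPs savings
print(f"retained ~ {retained:.2f}x dense FLOPs, savings ~ {savings:.0%}")   # ~0.16x, ~84%
```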
4. Empirical Characterization
Extensive evaluation demonstrates substantial gains in token efficiency and latency, with strong accuracy retention:
- DVS128 Gesture (event video):
- Baseline dense ViT: $N = 196$ tokens/frame.
- STTF: 31 tokens/frame on average (an 84% token reduction), with recognition accuracy of 95.6% versus 98.4% for the dense Vision Transformer (ViT) baseline.
- Encoder speedup: roughly 6× for the fusion stage, consistent with encoder FLOPs dropping to 0.16× of the dense baseline.
- End-to-end latency: Up to 13× improvement versus dense ViT+GPT on Jetson Nano hardware.
| Metric | Dense ViT | STTF | Change vs. dense |
|---|---|---|---|
| Avg tokens per frame | 196 | 31 | 84% reduction |
| Recognition accuracy | 98.4% | 95.6% | −2.8 pp |
| Encoder FLOPs per frame | 1.0× | 0.16× | 84% reduction |
| End-to-end latency (ms) | 120 | 9 | 13× faster |
Increasing $\tau$ (i.e., more aggressive token reuse) yields reduced computation at a moderate cost to accuracy; $\tau \in [0.1, 0.3]$ typically achieves 80–90% token reduction with less than a 5% accuracy decrease.
5. Threshold Selection and Stability Techniques
Optimal operation of STTF depends on careful hyperparameter tuning:
- Threshold $\tau$: Recommended to be set at a chosen percentile of the patch embedding changes $\Delta_t^i$ measured on a validation set (see the calibration sketch after this list). Sweeping $\tau$ and plotting the resulting token-count vs. accuracy trade-off curve identifies the inflection point for practical deployment.
- Stabilization: Applying a momentum update to cached tokens (e.g., $\hat{y}_t^i \leftarrow \alpha\, \hat{y}_{t-1}^i + (1 - \alpha)\, E_t^i$ for re-encoded tokens, with momentum coefficient $\alpha \in [0, 1)$) suppresses re-encoding jitter. Early stopping in training prevents overfitting to sparse updates, and a regularization penalty on mask sparsity encourages smoother token masks.
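A minimal sketch of these two heuristics is given below, assuming per-patch change magnitudes $\Delta_t^i$ have already been collected on a validation sweep; the percentile value and momentum coefficient $\alpha$ are illustrative placeholders rather than values reported in the source.

```python
import numpy as np

def calibrate_tau(validation_deltas: np.ndarray, percentile: float = 80.0) -> float:
    """Pick tau as a percentile of validation change magnitudes Δ_t^i.

    A higher percentile means fewer refreshes (more aggressive reuse); sweep the
    percentile and inspect the token-count vs. accuracy curve to pick the knee.
    """
    return float(np.percentile(validation_deltas, percentile))

def momentum_refresh(y_prev_i: np.ndarray, e_t_i: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Blend a re-encoded token with its cached value to suppress re-encoding jitter."""
    return alpha * y_prev_i + (1.0 - alpha) * e_t_i
```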
6. Summary and Implications
STTF reconceptualizes transformer encoding for video/event vision as an incremental state update, efficiently combining static token reuse and sparse re-encoding according to data-driven change. This framework enables up to 84% FLOPs reduction and a 6–13× real-time speedup on edge hardware with minimal accuracy penalty (under 5%). The approach is compatible with event-driven computer vision, hardware-friendly due to explicit mask logic and on-chip state buffering, and lends itself to further research in adaptive attention and incremental representation (Tanvir et al., 23 Nov 2025). A plausible implication is that STTF can generalize to broader classes of sequential transformer tasks where high temporal redundancy is present, provided stateful token caching and rapid change detection are feasible.