Sparse Temporal Token Fusion (STTF)
- Sparse Temporal Token Fusion is an adaptive compression method that reuses token embeddings and only re-encodes regions with significant changes in video data.
- It exploits high temporal redundancy to reduce computational load, optimize memory usage, and accelerate processing on resource-constrained edge devices.
- Empirical evaluations show up to 84% token reduction and 13× speedup with less than a 5% loss in accuracy compared to dense transformer models.
Sparse Temporal Token Fusion (STTF) is an adaptive compression technique designed for real-time deployment of vision-language models (VLMs) on resource-constrained edge devices. STTF leverages the high temporal redundancy present in video and event-based data by conditionally reusing existing token embeddings and re-encoding only those representing regions of significant change. This conditional token update methodology reduces computational overhead, optimizes memory usage, and lowers latency without substantial loss in task accuracy (Tanvir et al., 23 Nov 2025).
1. Motivation: Temporal Redundancy and Edge Constraints
Edge VLMs for scenarios such as drones or wearables must operate under strict constraints in power, memory, and compute. Classical per-frame transformer encoding is inefficient for streaming visual data due to pronounced temporal redundancy; spatial patches across consecutive frames often remain static, resulting in wasteful recomputation and excessive FLOPs. STTF addresses this by incrementally updating the token set, fusing "stale" tokens with re-encoded ones at each time step.
At any time step $t$, the visual input can be:
- An RGB frame $x_t \in \mathbb{R}^{H \times W \times 3}$, or
- A neuromorphic event tensor $e_t$ (with polarity and count).
Each input is partitioned into $N$ non-overlapping patches (e.g., $N = 196$), each embedded into a $D$-dimensional vector, yielding patch embeddings $E_t \in \mathbb{R}^{N \times D}$.
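For concreteness, a minimal NumPy sketch of the patchification step is shown below; the 224×224 input size, 16×16 patch size, and the `patchify` helper are illustrative assumptions (chosen so that $N = 196$, matching the token counts reported later), not details taken from the source.

```python
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C frame into N flattened, non-overlapping patches.

    Returns shape (N, patch*patch*C) with N = (H // patch) * (W // patch);
    any remainder rows/columns are cropped.
    """
    H, W, C = frame.shape
    gh, gw = H // patch, W // patch
    x = frame[:gh * patch, :gw * patch]                    # crop to a whole number of patches
    x = x.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

frame = np.random.rand(224, 224, 3).astype(np.float32)    # illustrative 224x224 RGB frame
patches = patchify(frame)                                  # shape (196, 768): N = 196 patches
```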
2. Mathematical Formulation and Fusion Algorithm
Let $E_t = \mathrm{Encoder}(x_t) \in \mathbb{R}^{N \times D}$ denote the current patch embeddings, with rows $E_t^i$. Fused token embeddings from the previous timestep are denoted $\hat{y}_{t-1} \in \mathbb{R}^{N \times D}$.
Event-driven change detection is performed via the per-patch change function

$$\Delta_t^i = \left\lVert x_t^i - \hat{y}_{t-1}^i \right\rVert_2,$$

where $\tau$ is a tunable threshold against which $\Delta_t^i$ is compared. This establishes a binary mask $M_t \in \{0,1\}^N$:

$$M_t^i = \begin{cases} 1, & \Delta_t^i > \tau, \\ 0, & \text{otherwise}. \end{cases}$$

A value $M_t^i = 1$ indicates the patch must be re-encoded; $M_t^i = 0$ signals reuse.
The sparse fusion update is computed as

$$\hat{y}_t = M_t \odot E_t + (1 - M_t) \odot \hat{y}_{t-1},$$

where $\odot$ denotes broadcasted element-wise multiplication.
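As a toy illustration (scalar tokens, $D = 1$, values chosen arbitrarily): with patches $x_t = (0.9,\, 0.4,\, 0.7,\, 0.1)$, cached tokens $\hat{y}_{t-1} = (0.2,\, 0.4,\, 0.7,\, 0.1)$, and $\tau = 0.3$, the changes are $\Delta_t = (0.7,\, 0,\, 0,\, 0)$, so $M_t = (1, 0, 0, 0)$: only the first patch is re-encoded, giving $\hat{y}_t = (E_t^1,\, \hat{y}_{t-1}^2,\, \hat{y}_{t-1}^3,\, \hat{y}_{t-1}^4)$.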
Pseudocode summary:
Inputs:
    x_t (frame at time t), optional e_t (event map)
    Previous state: ŷ_{t-1} ∈ ℝ^{N×D}
    Threshold: τ
Output:
    ŷ_t ∈ ℝ^{N×D}

Algorithm:
    if t == 1:
        ŷ_1 = Encoder(x_1)
        return ŷ_1
    else:
        [Optional] m_t spatially upsampled from the event map e_t
        for each token i in 1..N:
            Δ_t^i = ‖x_t^i − ŷ_{t-1}^i‖₂
            M_t^i = 1 if Δ_t^i > τ else 0
        compute E_t^i = Encoder(x_t)^i only for tokens with M_t^i = 1
        for i in 1..N:
            ŷ_t^i = M_t^i · E_t^i + (1 − M_t^i) · ŷ_{t-1}^i
        return ŷ_t
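The following runnable NumPy sketch implements one STTF update following the pseudocode above. The linear `encode_patches` stand-in and the assumption that the flattened patch dimension equals the token dimension $D$ (so the change norm $\lVert x_t^i - \hat{y}_{t-1}^i \rVert_2$ is directly computable) are illustrative simplifications, not details from the source.

```python
from typing import Optional

import numpy as np

D = 768                                            # token dim; assumed equal to flattened patch dim
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=D ** -0.5, size=(D, D)).astype(np.float32)

def encode_patches(patches: np.ndarray) -> np.ndarray:
    """Stand-in per-patch encoder: a single linear projection (illustrative only)."""
    return patches @ W_enc

def sttf_step(x_t: np.ndarray, y_prev: Optional[np.ndarray], tau: float) -> np.ndarray:
    """One STTF update. x_t: (N, D) flattened patches; y_prev: (N, D) cached tokens or None."""
    if y_prev is None:                             # t == 1: encode the full token set
        return encode_patches(x_t)
    delta = np.linalg.norm(x_t - y_prev, axis=1)   # per-patch change magnitude Δ_t^i
    mask = delta > tau                             # binary refresh mask M_t
    y_t = y_prev.copy()                            # reuse stale tokens by default
    if mask.any():
        y_t[mask] = encode_patches(x_t[mask])      # re-encode only the changed patches
    return y_t

# Usage on a short synthetic stream of flattened-patch frames (e.g. from patchify() above).
y = None
for _ in range(5):
    frame_patches = rng.random((196, D), dtype=np.float32)
    y = sttf_step(frame_patches, y, tau=5.0)       # tau here is illustrative; calibrate as in Section 5
```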
3. Hardware-Aware Implementation and Computational Savings
The STTF architecture is tailored for edge hardware:
- Memory: Fused token sets are cached in on-chip SRAM/scratchpad for low-latency updates.
- Parallelism: Vectorized threshold comparisons ($\Delta_t^i > \tau$) allow simultaneous mask computation across all tokens. Only the minimal active token list is encoded, maximizing the benefits of token sparsity.
- FLOPs reduction: For $N$ tokens and $k_t$ refreshed tokens at time $t$, the relative per-frame FLOPs savings is approximately $1 - k_t / N$.
With $k_t \ll N$, the total encoder load over the sequence is proportional to $\sum_t k_t$ (see the short numerical check below).
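As a quick numerical check of this relation, using the DVS128 Gesture averages reported in the next section ($N = 196$, $k_t \approx 31$):

```python
N, k_t = 196, 31                  # total tokens vs. average re-encoded tokens (DVS128 Gesture)
retained = k_t / N                # fraction of dense encoder FLOPs still spent per frame
savings = 1 - retained            # relative per-frame FLOPs savings
print(f"retained ~ {retained:.2f}x dense FLOPs, savings ~ {savings:.0%}")   # ~0.16x, ~84%
```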
4. Empirical Characterization
Extensive evaluation demonstrates substantial gains in token efficiency and latency, with strong accuracy retention:
- DVS128 Gesture (event video):
- Baseline dense ViT: $N = 196$ tokens/frame.
- STTF: 31 tokens/frame on average (an 84% token reduction), with recognition accuracy of 95.6% versus 98.4% for the dense Vision Transformer (ViT) baseline.
- Encoder speedup: roughly 6× for the fusion stage, consistent with encoder FLOPs dropping to 0.16× of the dense baseline.
- End-to-end latency: Up to 13× improvement versus dense ViT+GPT on Jetson Nano hardware.
| Metric | Dense ViT | STTF | Change vs. dense |
|---|---|---|---|
| Avg tokens per frame | 196 | 31 | 84% reduction |
| Recognition accuracy | 98.4% | 95.6% | −2.8 pp |
| Encoder FLOPs per frame | 1.0× | 0.16× | 84% reduction |
| End-to-end latency (ms) | 120 | 9 | 13× faster |
Increasing $\tau$ (i.e., more aggressive token reuse) yields reduced computation at a moderate cost to accuracy; $\tau \in [0.1, 0.3]$ typically achieves 80–90% token reduction with less than a 5% accuracy decrease.
5. Threshold Selection and Stability Techniques
Optimal operation of STTF depends on careful hyperparameter tuning:
- Threshold $\tau$: Recommended to be set at a chosen percentile of the patch embedding changes $\Delta_t^i$ measured on a validation set (see the calibration sketch after this list). Sweeping $\tau$ and plotting the resulting token-count vs. accuracy trade-off curve identifies the inflection point for practical deployment.
- Stabilization: Applying a momentum update to cached tokens (e.g., $\hat{y}_t^i \leftarrow \alpha\, \hat{y}_{t-1}^i + (1 - \alpha)\, E_t^i$ for re-encoded tokens, with momentum coefficient $\alpha \in [0, 1)$) suppresses re-encoding jitter. Early stopping in training prevents overfitting to sparse updates, and a regularization penalty on mask sparsity encourages smoother token masks.
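A minimal sketch of these two heuristics is given below, assuming per-patch change magnitudes $\Delta_t^i$ have already been collected on a validation sweep; the percentile value and momentum coefficient $\alpha$ are illustrative placeholders rather than values reported in the source.

```python
import numpy as np

def calibrate_tau(validation_deltas: np.ndarray, percentile: float = 80.0) -> float:
    """Pick tau as a percentile of validation change magnitudes Δ_t^i.

    A higher percentile means fewer refreshes (more aggressive reuse); sweep the
    percentile and inspect the token-count vs. accuracy curve to pick the knee.
    """
    return float(np.percentile(validation_deltas, percentile))

def momentum_refresh(y_prev_i: np.ndarray, e_t_i: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Blend a re-encoded token with its cached value to suppress re-encoding jitter."""
    return alpha * y_prev_i + (1.0 - alpha) * e_t_i
```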
6. Summary and Implications
STTF reconceptualizes transformer encoding for video/event vision as an incremental state update, efficiently combining static token reuse and sparse re-encoding according to data-driven change. This framework enables up to 84% FLOPs reduction and a 6–13× real-time speedup on edge hardware with minimal accuracy penalty (under 5%). The approach is compatible with event-driven computer vision, hardware-friendly due to explicit mask logic and on-chip state buffering, and lends itself to further research in adaptive attention and incremental representation (Tanvir et al., 23 Nov 2025). A plausible implication is that STTF can generalize to broader classes of sequential transformer tasks where high temporal redundancy is present, provided stateful token caching and rapid change detection are feasible.