
TinyGPT-STTF: Efficient Vision-Language Model

Updated 30 November 2025
  • TinyGPT-STTF is a 3B-parameter vision-language model that integrates Sparse Temporal Token Fusion with a sparsified ViT encoder and lightweight GPT decoder for edge device applications.
  • It employs event-driven gating and token memory to dynamically reuse unaltered tokens, significantly reducing computational load and energy consumption.
  • Benchmarks demonstrate its efficiency, with a 17.6-point CIDEr improvement, 62× fewer FLOPs, and up to 13× lower latency compared to larger multimodal models.

TinyGPT-STTF is a 3B-parameter vision-language model designed for efficient real-time inference on edge devices. The underlying work introduces two innovations, Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC, which is discussed in the paper but is not part of the STTF variant), to achieve substantial reductions in parameter count, memory footprint, and computational cost while maintaining or surpassing the accuracy of much larger models. TinyGPT-STTF combines a sparsified ViT encoder, a lightweight GPT-style decoder, event-driven gating with cross-frame token reuse, and a suite of hardware-aware optimizations tailored for low-power system-on-chip deployments (Tanvir et al., 23 Nov 2025).

1. Model Structure and Architectural Scaling

TinyGPT-STTF employs a modular encoder-decoder design, uniting a sparse vision transformer (ViT) encoder and a compact GPT-derived text decoder ("MicroGPT"). Architectural modifications include aggressive down-scaling of hidden dimensions, MLP widths, and network depth relative to standard ViT-B/16 and GPT-2 stacks, supported by targeted gating mechanisms and persistent token memory:

  • Visual Encoder (SparseViT):
    • 12 Transformer encoder layers with hidden dimension 512, MLP inner dimension 2048, and 8 attention heads (head-dim 64).
    • Patch embedding converts $16 \times 16$ RGB patches into 512-dimensional tokens; learnable 2D positional embeddings are applied.
    • EventGateCNN front-end (4 convolutional layers) generates a binary change mask $m_t \in \{0,1\}^{H \times W}$.
  • Language Decoder (MicroGPT):
    • 12 Transformer decoder blocks, hidden dimension 512, MLP inner dimension 2048, 8 attention heads.
    • Text token embeddings are 512-dimensional with shared positional encodings.
  • Cross-modal Temporal Fusion:
    • Decoder blocks use masked cross-attention between text queries and dynamically-selected vision tokens $z_t$, modulated by the change mask.
  • TokenMemory and Gating:
    • At each encoder block $l$, a gating network $G^{(l)}$ compares current and previous frame token vectors:
    • $\Delta x_t^{(l)} = x_t^{(l)} - x_{t-1}^{(l)}$,
    • $g_t^{(l)} = \sigma(W^{(l)} \cdot \Delta x_t^{(l)} + b^{(l)})$ for token-wise gating,
    • enabling unchanged tokens to be directly reused from the previous frame (a minimal code sketch of this gating step follows the comparison paragraph below).

Relative to a vanilla ViT + GPT stack, TinyGPT-STTF uses reduced hidden dimensions (512 vs. 768), fewer layers (12 vs. 24–32), event-driven gating and fusion at every encoder layer, and a persistent token memory for temporal reuse.
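The gating step above can be made concrete with a short module. The following is a minimal PyTorch-style sketch under assumed shapes; the class name TokenGate and the single linear gate projection are illustrative stand-ins, not the released implementation.

import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Token-wise gate g = sigmoid(W · Δx + b) that blends freshly encoded
    tokens with cached ones from the previous frame (illustrative sketch)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, 1)  # plays the role of W^{(l)}, b^{(l)}

    def forward(self, x_t, x_prev, z_prev):
        # x_t, x_prev, z_prev: (num_tokens, dim) current tokens, previous-frame
        # tokens, and the cached fused tokens from TokenMemory.
        delta = x_t - x_prev                        # Δx_t^{(l)}
        gate = torch.sigmoid(self.proj(delta))      # g_t^{(l)}, shape (num_tokens, 1)
        return gate * x_t + (1.0 - gate) * z_prev   # z_t^{(l)}

# Example: 196 tokens of dimension 512 from the current and previous frame.
gate_layer = TokenGate(dim=512)
z_t = gate_layer(torch.randn(196, 512), torch.randn(196, 512), torch.randn(196, 512))

Because the gate broadcasts one scalar per token, tokens whose gate is near zero simply pass the cached value through, which is the cross-frame reuse described above.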

2. Sparse Temporal Token Fusion Algorithm

STTF is the key innovation enabling conditional computation and frame-adaptive sparsity. It executes the following workflow at each video or event frame $t$:

  • Change Detection (EventGateCNN):

An event map $e_t$ is processed by EventGateCNN to output a binary mask $m_t$, signaling active regions for encoding. Only tokens corresponding to active regions are considered for further computation.

  • Layer-wise Gating & Token Fusion:

For every layer $l$ and token $i$:

$\Delta x_{t,i}^{(l)} = x_{t,i}^{(l)} - x_{t-1,i}^{(l)}$,
$g_{t,i}^{(l)} = \sigma(W_g^{(l)} \cdot \Delta x_{t,i}^{(l)} + b_g^{(l)})$,
$z_{t,i}^{(l)} = g_{t,i}^{(l)} \odot x_{t,i}^{(l)} + (1 - g_{t,i}^{(l)}) \odot z_{t-1,i}^{(l)}$,

where $\sigma(\cdot)$ is the sigmoid, $W_g^{(l)}$ and $b_g^{(l)}$ are small learnable parameters ($d = 512$), and $z$ is the fused token.

  • End-to-End Pseudocode:

# Inputs: x_t = RGB frame, e_t = event map, y = text tokens,
#         state s_{t-1} = {z_prev, m_prev} plus cached previous-frame tokens x_prev
m_t = EventGateCNN(e_t)                      # binary change mask
P_t = extract_patches_where(m_t == 1)        # indices of active tokens
for l in range(12):                          # SparseViT encoder layers
    for i in P_t:                            # gate only the active tokens
        delta = x_t[i] - x_prev[i]           # per-token change vs. previous frame
        gate = sigmoid(W_g[l] @ delta + b_g[l])
        if gate > tau:                       # token changed enough: recompute
            z_t[i] = EncoderBlock[l](x_t[i])
        else:                                # token unchanged: reuse from TokenMemory
            z_t[i] = z_prev[i]
# Cross-modal attention and decoding omitted for clarity.
Typically, $\tau$ is set low (e.g., $0.1$), encouraging highly sparse per-frame updates; a runnable toy version of this loop is sketched at the end of this section.

This process allows tokens from static or low-change regions to persist across frames, eliminating redundant computation.
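As referenced above, here is a runnable toy version of the STTF loop in plain NumPy. The EventGateCNN output is mocked as a boolean mask, the encoder block is a dummy nonlinearity, and all shapes and constants are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
N, D, L, TAU = 196, 512, 12, 0.1             # tokens per frame, token dim, layers, gate threshold

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encoder_block(tok):
    # Stand-in for a real Transformer encoder block (illustrative only).
    return np.tanh(tok)

x_prev = rng.standard_normal((N, D)).astype(np.float32)       # previous frame's tokens
x_t = x_prev.copy()
x_t[:30] += rng.standard_normal((30, D)).astype(np.float32)   # only 30 tokens actually change

m_t = np.zeros(N, dtype=bool)
m_t[:30] = True                              # mocked EventGateCNN output: active regions
W_g = rng.standard_normal((L, D)).astype(np.float32) * 0.01   # per-layer gate weights
b_g = np.zeros(L, dtype=np.float32)

z_prev = x_prev.copy()                       # TokenMemory from the previous frame
z_t = z_prev.copy()
recomputed = 0
for l in range(L):
    for i in np.where(m_t)[0]:               # only tokens flagged by the change mask
        delta = x_t[i] - x_prev[i]
        gate = sigmoid(W_g[l] @ delta + b_g[l])
        if gate > TAU:                       # changed enough: recompute this token
            z_t[i] = encoder_block(x_t[i])
            recomputed += 1
        else:                                # unchanged: reuse the cached token
            z_t[i] = z_prev[i]

print(f"recomputed {recomputed} of {N * L} token-layer updates "
      f"({100 * recomputed / (N * L):.1f}% of dense compute)")

In this toy, flagging 30 of 196 tokens means roughly 15% of the token-layer updates are executed, loosely mirroring the ~16% average active-token rate reported in the next section.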

3. Hardware-Aware Optimizations

TinyGPT-STTF is engineered for mobile SoCs and embedded platforms through layered quantization and custom execution strategies:

  • Low-Precision Quantization:
    • 8-bit symmetric quantization-aware training (QAT) for weights and activations, with per-channel scaling (a minimal sketch of this scheme follows the list below).
    • Gate weights $W_g^{(l)}$ are quantized to 8 bits, eliminating the need for high-precision computation during gating.
    • Integer-only fallback for inference to maximize throughput.
  • Conditional Execution:
    • Sparse multi-head attention (MHA) kernels dynamically process only "active" tokens as dictated by gating.
    • Kernel fusion executes token gathering, QKV projection, and softmax in one pass, reducing overhead.
  • TokenMemory Compression:
    • TokenMemory is stored in 8-bit representations, only decompressed for regions marked as changed.
    • Results in up to 70% reduction in peak activation memory versus dense ViT; only ~16% of tokens active per frame on average.
  • Optimized Matrix Multiplication:
    • Grouped GEMM dispatches process only the active rows, improving cache utilization by 4–6×.
  • Measured On-Device Resource Use:
    • 62× fewer FLOPs per frame (~$1.2 \times 10^{12}$ vs. $7.4 \times 10^{13}$ for LLaVA-1.5 7B).
    • Up to 13× lower inference latency; 5–15× reduced energy per frame.
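As noted in the quantization bullet above, the following NumPy sketch illustrates per-channel symmetric int8 quantization and active-row dispatch; the helper names, shapes, and the simple dequantize-then-matmul pattern are illustrative assumptions, not the paper's integer kernels.

import numpy as np

def quantize_per_channel_int8(w):
    """Symmetric per-channel int8 quantization: one scale per output channel, zero-point 0."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale.astype(np.float32)

def dequantize(w_q, scale):
    return w_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32) * 0.02   # a hypothetical projection
w_q, scale = quantize_per_channel_int8(w)
print(f"int8 storage: {w_q.nbytes} B vs fp32: {w.nbytes} B, "
      f"max abs error: {np.abs(w - dequantize(w_q, scale)).max():.2e}")

# Active-row dispatch: project only the tokens flagged as changed, not the full frame.
tokens = rng.standard_normal((196, 512)).astype(np.float32)
active = rng.random(196) < 0.16                  # roughly the ~16% average active-token rate
out = dequantize(w_q, scale) @ tokens[active].T  # (512, n_active) instead of (512, 196)
print(f"projected {active.sum()} of 196 tokens, output shape {out.shape}")

A true integer-only path would keep the matmul in int8 and apply the per-channel rescaling afterwards; the float dequantization here is only to keep the sketch short.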

4. Empirical Performance and Benchmarks

Extensive quantitative evaluation demonstrates strong efficiency and accuracy across multiple tasks and datasets:

  • COCO 2017 Test Set (Karpathy Split):

| Model        | Params | CIDEr | BLEU-4 | METEOR | ROUGE-L |
|--------------|:------:|:-----:|:------:|:------:|:-------:|
| LLaVA-1.5 7B | 7 B    | 113.6 | 0.35   | 0.28   | –       |
| TinyGPT-STTF | 3 B    | 131.2 | 0.38   | 0.31   | 0.56    |

TinyGPT-STTF achieves a 17.6-point CIDEr gain with 2.3× fewer parameters and 62× fewer FLOPs per frame than LLaVA-1.5 7B.

  • DVS128 Gesture Recognition:
    • Reduces average tokens from 196 to 31 per frame (84% reduction).
    • Maintains 95.6% recognition accuracy (comparable to dense baseline).
    • Inference speedup of 6.1×.
  • Low-Motion Scenes Comparison (ANC baseline):
    • ~90% reduction in FLOPs for event density <5%.
    • Accuracy within 1–2% of full model.

5. Comparative Evaluation and Practical Significance

Relative to established baseline methods (dense ViT-GPT, LLaVA-7B, BLIP-2 Vicuna-7B), TinyGPT-STTF offers the following empirical advantages:

  • Absolute CIDEr improvement of 17.6 points over LLaVA-1.5 7B.
  • 2.3× reduction in parameter footprint (3B vs. 7B).
  • 62× reduction in on-device FLOPs per frame.
  • 13× reduction in end-to-end inference latency on mobile SoCs.
  • 5–15× lower energy usage for sustained operation.
  • 84% reduction in tokens for video/event-based streams, without compromising gesture recognition accuracy (maintaining 95.6%).

These outcomes demonstrate that careful architectural design, in combination with event-driven token reuse and gating, can yield near state-of-the-art (SOTA) multimodal performance within the computational and energy budgets of edge devices.

6. Applications and Deployment Context

TinyGPT-STTF addresses real-time vision-language inference on resource-constrained hardware, such as smartphones, embedded cameras, and mobile robots. Its design enables practical deployment scenarios where battery life and memory are primary constraints, including:

  • Continuous video captioning and stream analysis.
  • Gesture recognition from event-based camera data.
  • Natural language multi-modal interaction on portable or always-on devices.

A plausible implication is that this architectural approach enables the next generation of edge-AI deployments, opening domains for multimodal models previously constrained by hardware limits (Tanvir et al., 23 Nov 2025).
