Lightweight Temporal Fusion Architecture
- Lightweight temporal fusion architecture is a neural network design that aggregates temporal features with drastically reduced parameters and FLOPs.
- It employs factorized computations, adaptive channel selection, and sparse fusion techniques to maintain performance in constrained environments.
- Applications span real-time video recognition, streaming perception, and edge deployment, demonstrating significant speedups and parameter efficiency.
A lightweight temporal fusion architecture is a neural module or network design that achieves temporal feature aggregation, cross-frame modeling, or video sequence understanding with sharply reduced parameter count, FLOPs, or memory usage compared to standard 3D convolutional, transformer-based, or optical flow-based counterparts. Such methods are motivated by constraints in edge deployment, real-time inference, or large-scale streaming settings, where temporal richness must be retained at minimal computational and storage cost. Recent research explores a spectrum of techniques across 2D CNNs, 3D ConvNets, transformers, BEV perception, multimodal fusion, spiking neural networks, and video-language-action models.
1. Principles of Lightweight Temporal Fusion
Lightweight temporal fusion architectures are characterized by the following unifying principles:
- Factorization of Temporal and Spatial Computation: Replace heavy 3D (spatiotemporal) operations with factorized or separable processing, e.g., decoupling per-frame spatial processing from temporal aggregation via simple fusion modules or low-dimensional attention.
- Reuse and Caching of Features: Store and reuse intermediate features from past frames or time steps, leveraging redundancy and temporal consistency.
- Policy-Driven or Adaptive Channel Selection: Apply channel-wise selection/suppression/skipping, often driven by small policy networks, to minimize redundant computation across time.
- Sparse or Graph-Based Fusion: Restrict temporal or spatial-temporal aggregation to a local region or sampled graph structure to avoid quadratic attention or all-pairs processing.
- Lightweight Attention or State Models: Employ low-rank, compact attention mechanisms, recurrent state-space updates, or adapters with tiny parameter footprints.
- Minimal Additive Parameter/Compute Overhead: Any temporal fusion module must add only a small fraction of the parameters and computational workload of the base model, often measured at <1% to <10%.
These strategies enable efficient temporal modeling for applications in action recognition (Meng et al., 2021), video fusion (Zhao et al., 5 Feb 2026), streaming perception (Li et al., 2022), 3D occupancy (Yu et al., 21 Feb 2025), transformer-based forecasting (Li et al., 2021), and more.
2. Core Methodologies and Architectures
Contemporary lightweight temporal fusion architectures adopt a variety of structural motifs, each corresponding to distinct domains and resource constraints.
A. Factorized and Separable Approaches
- Fully Separable Block (FSB): Decomposes standard 3D convolutions into stacked 1D temporal, 2D spatial, and 1×1×1 pointwise convolutions, drastically reducing parameters and multiply-accumulate counts while improving video recognition accuracy (Wang et al., 2019).
- Temporal Residual Gradient (TRG): Explicitly computes and integrates motion-difference features at minimal cost, enabling shallow architectures to capture motion-relevant cues (Wang et al., 2019).
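The parameter saving from factorization can be checked with simple arithmetic. The sketch below counts weights for a dense K×R×S 3D convolution and for a temporal, depthwise-spatial, pointwise stack. The function names and the 64-channel example are illustrative assumptions, and the layer layout (dense temporal conv, depthwise spatial conv, width multiplier `alpha`) is one plausible reading of a fully separable block rather than the exact configuration from the cited paper.

```python
def conv3d_params(c_in, c_out, k, r, s):
    """Weights in a dense K x R x S 3D convolution (no bias)."""
    return c_in * c_out * k * r * s


def separable_params(c_in, c_out, k, r, s, alpha=1):
    """Weights in a temporal -> depthwise spatial -> pointwise stack (no bias)."""
    c_mid = alpha * c_out
    temporal = c_in * c_mid * k      # K x 1 x 1 conv, dense over channels
    spatial = c_mid * r * s          # 1 x R x S conv, one filter per channel
    pointwise = c_mid * c_out        # 1 x 1 x 1 conv mixing channels
    return temporal + spatial + pointwise


dense = conv3d_params(64, 64, 3, 3, 3)
separable = separable_params(64, 64, 3, 3, 3)
print(dense, separable, round(dense / separable, 2))  # 110592 16960 6.52
```

Under these assumptions the separable stack needs roughly one sixth of the weights; the ratio grows with kernel size and shrinks as `alpha` increases.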
B. Feature Reuse and Adaptive Channel Fusion
- AdaFuse Modules: Channel-wise adaptive fusing of current and historical convolution output, using Gumbel-softmax policy networks that select (per-channel, per-frame) among "keep," "reuse," or "skip" states to minimize workload while preserving critical motion information (Meng et al., 2021).
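To make the per-channel policy concrete, here is a minimal standard-library sketch of Gumbel-softmax sampling over the three actions. It is a generic illustration of the sampling trick, not the AdaFuse implementation: the logits, the temperature `tau`, and the `ACTIONS` tuple are made-up values for demonstration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep both log() calls finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                          # subtract the max for stability
    exps = [math.exp(z - peak) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


ACTIONS = ("keep", "reuse", "skip")            # hypothetical action order

rng = random.Random(0)
channel_logits = [[2.0, 0.1, -1.0], [-0.5, 1.5, 0.0], [0.0, 0.0, 3.0]]
for logits in channel_logits:
    probs = gumbel_softmax(logits, tau=0.5, rng=rng)
    choice = ACTIONS[probs.index(max(probs))]  # hard decision used at inference
    print(choice, [round(p, 3) for p in probs])
```

Lower temperatures push the relaxed sample toward a one-hot vector, which is what allows a discrete keep/reuse/skip decision to be trained with ordinary backpropagation.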
C. Plug-in Fusion Blocks for Streaming or Online Perception
- LongShortNet LSFM: Dual-path network splits short-term per-frame semantic encoding and long-term temporal feature caching, then merges multi-FPN-scale features by lightweight 1×1 convolution and channel-concatenation. This enables real-time streaming detection with minimal added computation (<1ms/frame), achieving accuracy gains on Argoverse-HD (Li et al., 2022).
- Sparse4D Recurrent Temporal Fusion: Sparse propagation of instance-level features across frames with constant per-frame update cost. Image-derived semantics are preserved and projected, and structured anchor parameters are updated at low cost. There is no quadratic cost in the time horizon; memory and speed are independent of temporal history length (Lin et al., 2023).
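The caching pattern shared by these plug-in blocks can be sketched for a single pixel: keep a bounded buffer of past features, concatenate them with the current feature, and mix with a 1×1 convolution, which at one pixel is a matrix-vector product. The `HistoryFusion` class, its weights, and the zero-padding of missing history are illustrative assumptions and do not reproduce LSFM or Sparse4D.

```python
from collections import deque


class HistoryFusion:
    """Fuse the current per-pixel feature with a bounded cache of past ones."""

    def __init__(self, channels, history, weights):
        self.channels = channels
        self.cache = deque(maxlen=history)   # oldest frame is evicted automatically
        self.weights = weights               # channels x (channels * (history + 1))

    def step(self, feature):
        # Concatenate current and cached features, most recent first.
        stacked = list(feature)
        for past in reversed(self.cache):
            stacked.extend(past)
        # Zero-pad while the cache is still filling up.
        missing = self.cache.maxlen - len(self.cache)
        stacked.extend([0.0] * (missing * self.channels))
        # A 1x1 convolution at one pixel is a matrix-vector product.
        fused = [sum(w * x for w, x in zip(row, stacked)) for row in self.weights]
        self.cache.append(list(feature))
        return fused


fusion = HistoryFusion(channels=2, history=1,
                       weights=[[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]])
for frame in ([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]):
    print(fusion.step(frame))
```

Because the buffer length is fixed, the per-frame cost and memory stay constant however long the stream runs.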
D. Lightweight Transformer-Based Methods
- ST-TIS: Spatial-Temporal Transformer for forecasting uses an information fusion module and a sparse, sampled region graph to limit attention computation to O(n√n) while capturing joint spatial-temporal dependencies. Multi-level transformer attention leverages fused embeddings for accurate, scalable traffic prediction with 0.14M parameters and 95% training time reduction (Li et al., 2021).
- MambaVF: Video fusion interprets alignment as a spatio-temporal state-space model (SSM) scan, eschewing explicit flow/warping. The Vision State Space (VSS) block operates bidirectionally along 8 scan paths with per-token SSM updates, giving linear complexity and <1M parameters, with up to 92% parameter reduction and 2.1× speedup over flow-based SOTA (Zhao et al., 5 Feb 2026).
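The linear cost of a state-space scan can be seen in a scalar toy version. The recurrence below (h_t = a·h_{t-1} + b·x_t, y_t = c·h_t) is the generic discrete SSM form and is used here as an assumption; it is not taken from the MambaVF paper, and the fixed coefficients and two-direction sum stand in for the learned, input-dependent parameters and multiple scan paths of a real block.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t over a token sequence."""
    h = 0.0
    ys = []
    for x in xs:                  # one constant-cost update per token
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, a, b, c):
    """Sum a forward scan and a reversed scan so each token sees both sides."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + r for f, r in zip(forward, backward)]


tokens = [1.0, 0.0, 0.0, 2.0]
print(ssm_scan(tokens, a=0.5, b=1.0, c=1.0))            # [1.0, 0.5, 0.25, 2.125]
print(bidirectional_scan(tokens, a=0.5, b=1.0, c=1.0))  # [2.25, 1.0, 1.25, 4.125]
```

Each token is visited once per direction, so the work grows linearly with sequence length rather than quadratically as in all-pairs attention.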
E. Token-Based/Adapter-Based Temporal Fusion in Transformers
- CFBT (Cross Fusion RGB-T Tracking): Employs three minimal cross-stream modules:
  - CSTAF (cross spatio-temporal attention) for template fusion,
  - CSTCF (complementarity fusion) for the search branch,
  - DSTA (dual-stream adapter), a bottleneck adapter (down-projection followed by up-projection) used for inter-branch temporal fusion within a transformer encoder. Parameter overhead is capped at <0.3%, yet the tracking benchmarks report new SOTA results (Zeng et al., 2024).
- SwiftVLA (4D-aware VLA for Robotics): A compact VLM is augmented at training time with a frozen 4D visual geometry branch and Fusion Tokens supervised by future-prediction objectives. A mask-and-reconstruct strategy distills 4D knowledge into the 2D stream; at inference, all 4D infrastructure is dropped, yielding edge-inference latencies comparable to 2D-only models with the performance of much larger multi-modal baselines (Ni et al., 30 Nov 2025).
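A bottleneck adapter of the kind described for DSTA can be sketched in a few lines. This is a generic residual adapter under stated assumptions: the ReLU nonlinearity, the zero-initialised up-projection, and the dimensions are illustrative choices that the source does not specify, and `adapter`, `adapter_params`, and `matvec` are hypothetical helper names.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter(x, w_down, w_up):
    """Residual bottleneck: x + W_up @ relu(W_down @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_down, x)]   # d -> r, then ReLU
    delta = matvec(w_up, hidden)                        # r -> d
    return [xi + di for xi, di in zip(x, delta)]


def adapter_params(d, r):
    """Weight count of the two projections (biases omitted)."""
    return 2 * d * r


d, r = 4, 2
w_down = [[0.5, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0]]
w_up_zero = [[0.0] * r for _ in range(d)]
x = [1.0, 2.0, 3.0, 4.0]
print(adapter(x, w_down, w_up_zero))        # zero up-projection leaves x unchanged
print(adapter_params(768, 8), 768 * 768)    # 12288 vs 589824 for one d x d projection
```

With a small bottleneck width the adapter costs a few percent of a single dense projection, which is how such modules keep their overhead to a small fraction of the host transformer.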
F. SNN Temporal Fusion
- Fused-Kernel SNN (Temporal Fusion SNNs): Fuses the layer/step unrolling of spiking neuron recurrences so that all time steps are processed in a single loop per layer on the GPU. Temporal state is carried in register-resident variables, which reduces kernel launches by a factor of T, improves memory locality, and yields 5–40× training speedups at identical accuracy (Li et al., 2024).
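The effect of fusing the time loop can be illustrated on the CPU by comparing two loop orders over the same leaky integrate-and-fire update. The sketch is only an analogy for kernel launches: the threshold, leak, and rest constants are placeholder values, and `launches` simply counts passes over the neuron array.

```python
def lif_step(v, x, v_rest=0.0, v_th=1.0, k_tau=0.5):
    """One LIF update: spike on the previous potential, reset, leak, integrate."""
    spike = 1 if v >= v_th else 0
    v = k_tau * v * (1 - spike) + v_rest * spike + x
    return v, spike


def unfused(inputs):
    """Time-major: one pass over all neurons per time step (T 'launches')."""
    n, t_steps = len(inputs), len(inputs[0])
    state = [0.0] * n                       # potentials written back every step
    spikes = [[0] * t_steps for _ in range(n)]
    launches = 0
    for t in range(t_steps):
        launches += 1
        for i in range(n):
            state[i], spikes[i][t] = lif_step(state[i], inputs[i][t])
    return spikes, launches


def fused(inputs):
    """Neuron-major: the whole time loop runs inside a single 'launch'."""
    spikes = []
    for row in inputs:
        v, out = 0.0, []                    # v plays the role of a register
        for x in row:
            v, s = lif_step(v, x)
            out.append(s)
        spikes.append(out)
    return spikes, 1


inputs = [[0.75] * 4, [0.25] * 4]
print(unfused(inputs))   # ([[0, 0, 1, 0], [0, 0, 0, 0]], 4)
print(fused(inputs))     # ([[0, 0, 1, 0], [0, 0, 0, 0]], 1)
```

Both orders produce identical spike trains because each neuron's recurrence within a layer depends only on its own state; the fused order just avoids writing that state back to memory between time steps.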
G. Tri-Stream and Attention-Based Multiview Temporal Correlation
- OccLinker: Fuses static, historical, and motion cues via three parallel lightweight multi-head attention streams; each outputs tokens that are projected to the occupancy grid and combined with a single 3×3×3 convolution. Only 0.5M parameters are added, and inference overhead is negligible, unlike heavy 4D vision baselines (Yu et al., 21 Feb 2025).
3. Complexity and Parameter Efficiency
A defining feature is a sharp reduction in FLOPs, model size, and compute-bound latency without sacrificing temporal modeling power. Representative empirical results are listed below.
| Method / Domain | Params (M) | FLOPs (G) | Relative Speedup | Accuracy Δ | Reference |
|---|---|---|---|---|---|
| MambaVF (Video fusion) | 0.71 | 8.77 | 2.1× | SOTA match | (Zhao et al., 5 Feb 2026) |
| ST-TIS (Forecasting) | 0.14 | — | 95% faster train | -9.5% RMSE vs best | (Li et al., 2021) |
| AdaFuse+TSN (Act.Rec.) | — | 22.1 | 40% saving | -1% | (Meng et al., 2021) |
| LongShortNet (Real-time det.) | +~0.5 | +0.6 | — | +1.0% sAP | (Li et al., 2022) |
| CFBT (RGB-T Track) | <0.3% ovh | — | — | +1.5% SR | (Zeng et al., 2024) |
| SwiftVLA (VLA-robot) | 0.45 | — | 18× edge runtime | +17 pts SR | (Ni et al., 30 Nov 2025) |
| Sparse4Dv2 (3D BEV) | — | — | FPS ×2+, mem –50% | +9.8 mAP | (Lin et al., 2023) |
Complexity reductions arise from selective channel fusion, factorization, tokenization, SSM scanning, or graph sampling (e.g., O(n√n) vs O(n²) in ST-TIS).
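As a worked example of the O(n√n) versus O(n²) gap, the following sketch counts query-key pairs for dense attention and for a scheme in which each region attends to about √n others. The function names and region counts are illustrative assumptions, and the count ignores constant factors such as heads and hub nodes.

```python
import math


def dense_pairs(n):
    """Query-key pairs when every region attends to every region."""
    return n * n


def sampled_pairs(n):
    """Query-key pairs when each region attends to about sqrt(n) regions."""
    return n * math.isqrt(n)


for n in (100, 400, 1600):
    print(n, dense_pairs(n), sampled_pairs(n), dense_pairs(n) // sampled_pairs(n))
# 100 10000 1000 10
# 400 160000 8000 20
# 1600 2560000 64000 40
```

The saving factor is √n itself, so the benefit of sampling grows with the number of regions.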
4. Training and Inference Characteristics
Most designs ensure nearly zero algorithmic approximation error relative to their larger baselines, as they do not sacrifice receptive field or introduce lossy quantization. Surrogate gradients (for SNNs), explicit trajectory or feature reconstruction losses (SwiftVLA, CFBT), or skip connections (AdaFuse, LongShortNet) are widely used.
Parameter increments over baseline CNNs or transformers are typically <1M, often <0.5M, compared to 10–100M+ for standard temporal fusion modules.
Many methods keep inference-time fusion stateless or limit its state to a small feature buffer:
- SwiftVLA, by mask-and-reconstruct pretraining, can drop the entire 4D branch.
- OccLinker computes and adds correction occupancy in parallel with base networks.
- LongShortNet and Sparse4Dv2 require only the storage of historical feature buffers, manageable on edge hardware.
5. Applications and Experimental Evidence
Lightweight temporal fusion is domain-agnostic, spanning:
- Real-time video recognition and action detection (Meng et al., 2021, Wang et al., 2019)
- Streaming and automotive perception (Li et al., 2022, Lin et al., 2023)
- Multimodal fusion for tracking and robotics (Zeng et al., 2024, Ni et al., 30 Nov 2025)
- Remote sensing and change detection (see LCD-Net; Liu et al., 2024)
- Occupancy prediction and scene completion (Yu et al., 21 Feb 2025)
- State-space video fusion (MambaVF) (Zhao et al., 5 Feb 2026)
- Traffic prediction over urban spatiotemporal graphs (Li et al., 2021)
- GPU-efficient SNN training (Li et al., 2024)
Ablation studies consistently demonstrate that even minimalist fusion blocks (LSFM, CSTAF, DSTA, token-level SSM, AdaFuse, etc.) can recover most or all of the performance of heavy temporal pipelines, sometimes even improving strict metrics (e.g., small-object AP, temporal consistency, mIoU, SOTA VIF/SSIM). For example, LongShortNet achieves a +1.0% absolute sAP gain over StreamYOLO-L at only +0.11ms/frame (Li et al., 2022); MambaVF reduces FLOPs by 88.8% over UniVF with equivalent or better perceptual metrics (Zhao et al., 5 Feb 2026); CFBT increases SR by 2.3 pts at <0.3% parameter overhead (Zeng et al., 2024).
6. Limitations and Future Directions
Lightweight temporal fusion methods entail several inherent or context-specific limitations:
- No explicit spatial alignment: LSFM and AdaFuse, among others, may be sensitive to fast, non-uniform motion or viewpoint changes, and could benefit from lightweight cross-attention or deformable modeling.
- Uniform history sampling: Fixed-lag or windowed fusion may not capture irregular dynamics; adaptive or attention-weighted history selection is an open direction.
- Specialized to backbone: Some adapters or policy networks assume CNN or transformer backbone specifics and must be retuned for new architectures.
- Trade-off surface: While compute/accuracy trade-offs are tunable (e.g., regularization weight λ in AdaFuse), aggressive pruning can impact fine-grained action or motion modeling.
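The compute/accuracy trade-off controlled by a regularization weight can be sketched as a task loss plus a weighted compute penalty. The form below is a generic regularized objective written as an assumption, not the exact AdaFuse loss; `regularized_loss`, the per-channel cost, and the probabilities are hypothetical.

```python
def regularized_loss(task_loss, keep_probs, flops_per_channel, lam):
    """task_loss + lam * expected FLOPs of the channels that are recomputed."""
    expected_flops = sum(p * flops_per_channel for p in keep_probs)
    return task_loss + lam * expected_flops


keep_probs = [1.0, 0.5, 0.0]          # probability that each channel is recomputed
for lam in (0.0, 0.5, 1.0):
    print(lam, regularized_loss(1.0, keep_probs, flops_per_channel=2.0, lam=lam))
# 0.0 1.0
# 0.5 2.5
# 1.0 4.0
```

Raising `lam` makes recomputation more expensive relative to the task loss, so a trained policy shifts toward reuse and skipping, which is where fine-grained motion cues can start to degrade.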
Open problems include adapting these techniques to self-supervised/unsupervised domains, further reducing resource demands via quantization, and extending plug-in temporal fusion to emerging architectures (graph, diffusion, etc.).
7. Representative Algorithms and Pseudocode
Common algorithmic skeletons include:
- FSB Block (PyTorch-like) (Wang et al., 2019)
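```python
import torch.nn as nn
import torch.nn.functional as F


class FSBBlock(nn.Module):
    """Fully separable block: 1D temporal -> depthwise 2D spatial -> 1x1x1 pointwise."""

    def __init__(self, C_in, C_out, K, R, S, alpha=1):
        super().__init__()  # required before registering submodules
        C_mid = alpha * C_out
        # Temporal conv: K x 1 x 1 kernel along the time axis
        self.conv_t = nn.Conv3d(C_in, C_mid, kernel_size=(K, 1, 1), bias=False)
        # Spatial conv (grouped, one filter per channel): 1 x R x S kernel
        self.conv_s = nn.Conv3d(C_mid, C_mid, kernel_size=(1, R, S), groups=C_mid, bias=False)
        # Pointwise conv: 1 x 1 x 1 channel mixing
        self.conv_p = nn.Conv3d(C_mid, C_out, kernel_size=1, bias=False)

    def forward(self, x):
        # The original listing wraps conv_t and conv_s in helpers hFA_1D / hFA_2D
        # that it does not define; the convolutions are applied directly here.
        x = F.relu(self.conv_t(x))
        x = F.relu(self.conv_s(x))
        return self.conv_p(x)
```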
- AdaFuse Fusion Block (Meng et al., 2021)
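```python
import torch
import torch.nn.functional as F

def AdaFuse(x_t, y_t, y_tminus1, prev_v, policy_mlp, tau=1.0):
    # Channel-wise fusion; prev_v is the pooled descriptor of the previous frame
    v = x_t.mean(dim=(2, 3))                               # global average pool -> [B, C_in]
    logits = policy_mlp(torch.cat([v, prev_v], dim=1))     # assumed to return [B, C_out * 3]
    action = F.gumbel_softmax(logits.view(y_t.shape[0], -1, 3), tau=tau, hard=True)
    keep, reuse = action[..., 0, None, None], action[..., 1, None, None]  # [B, C_out, 1, 1]
    # "skip" (index 2) contributes nothing, so skipped channels are zeroed
    return keep * y_t + reuse * y_tminus1
```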
- Fused SNN LIF Kernel (Li et al., 2024)
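```cuda
// T, V_rest, V_th and k_tau are not defined in this excerpt; they are assumed to be
// constants or to arrive through the elided parameters. A bounds check on i would
// normally use a neuron count that is also part of the elided parameters.
__global__ void fusedForwardLIF(float *X, float *Vout, int *Yout, ...) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // one thread per neuron
    float v = V_rest;                              // membrane potential stays in a register
    for (int t=0; t<T; ++t) {                      // all T time steps in one launch
        int spike = (v >= V_th) ? 1 : 0;           // fire on the previous potential
        Yout[i*T + t] = spike;
        // reset to V_rest on a spike, otherwise leak by k_tau, then integrate the input
        Vout[i*T + t] = v = k_tau*v*(1 - spike) + V_rest*spike + X[i*T + t];
    }
}
```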
- ST-TIS Two-Hop Transformer (Graph Attention Pruned to O(n√n)) (Li et al., 2021):
- Multi-head attention only to √n-1 neighbors + hub, with two-hop coverage.
These algorithms are typically plug-in modules requiring minimal change to base model training or architecture.
The spectrum of approaches reviewed demonstrates that lightweight temporal fusion is essential for scalable, real-time, and edge-deployable temporal models. Rational design of temporal fusion modules, sparsity, and selective use of historical information are key to delivering efficient yet accurate solutions across video, robotics, streaming perception, and multimodal domains (Zhao et al., 5 Feb 2026, Li et al., 2022, Meng et al., 2021, Li et al., 2021, Yu et al., 21 Feb 2025, Li et al., 2024, Ni et al., 30 Nov 2025, Zeng et al., 2024, Wang et al., 2019, Lin et al., 2023).