Temporal-Guided VFM (TGVFM) Overview

Updated 16 November 2025
  • The paper introduces TGVFM, a framework that integrates temporal reasoning into VFMs via a novel Temporal Context Fusion Block.
  • It employs long-range temporal attention, dual spatiotemporal attention, and deep feature guidance to effectively fuse asynchronous event data.
  • Empirical results show significant gains, including +16% MIoU in segmentation and improved depth and detection metrics, underscoring the framework's cross-modality impact.

Temporal-Guided Visual Foundation Models (TGVFM) denotes a framework for event-based vision that augments transformer-based Visual Foundation Models (VFMs) with dedicated temporal reasoning modules. Designed for asynchronous event streams from event cameras, TGVFM enables state-of-the-art semantic segmentation, depth estimation, and object detection by introducing a specialized Temporal Context Fusion Block (TCFB) into existing VFM backbones. By efficiently integrating spatiotemporal cues from event streams and leveraging pretrained visual representations, TGVFM addresses the challenge of bridging event-based modalities with large-scale, image-pretrained VFMs (Xia et al., 9 Nov 2025).

1. Motivation and Problem Domain

Event cameras output asynchronous streams of sparse events, providing microsecond-scale temporal information and high dynamic range. However, standard VFMs are built for synchronous RGB images and are thus suboptimal for processing the long, complex temporal sequences inherent to event data. Prior efforts fall into two categories: (i) purpose-built architectures for event streams (e.g., HMNet, EReFormer), or (ii) reconstructing frames from events and forwarding them through image VFMs, an approach that typically overlooks temporal consistency and underperforms on dynamic scenes.

TGVFM is designed to harness pretrained spatial representations of state-of-the-art VFMs (such as Rein for segmentation, Metric3D for depth, and Swin+Cascade R-CNN for detection) and to inject temporal modeling capacity via modular temporal blocks. This approach preserves spatial priors while enabling event-specific temporal fusion, requiring neither bespoke training regimens nor extensive labeled event datasets.

2. Temporal Context Fusion Block (TCFB): Architecture and Mathematical Formulation

The core component of TGVFM is the Temporal Context Fusion Block, strategically inserted between Vision Transformer (ViT) blocks in the VFM backbone. Each TCFB comprises three submodules:

2.1. Long-Range Temporal Attention (LTA):

  • For each spatial location $(h, w)$ in the feature map $f_t \in \mathbb{R}^{H \times W \times C}$ at time $t$, LTA aggregates representations across a window of $k$ past frames.
  • Historical features: $f_{t:t-k}^{h,w} = [f_t^{h,w}, f_{t-1}^{h,w}, \ldots, f_{t-k}^{h,w}] \in \mathbb{R}^{(k+1) \times C}$.
  • Linear projections yield $(Q, K, V)$, e.g. $Q = W^Q f_{t:t-k}^{h,w} \in \mathbb{R}^{(k+1) \times d}$.
  • Self-attention and residual update:

$$\hat{f}_t^{h,w} = \mathrm{Softmax}\!\left( \frac{QK^\top}{\sqrt{d}} \right)V + f_t^{h,w}$$
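
The following PyTorch sketch illustrates this per-location temporal attention under stated assumptions (single attention head, channel-last feature maps, and illustrative class/variable names that do not come from the released code):

```python
# Minimal sketch of Long-Range Temporal Attention (LTA): per-pixel self-attention
# over the current feature map and k buffered past feature maps.
import torch
import torch.nn as nn

class LongRangeTemporalAttention(nn.Module):
    def __init__(self, channels: int, dim: int = 64):
        super().__init__()
        self.q = nn.Linear(channels, dim)
        self.k = nn.Linear(channels, dim)
        self.v = nn.Linear(channels, channels)
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, k+1, H, W, C) -- index 0 is the current frame f_t,
        # indices 1..k are the buffered past frames f_{t-1}..f_{t-k}.
        B, T, H, W, C = feats.shape
        tokens = feats.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
        # Residual update of the current frame only, as in the formula above.
        return feats[:, 0] + out[:, 0]
```

The buffered past frames would be supplied by the memory bank described in Section 4.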

2.2. Dual Spatiotemporal Attention (DSA):

  • (a) Inter-Frame Cross-Attention: Queries from $f_{t-1}$, keys/values from $f_t$, with a residual update as in standard transformers.
  • (b) Local Window Self-Attention: For token $(h, w)$ in $f_t$, a spatial $\delta$-sized window $\Omega_{t-1}^{h,w}$ in the previous frame is aggregated and attention is computed:

$$Q, K, V = W^Q, W^K, W^V \left([f_t^{h,w}, \Omega_{t-1}^{h,w}]\right)$$
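
Branch (a) is standard global cross-attention; the sketch below focuses on branch (b), assuming a single head, a square $\delta \times \delta$ window gathered with `F.unfold`, and placeholder names not taken from the TGVFM code base:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalWindowCrossAttention(nn.Module):
    """Branch (b): each token of f_t attends to itself plus a delta x delta
    neighbourhood of f_{t-1} (keys/values = [f_t^{h,w}, Omega_{t-1}^{h,w}])."""

    def __init__(self, channels: int, delta: int = 3):
        super().__init__()
        assert delta % 2 == 1, "odd window size keeps the neighbourhood centred"
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.delta = delta
        self.scale = channels ** -0.5

    def forward(self, f_t: torch.Tensor, f_prev: torch.Tensor) -> torch.Tensor:
        # f_t, f_prev: (B, C, H, W)
        B, C, H, W = f_t.shape
        cur = f_t.flatten(2).transpose(1, 2)                       # (B, HW, C)
        q = self.q(cur)                                            # queries from f_t
        # Gather the delta x delta window of f_{t-1} around every location.
        win = F.unfold(f_prev, self.delta, padding=self.delta // 2)
        win = win.view(B, C, self.delta ** 2, H * W).permute(0, 3, 2, 1)  # (B, HW, d^2, C)
        ctx = torch.cat([cur.unsqueeze(2), win], dim=2)            # prepend f_t^{h,w}
        k, v = self.k(ctx), self.v(ctx)                            # (B, HW, 1+d^2, C)
        attn = torch.softmax((q.unsqueeze(2) * k).sum(-1) * self.scale, dim=-1)
        out = (attn.unsqueeze(-1) * v).sum(dim=2)                  # (B, HW, C)
        return f_t + out.transpose(1, 2).reshape(B, C, H, W)       # residual update
```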

2.3. Deep Feature Guidance Mechanism (DFGM):

  • To improve semantic stability, decoder feature maps $\mathbf{F}_{t-1:t-k} \in \mathbb{R}^{H \times W \times C}$ from the past $k$ frames are projected into guidance features $\mathbf{G}_{t-1:t-k}$ and added to the shallow temporal features:

$$\widetilde{f}_{t-1:t-k} = f_{t-1:t-k} + \mathbf{G}_{t-1:t-k}$$

  • These fused features are subsequently used for all downstream temporal operations.
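
A minimal sketch of this guidance step, assuming a 1x1 convolutional projection (the exact projection layer is not specified here) and illustrative names:

```python
import torch
import torch.nn as nn

class DeepFeatureGuidance(nn.Module):
    def __init__(self, dec_channels: int, channels: int):
        super().__init__()
        # Assumed 1x1 projection mapping decoder maps F_{t-1:t-k} to guidance G_{t-1:t-k}.
        self.proj = nn.Conv2d(dec_channels, channels, kernel_size=1)

    def forward(self, shallow_feats: torch.Tensor, dec_feats: torch.Tensor) -> torch.Tensor:
        # shallow_feats: (B, k, C, H, W)  temporal features f_{t-1:t-k}
        # dec_feats:     (B, k, Cd, H, W) decoder features F_{t-1:t-k} from past frames
        B, K, Cd, H, W = dec_feats.shape
        guidance = self.proj(dec_feats.flatten(0, 1)).view(B, K, -1, H, W)
        return shallow_feats + guidance   # tilde{f}_{t-1:t-k} = f_{t-1:t-k} + G_{t-1:t-k}
```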

Each TCFB is wrapped in a zero-init linear residual layer:

$$\widetilde{f}_{\text{out}} = f_{\text{out}}^{\text{(VFM)}} + \mathrm{Linear}_{0\text{-init}}\left(\mathrm{TCFB}(f_{\text{in}})\right)$$

This design preserves the behavior of pretrained weights at initialization and allows a smooth induction of temporal reasoning as training progresses.
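
A compact sketch of this wrapper, with assumed module names, shows why the pretrained behavior is preserved at initialization:

```python
import torch
import torch.nn as nn

class ZeroInitResidual(nn.Module):
    """Wraps a TCFB so that the pretrained ViT block is untouched at step 0."""

    def __init__(self, tcfb: nn.Module, channels: int):
        super().__init__()
        self.tcfb = tcfb
        self.out_proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.out_proj.weight)   # zero-init => TCFB contributes nothing initially
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, f_in: torch.Tensor, f_out_vfm: torch.Tensor) -> torch.Tensor:
        # f_out_vfm: output of the pretrained ViT block; f_in: its input tokens.
        return f_out_vfm + self.out_proj(self.tcfb(f_in))
```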

3. Event-to-Frame Conversion and Input Preparation

The pipeline for converting asynchronous event data to TGVFM input proceeds as follows:

  • Voxel-Grid Representation: Incoming events in $[t-\Delta, t]$ are binned into $C$ temporal intervals, yielding $e_t \in \mathbb{R}^{H \times W \times C}$ by accumulating event polarities per bin (a short sketch follows after this list).
  • E2VID Reconstruction: A modified E2VID (ConvLSTM/GRU U-Net) takes $e_t$ and the previous hidden state $s_{t-1}$ and outputs a grayscale frame $i_t$:

$$(i_t, s_t) = f_{\text{E2VID}}(e_t, s_{t-1}), \quad i_t \in \mathbb{R}^{H \times W}$$

  • Sequential Input: The reconstructed frames $i_t$ are processed sequentially through the TGVFM backbone, with a memory bank $\mathcal{M}$ maintaining past features for temporal fusion at the TCFB layers.
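
To make the voxel-grid step concrete, the following hedged sketch bins polarity-signed events into $C$ temporal channels by hard assignment; the exact binning/interpolation scheme used in the paper may differ:

```python
import torch

def events_to_voxel_grid(xs, ys, ts, ps, H, W, C=5):
    """xs, ys: integer pixel coordinates; ts: timestamps within [t - Delta, t];
    ps: polarities in {-1, +1}; returns e_t of shape (C, H, W)."""
    voxel = torch.zeros(C, H, W)
    t_norm = (ts - ts.min()) / (ts.max() - ts.min() + 1e-9)   # normalise times to [0, 1]
    bins = torch.clamp((t_norm * C).long(), max=C - 1)        # hard temporal bin per event
    idx = bins * H * W + ys.long() * W + xs.long()            # flattened (bin, y, x) index
    voxel.view(-1).index_add_(0, idx, ps.float())             # accumulate signed polarities
    return voxel
```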

Empirical results show that direct E2VID frames outperform raw voxel grids or time-surface inputs by over 5% MIoU, underscoring the importance of high-fidelity event-to-frame reconstruction.

4. Backbone Integration, Memory Management, and Parameter Sharing

TGVFM executes hierarchical temporal fusion via repeated TCFBs at multiple levels:

  • Backbone Models: Segmentation (Rein, ViT-S/B), depth (Metric3D, ViT-S/B), detection (Swin-S + Cascade R-CNN).
  • Block Placement: Four TCFBs are interleaved among 12 ViT blocks, distributing temporal reasoning across early, mid, and late feature hierarchies.
  • Memory Bank: At each TCFB, the last kk features are buffered and shared among all TCFB invocations, limiting memory usage and computational overhead.
  • Parameter Sharing: All TCFB modules use shared parameters; this reduces the memory footprint by ~75% with a negligible performance drop (less than 0.2% MIoU).

The unified memory mechanism enables TGVFM to balance computational efficiency and temporal depth; ablation results indicate diminishing returns from wider memory windows ($k > 3$).
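
A toy sketch of such a shared memory bank (illustrative class and method names, not the released implementation):

```python
from collections import deque
import torch

class TemporalMemoryBank:
    """Buffers the last k features per TCFB level; all TCFBs share one bank."""

    def __init__(self, k: int = 3):
        self.k = k
        self.buffers = {}                               # level -> deque of past features

    def read(self, level: int):
        return list(self.buffers.get(level, ()))        # up to k past features, oldest first

    def write(self, level: int, feat: torch.Tensor):
        buf = self.buffers.setdefault(level, deque(maxlen=self.k))
        buf.append(feat.detach())                       # detach to bound the autograd graph
```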

5. Training Objectives, Cross-Modality Distillation, and Implementation

5.1. Loss Functions:

  • Segmentation: Pixel-wise cross-entropy,

$$\mathcal{L}_{\text{seg}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log p_{i,c}$$

  • Depth: Scale-invariant log loss (SiLog),

$$g_i = \log d_i - \log \hat{d}_i; \quad \mathcal{L}_{\text{depth}} = \sqrt{\frac{1}{n}\sum_i g_i^2 - \frac{\lambda}{n^2} \left( \sum_i g_i \right)^2}$$

  • Detection: Cascade R-CNN losses combining binary cross-entropy for classification and $\ell_1$ for bounding-box regression,

$$\mathcal{L}_{\text{det}} = \sum_{k=1}^{K} \left( \mathcal{L}_{\text{cls}}^{(k)} + \mathcal{L}_{\text{box}}^{(k)} \right)$$
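
Of these, the SiLog term is the least standard to implement; a reference-style sketch matching the formula above is shown below, where `lambda_ = 0.85` is a common choice and an assumption rather than a value quoted from the paper:

```python
import torch

def silog_loss(pred_depth: torch.Tensor, gt_depth: torch.Tensor,
               lambda_: float = 0.85, eps: float = 1e-6) -> torch.Tensor:
    valid = gt_depth > 0                                 # ignore pixels without ground truth
    g = torch.log(gt_depth[valid] + eps) - torch.log(pred_depth[valid] + eps)
    n = g.numel()
    return torch.sqrt((g ** 2).sum() / n - lambda_ * g.sum() ** 2 / n ** 2)
```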

5.2. Cross-Modality Distillation:

  • A teacher VFM (RGB-based) generates pseudo-labels for the reconstructed frames $i_t$.
  • The student network uses an L1 loss on segmentation probabilities, SiLog for depth, and filtered detection losses, enabling event-trained models to inherit the semantic richness of image-pretrained VFMs.
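
A hedged sketch of the segmentation branch of this distillation, with placeholder model names; whether the teacher consumes the paired RGB frame or the reconstruction itself is treated as an assumption here:

```python
import torch
import torch.nn.functional as F

def seg_distillation_loss(student, teacher, recon_frame, teacher_input):
    """Placeholder names: `student` is the event-based TGVFM head, `teacher` a frozen
    RGB-pretrained VFM, `recon_frame` the E2VID reconstruction i_t."""
    with torch.no_grad():                                     # teacher provides pseudo-labels
        target_probs = torch.softmax(teacher(teacher_input), dim=1)
    student_probs = torch.softmax(student(recon_frame), dim=1)
    return F.l1_loss(student_probs, target_probs)             # L1 on class probabilities
```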

5.3. Implementation Details:

  • Hyperparameters: $C = 5$ voxel bins, E2VID window $\Delta = 50$ ms, memory window $k = 3$, 4 TCFBs among 12 ViT blocks.
  • Training: 40K iterations for TGVFM, 50K for E2VID retraining, both with batch size 2 on a single NVIDIA L40S GPU, AdamW optimizer, warmup + cosine decay.
  • Code available in modular form: event preprocessing, TCFB modules, backbone wrappers, task-specific training/evaluation scripts.
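
For reference, the reported hyperparameters can be collected into a small configuration object; the field names below are illustrative, while the values are those stated above:

```python
from dataclasses import dataclass

@dataclass
class TGVFMConfig:                     # illustrative field names, reported values
    voxel_bins: int = 5                # C temporal bins per event window
    e2vid_window_ms: float = 50.0      # Delta, event window length
    memory_k: int = 3                  # buffered past frames per TCFB level
    num_tcfb: int = 4                  # TCFBs interleaved among the ViT blocks
    num_vit_blocks: int = 12
    batch_size: int = 2
    tgvfm_iters: int = 40_000
    e2vid_iters: int = 50_000
    optimizer: str = "AdamW"           # with warmup + cosine decay schedule
```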

6. Empirical Results and Ablations

6.1. Datasets and Metrics:

  • DSEC benchmark for urban driving with events, RGB, and LiDAR for either full or distilled supervision.
  • Evaluation metrics: MIoU for segmentation; $\delta_{1,2,3}$, REL, and RMS for depth; mAP, AP$_{50}$, AP$_{75}$, and AP$_{S/M/L}$ for detection.
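
For completeness, the depth metrics follow their standard definitions; the sketch below uses the conventional 1.25 thresholds and is not taken from the paper's evaluation code:

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> dict:
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    ratio = torch.maximum(pred / (gt + eps), gt / (pred + eps))
    return {
        "delta1": (ratio < 1.25).float().mean().item(),
        "delta2": (ratio < 1.25 ** 2).float().mean().item(),
        "delta3": (ratio < 1.25 ** 3).float().mean().item(),
        "REL": ((pred - gt).abs() / (gt + eps)).mean().item(),   # mean absolute relative error
        "RMS": torch.sqrt(((pred - gt) ** 2).mean()).item(),     # root-mean-square error
    }
```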

6.2. Quantitative Improvements:

Task / Metric         Baseline        TGVFM (best)   Gain
Segmentation (MIoU)   59.94 (ECDDP)   69.12          +16% (relative)
Depth (REL)           0.111           0.092          −17.1% (relative)
Detection (mAP)       38.0            47.7           +9.7 points
  • Nighttime segmentation gains +21% MIoU vs. CMDA; depth REL decreases by 20%; detection improves by 16% on medium objects.
  • Ablations show cumulative improvements from LTA (+2.01 MIoU), DSA (+1.93), DFGM (+2.45), and all combined (+3.13).
  • Performance saturates around the E2VID B3 variant (6.8M parameters), with direct E2VID frames outperforming raw event input.

6.3. Implementation-Efficient Features:

  • Zero-init residuals stabilize convergence and preserve pretrained spatial representations during early optimization.
  • Shared TCFB parameters and bounded memory windows deliver strong performance with efficient resource usage.

7. Significance and Broader Context

TGVFM bridges the gap between event-based and frame-based vision by allowing image-pretrained VFMs to operate effectively on asynchronous, temporally dense event data. Its plug-and-play modularity allows insertion into existing VFM pipelines without major retraining or architectural redesign. The framework’s empirical superiority across segmentation, depth, and detection on DSEC demonstrates both its flexibility and effectiveness.

This approach opens possibilities for further research on multi-timescale temporal aggregation, parameter-efficient temporal reasoning, and distillation between high-performing image and event domains. These directions point to the broader cross-modality potential of TGVFM-style architectures, particularly in environments requiring both temporal acuity and semantic richness, such as robotics, autonomous driving, and industrial inspection.

Source code and pretrained models are provided at https://github.com/XiaRho/TGVFM (Xia et al., 9 Nov 2025).
