
Long-Term Feature Bank for Video Understanding

Updated 28 December 2025
  • Long-Term Feature Bank is a memory mechanism that aggregates deep feature maps from video clips to enable detailed video understanding over extended periods.
  • LFB integrates long-term context with attention-based or pooling operators to enhance action localization, classification, and egocentric analysis, achieving state-of-the-art performance.
  • LFB systems rely on efficient offline computation and storage methods, often using low-rank SVD techniques, to manage the high computational and memory demands.

A Long-Term Feature Bank (LFB) is a memory mechanism designed for detailed video understanding, enabling models to access supportive information extracted over extensive temporal ranges—often spanning minutes of video. By augmenting conventional short-term video recognition backbones, such as 3D convolutional networks, with an explicit, persistent repository of per-clip features, LFB architectures support fine-grained spatio-temporal reasoning critical for tasks like action localization, video classification, and egocentric video analysis. LFBs have been central to achieving state-of-the-art results in large-scale video understanding benchmarks. Their use was established by Wu et al. (Wu et al., 2018), with subsequent developments such as low-rank and incremental memory approximations improving efficiency (Ntinou et al., 2024).

1. Core Definition and Construction

The LFB is instantiated as a time-indexed sequence of deep feature maps, each summarizing the output of a pretrained backbone (typically a 3D-CNN) from regularly sampled short clips or frames:

  • For a video of $T$ frames (or clips), indexed by $t = 0, \dots, T-1$, input segments $x_t \in \mathbb{R}^{C \times \tau \times H \times W}$ are encoded via a 3D-CNN $\varphi(\cdot;\theta)$, producing downsampled feature maps $F_t$.
  • Candidate actors or regions-of-interest (RoIs) are detected at each time step, yielding bounding boxes $\{b_{t,1}, \dots, b_{t,N_t}\}$, where $N_t$ denotes the number of detections at time $t$.
  • RoI pooling (e.g., RoIAlign) is applied to each box, followed by spatial and temporal average pooling, producing per-detection feature vectors $s_{t,j} \in \mathbb{R}^D$.
  • This results in $L_t \in \mathbb{R}^{N_t \times D}$, with $L = \{L_0, L_1, \dots, L_{T-1}\}$ forming the Long-Term Feature Bank for the entire video.
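
The shapes above can be made concrete with a toy sketch; random vectors stand in for the real 3D-CNN, detector, and RoIAlign outputs (`fake_roi_features` and the detection counts are hypothetical stand-ins, not the actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048  # per-detection embedding size, as in the paper

def fake_roi_features(n_detections, dim):
    # Stand-in for CNN -> RoIAlign -> avg-pool: one dim-vector per detection
    return rng.standard_normal((n_detections, dim)).astype(np.float32)

# N_t varies per time step; each L_t has shape (N_t, D)
detections_per_step = [3, 1, 4]
L = [fake_roi_features(n, D) for n in detections_per_step]

assert L[2].shape == (4, 2048)  # L_t lies in R^{N_t x D}
```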

To ensure efficient storage and computation, the feature bank is typically materialized at regular intervals $\Delta t$ (e.g., one clip per second), and stored on disk or CPU memory. The dimension $D$ is usually large (e.g., $D = 2048$) to capture rich context required by downstream tasks (Wu et al., 2018).

2. Feature Retrieval and Integration

At test or training time, the short-term backbone processes a local window (e.g., 32–64 frames), producing current RoI-pooled features $S_t \in \mathbb{R}^{N_t \times D}$. The LFB is queried for a context window of size $w$ centered at $t$:

  • The query window is $[t-w, t+w]$, collecting all detections to form $\widetilde{L}_t \in \mathbb{R}^{N \times D}$, with $N = \sum_{t'=t-w}^{t+w} N_{t'}$.
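
This window query can be sketched as a simple gather over the stored bank (a minimal sketch, assuming `LFB` is a list of `(timestamp, features)` pairs as constructed above):

```python
import numpy as np

def query_lfb_window(LFB, t, w):
    """Collect all detections with timestamps in [t - w, t + w].

    LFB: list of (timestamp, features) pairs, features of shape (N_t', D).
    Returns the stacked window of shape (N, D), with N the sum of N_t'.
    """
    window = [feats for ts, feats in LFB if t - w <= ts <= t + w]
    if not window:
        return np.empty((0, 0), dtype=np.float32)
    return np.concatenate(window, axis=0)
```

The returned matrix plays the role of the long-term context that the feature bank operator attends over.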

Feature integration is performed with a Feature Bank Operator (FBO). The standard approach uses a non-local attention mechanism:

$$
\begin{aligned}
Q &= S_t W_q, \qquad K = \widetilde{L}_t W_k, \qquad V = \widetilde{L}_t W_v, \\
A &= \mathrm{softmax}\left(Q K^\top / \sqrt{d_k}\right), \\
S'_t &= S_t + A V W_o,
\end{aligned}
$$

where $W_q, W_k, W_v \in \mathbb{R}^{D \times d_k}$ and $W_o \in \mathbb{R}^{d_k \times D}$ are learned projection matrices and $d_k$ is a reduced probe dimension (e.g., $d_k = 512$) (Wu et al., 2018). Optionally, simpler pooling operators (average/max) on $\widetilde{L}_t$ can be used as a baseline variant.
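
The attention equations translate almost line-for-line into NumPy. This is a minimal sketch of a single non-local block; it omits the layer normalization, dropout, and block stacking used in the full FBO of Wu et al.:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fbo_nl(S_t, L_tilde, Wq, Wk, Wv, Wo):
    """Non-local FBO: S'_t = S_t + softmax(Q K^T / sqrt(d_k)) V Wo."""
    d_k = Wq.shape[1]
    Q = S_t @ Wq        # (N_t, d_k) queries from short-term features
    K = L_tilde @ Wk    # (N,   d_k) keys from the long-term window
    V = L_tilde @ Wv    # (N,   d_k) values from the long-term window
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N_t, N) weights
    return S_t + (A @ V) @ Wo                     # residual, back to (N_t, D)
```

Each short-term actor feature (row of `S_t`) attends over every detection in the long-term window, and the result is added back residually.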

3. Implementation and Computational Considerations

The LFB pipeline is characterized by a clear separation of long-term (video-wide) and short-term (local clip) computation. Feature extraction for the LFB is fully decoupled and can be performed offline, amortizing cost across deployments:

  • Storage and I/O requirements are dominated by the number of detections and embedding size, i.e., $\sum_t N_t \cdot D$ floats per video.
  • For efficient batching during model training, zero-padding is used so all windows have fixed $N$.
  • The memory cost on GPU grows with the window size (due to the size of $\widetilde{L}_t$) but can be bounded by keeping only $2w+1$ time steps in memory at a time.
  • For large-scale deployment or longer temporal support, innovations such as low-rank SVD-based memory (e.g., MeMSVD (Ntinou et al., 2024)) compress the LFB to a basis $U_\text{mem} \in \mathbb{R}^{n_c \times d}$ (with $n_c \ll N$) and singular values, dramatically reducing inference and update complexity.
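
A back-of-envelope estimate of the per-video footprint, using assumed figures (one bank entry per second over a 15-minute video, about 5 detections per step, float32 features):

```python
# Hypothetical sizing; only D = 2048 comes from the paper.
T = 15 * 60   # bank entries: one per second of a 15-minute video
N_t = 5       # assumed average detections per time step
D = 2048      # embedding size
size_mb = T * N_t * D * 4 / 2**20  # 4 bytes per float32
print(round(size_mb, 1))  # → 35.2 (MB per video)
```

At this scale the bank fits comfortably on disk or in CPU memory, which is why offline materialization is practical.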

Pseudocode for training and inference illustrates precomputing the LFB, efficient window selection, and attention-based updates (Wu et al., 2018):

# Offline: precompute the feature bank at stride delta_t over the video.
LFB = []
for t in range(0, T, delta_t):
    clip = load_clip(frames[max(t - tau // 2, 0) : t + tau // 2])
    f = CNN(clip)                                    # 3D-CNN feature map F_t
    detections = DETECTOR(frames[t])                 # person/RoI boxes at time t
    features = [RoIAlign(f, box) for box in detections]
    LFB.append((t, np.stack(features)))              # L_t with shape (N_t, D)

# Online: short-term features attend over the long-term window.
for t in test_times:
    S_t = extract_short_term_features(t)             # (N_t, D)
    L_window = concatenate_LFB_window(LFB, t, w)     # detections in [t-w, t+w]
    U_t = FBO(S_t, L_window)                         # non-local attention or pooling
    scores = classifier(U_t)
    ...

4. Empirical Performance and Evaluation

LFB-augmented models yield significant improvements across diverse video understanding benchmarks (Wu et al., 2018), with empirical results demonstrating substantial gains over strong short-term baselines:

Dataset                  Baseline   +LFB (NL, 60s window)   Δ (mAP / top-1)
AVA (mAP)                22.1       25.5                    +3.4
EPIC-Kitchens (top-1)    19.0       22.8                    +3.8
Charades (mAP)           38.3       40.3                    +2.0
  • On AVA, a 3D CNN (Res50-I3D-NL) achieves 22.1% mAP, which rises to 25.5% with LFB. Enhancement with larger backbones and test-time augmentation yields 27.7% mAP.
  • For EPIC-Kitchens, top-1 action accuracy improves from 19.0% to 22.8% (+3.8 points).
  • For Charades, mAP increases from 38.3% to 40.3% with the addition of LFB.
  • Results are computed using mAP (mean Average Precision) for detection/classification, and top-k accuracy for egocentric action recognition.

5. Comparison with Alternative Memory Mechanisms

The conventional LFB approach relies on attention-based integration, but this introduces substantial computational and memory overhead, especially for large window sizes:

  • Attention cost per actor/query is $O(N_\mathrm{mem} \cdot d_u)$, and the entire $N_\mathrm{mem} \times d$ memory bank must be resident.
  • To address these costs, MeMSVD (Ntinou et al., 2024) replaces attention with a low-rank SVD basis. Feature memories are compressed to $n_c \ll N_\mathrm{mem}$ principal components. Actor features are projected into and reconstructed from this basis, yielding a linear, parameter-free integration:

$$\alpha = h_{t,s} \cdot U_\text{mem}^\top, \qquad h' = \alpha \cdot U_\text{mem}, \qquad h_{t,s} \leftarrow h_{t,s} + h'$$

  • MeMSVD achieves similar or superior accuracy with 10–20× fewer FLOPs in the memory integration, constant runtime with increasing window length, and much lower parameter count.
  • Incremental updates to the SVD basis can be performed efficiently ($O(n_c^2)$ per update), enabling near-real-time adaptation to streaming input.
  • On AVA and Charades, MeMSVD matches or outperforms standard attention-LFBs (e.g., 43.0% mAP on Charades with MViTv2-B 16×4, establishing new state-of-the-art performance).
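
The projection equations can be sketched with a plain batch SVD. This is a simplified illustration of the low-rank idea only; the actual MeMSVD basis construction and its incremental updates differ in detail:

```python
import numpy as np

def svd_memory(L_window, n_c):
    """Top-n_c right singular vectors of the (N, D) memory window.

    Returns U_mem of shape (n_c, D): an orthonormal basis spanning the
    principal subspace of the stored features (batch SVD stand-in for
    the incrementally maintained basis of the paper).
    """
    _, _, Vt = np.linalg.svd(L_window, full_matrices=False)
    return Vt[:n_c]

def integrate(h, U_mem):
    """alpha = h U_mem^T;  h' = alpha U_mem;  h <- h + h'  (parameter-free)."""
    alpha = h @ U_mem.T        # coefficients of h in the memory subspace
    return h + alpha @ U_mem   # residual add of the low-rank reconstruction
```

Because integration is just two matrix products against an `(n_c, D)` basis, its cost is independent of the original window length `N`.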

6. Experimental Design and Protocols

Key experimental details for LFB-based models include:

  • Datasets: AVA (spatio-temporal action labels per person at 1 FPS across 15-minute videos), EPIC-Kitchens (action recognition from egocentric viewpoint with verb–noun labels), and Charades (video-level multilabel action recognition).
  • Performance metrics: mean Average Precision (mAP) over classes, and top-1/top-5 accuracy as appropriate.
  • LFB window size and granularity: empirically, optimal temporal support occurs around $\pm 30$ seconds (window size $\sim$61 s), with diminishing returns beyond this.
  • Padding and batching: all memory windows are zero-padded to a uniform size for efficient batch processing.
  • All LFB computations (including FBO and classifier) are trained end-to-end except for the fixed backbone features in the LFB.
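
The padding step can be sketched as follows (a minimal sketch; `pad_windows` and its mask convention are illustrative, not taken from the papers):

```python
import numpy as np

def pad_windows(windows, n_max, dim):
    """Zero-pad variable-size memory windows to (n_max, dim) for batching.

    Returns the padded batch and a boolean mask marking real entries,
    so downstream attention or pooling can ignore the padding.
    """
    batch = np.zeros((len(windows), n_max, dim), dtype=np.float32)
    mask = np.zeros((len(windows), n_max), dtype=bool)
    for i, w in enumerate(windows):
        n = min(len(w), n_max)
        batch[i, :n] = w[:n]
        mask[i, :n] = True
    return batch, mask
```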

7. Context, Adoption, and Limitations

The LFB has become foundational in video recognition, enabling models to reason over much longer temporal horizons than would be possible with memory- or compute-constrained backbones alone. This design extends the horizon of deep action detectors to minutes of context, supporting nuanced spatio-temporal queries and disambiguating temporally extended actions.

Limitations include the cost of storing and retrieving high-dimensional feature banks for large datasets or long videos, as well as the scalability of attention mechanisms with increasing context length. Compressing the memory (e.g., via low-rank SVD) is effective in alleviating these constraints. A persistent limitation is the reliance on accurate region proposals from person/object detectors, with representational quality of the LFB potentially bottlenecked by detection recall or the diversity of backbone features (Wu et al., 2018, Ntinou et al., 2024).

A plausible implication is that as architectures for video understanding continue to mature, the integration of persistent, scalable feature banks—potentially combined with online or streaming memory mechanisms—will remain a core strategy for leveraging context at scale.

References (2)
