Long-Term Feature Bank for Video Understanding
- Long-Term Feature Bank is a memory mechanism that aggregates deep feature maps from video clips to enable detailed video understanding over extended periods.
- LFB integrates long-term context with attention-based or pooling operators to enhance action localization, classification, and egocentric analysis, achieving state-of-the-art performance.
- LFB systems rely on efficient offline computation and storage methods, often using low-rank SVD techniques, to manage the high computational and memory demands.
A Long-Term Feature Bank (LFB) is a memory mechanism designed for detailed video understanding, enabling models to access supportive information extracted over extensive temporal ranges—often spanning minutes of video. By augmenting conventional short-term video recognition backbones, such as 3D convolutional networks, with an explicit, persistent repository of per-clip features, LFB architectures support fine-grained spatio-temporal reasoning critical for tasks like action localization, video classification, and egocentric video analysis. LFBs have been central to achieving state-of-the-art results in large-scale video understanding benchmarks, with their use prominently established by Wu et al. (Wu et al., 2018) and further developments such as low-rank and incremental memory approximations for efficiency (Ntinou et al., 2024).
1. Core Definition and Construction
The LFB is instantiated as a time-indexed sequence of deep feature maps, each summarizing the output of a pretrained backbone (typically a 3D-CNN) from regularly sampled short clips or frames:
- For a video of $T$ seconds (or clips), indexed by $t$, input segments are encoded via a 3D-CNN, producing downsampled feature maps $f_t$.
- Candidate actors or regions-of-interest (RoIs) are detected at each time step, yielding bounding boxes $\{b_{t,i}\}_{i=1}^{N_t}$, where $N_t$ denotes the number of detections at time $t$.
- RoI pooling (e.g., RoIAlign) is applied to each box, followed by spatial and temporal average pooling, producing per-detection feature vectors $\ell_{t,i} \in \mathbb{R}^d$.
- This results in $L_t = [\ell_{t,1}, \ldots, \ell_{t,N_t}]$, with $L = \{L_0, \ldots, L_{T-1}\}$ forming the Long-Term Feature Bank for the entire video.
To ensure efficient storage and computation, the feature bank is typically materialized at regular intervals (e.g., one clip per second), and stored on disk or CPU memory. The feature dimension $d$ is usually large (e.g., $d = 2048$ for ResNet-style backbones) to capture the rich context required by downstream tasks (Wu et al., 2018).
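As a back-of-envelope illustration of this storage cost, the following sketch assumes a 15-minute video sampled at one clip per second, roughly five detections per clip, and fp32 features with $d = 2048$; all values are illustrative assumptions, not figures from the papers.

```python
# Hypothetical settings: 15-minute video, 1 clip/s, ~5 RoIs per clip, d = 2048, fp32.
clips = 15 * 60          # one bank entry per second
dets_per_clip = 5        # average detections per time step
d = 2048                 # per-detection feature dimension
bytes_per_float = 4      # fp32 storage

n_floats = clips * dets_per_clip * d
megabytes = n_floats * bytes_per_float / 1e6
print(f"{n_floats} floats ≈ {megabytes:.1f} MB per video")  # ≈ 36.9 MB
```

At this scale a single bank fits easily on disk or in CPU memory, which is why offline precomputation is practical.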
2. Feature Retrieval and Integration
At test or training time, the short-term backbone processes a local window (e.g., 32–64 frames), producing current RoI-pooled features $S_t$. The LFB is queried for a context window of size $2w+1$ centered at $t$:
- The query window is $[t-w, t+w]$, collecting all detections to form $\tilde{L}_t = [L_{t-w}, \ldots, L_{t+w}]$, with $\tilde{L}_t \in \mathbb{R}^{\tilde{N}_t \times d}$, where $\tilde{N}_t = \sum_{t'=t-w}^{t+w} N_{t'}$.
Feature integration is performed with a Feature Bank Operator (FBO). The standard approach uses a non-local attention mechanism:

$$U_t = \mathrm{softmax}\!\left(\frac{\theta(S_t)\,\phi(\tilde{L}_t)^\top}{\sqrt{d'}}\right) g(\tilde{L}_t),$$

where $\theta$, $\phi$, and $g$ are learned linear projections and $d'$ is a reduced probe dimension (e.g., $d' = 512$) (Wu et al., 2018). Optionally, simpler pooling operators (average/max) on $\tilde{L}_t$ can be used as a baseline variant.
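An attention-style FBO of this kind can be sketched in a few lines of NumPy. The random matrices `theta`, `phi`, and `g` stand in for the learned projections, and all sizes are illustrative assumptions rather than the exact configuration of Wu et al.

```python
import numpy as np

# Sketch of an attention-style Feature Bank Operator (FBO). The random
# matrices theta/phi/g stand in for learned linear projections.
rng = np.random.default_rng(0)
d, d_red = 2048, 512          # feature dim and reduced probe dim
n_short, n_bank = 3, 100      # detections in the clip vs. in the LFB window

theta = rng.standard_normal((d, d_red)) / np.sqrt(d)  # query projection
phi = rng.standard_normal((d, d_red)) / np.sqrt(d)    # key projection
g = rng.standard_normal((d, d_red)) / np.sqrt(d)      # value projection

S_t = rng.standard_normal((n_short, d))  # short-term RoI features
L_t = rng.standard_normal((n_bank, d))   # windowed long-term bank features

# Scaled dot-product attention from short-term queries into the bank.
logits = (S_t @ theta) @ (L_t @ phi).T / np.sqrt(d_red)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
U_t = attn @ (L_t @ g)  # (n_short, d_red) context-enriched actor features
```

Each row of `attn` is a distribution over bank entries, so every actor feature becomes a convex combination of projected long-term context.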
3. Implementation and Computational Considerations
The LFB pipeline is characterized by a clear separation of long-term (video-wide) and short-term (local clip) computation. Feature extraction for the LFB is fully decoupled and can be performed offline, amortizing cost across deployments:
- Storage and I/O requirements are dominated by the number of detections and the embedding size, i.e., roughly $\left(\sum_t N_t\right) \times d$ floats per video.
- For efficient batching during model training, zero-padding is used so all windows have a fixed number of entries.
- The memory cost on GPU grows with the window size (due to the size of the collected window features) but can be bounded by keeping only $2w+1$ time steps in memory at a time.
- For large-scale deployment or longer temporal support, innovations such as low-rank SVD-based memory (e.g., MeMSVD (Ntinou et al., 2024)) compress the LFB to an orthonormal basis of $k$ principal directions (with $k$ far smaller than the number of stored features) and the corresponding singular values, dramatically reducing inference and update complexity.
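The low-rank idea can be sketched in NumPy, assuming the bank is summarized by its top-$k$ left singular vectors; sizes and rank are illustrative, and this is a simplification of the actual MeMSVD operator.

```python
import numpy as np

# Low-rank memory sketch: summarize the bank by its top-k left singular
# vectors and integrate queries by project-and-reconstruct.
rng = np.random.default_rng(1)
d, n, k = 256, 1000, 16           # feature dim, bank entries, retained rank

M = rng.standard_normal((d, n))   # memory matrix: one column per bank feature
U, s, _ = np.linalg.svd(M, full_matrices=False)
U_k = U[:, :k]                    # compressed basis (d x k)

s_t = rng.standard_normal(d)      # an actor/query feature
s_hat = U_k @ (U_k.T @ s_t)       # project into and reconstruct from the basis
# Per-query cost is O(d * k), independent of the window length n.
```

Because only the $d \times k$ basis is retained, the per-query cost no longer grows with the temporal extent of the memory.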
Pseudocode for training and inference illustrates precomputing the LFB, efficient window selection, and attention-based updates (Wu et al., 2018):
```python
# Stage 1 (offline): precompute the Long-Term Feature Bank.
LFB = []
for t in range(0, T, dt):                      # one bank entry every dt frames
    f = CNN(load_clip(frames[t - tau // 2 : t + tau // 2]))
    detections = DETECTOR(frames[t])
    features = [RoIAlign(f, box) for box in detections]
    LFB.append((t, np.stack(features)))

# Stage 2 (online): query a window of the bank around each evaluation time.
for t in test_times:
    S_t = extract_short_term_features(t)
    L_tilde_t = concatenate_LFB_window(LFB, t, w)  # entries in [t - w, t + w]
    U_t = FBO(S_t, L_tilde_t)                      # attention or pooling operator
    scores = classifier(U_t)
    # ...
```
4. Empirical Performance and Evaluation
LFB-augmented models yield significant improvements across diverse video understanding benchmarks (Wu et al., 2018), with empirical results demonstrating substantial gains over strong short-term baselines:
| Dataset | Baseline | +LFB (NL, 60s window) | Δ (mAP / top-1) |
|---|---|---|---|
| AVA (mAP) | 22.1 | 25.5 | +3.4 |
| EPIC-Kitchens | 19.0 | 22.8 | +3.8 |
| Charades (mAP) | 38.3 | 40.3 | +2.0 |
- On AVA, a 3D CNN (Res50-I3D-NL) achieves 22.1% mAP, which rises to 25.5% with LFB. Larger backbones and test-time augmentation further raise this to 27.7% mAP.
- For EPIC-Kitchens, top-1 action accuracy improves from 19.0% to 22.8% (+3.8 points).
- For Charades, mAP increases from 38.3% to 40.3% with the addition of LFB.
- Results are computed using mAP (mean Average Precision) for detection/classification, and top-k accuracy for egocentric action recognition.
5. Comparison with Alternative Memory Mechanisms
The conventional LFB approach relies on attention-based integration, but this introduces substantial computational and memory overhead, especially for large window sizes:
- Attention cost per actor/query scales linearly with the number of bank entries in the window, and the entire memory bank must be resident.
- To address these costs, MeMSVD (Ntinou et al., 2024) replaces attention with a low-rank SVD basis. Feature memories are compressed to $k$ principal components $U_k$. Actor features are projected into and reconstructed from this basis, yielding a linear, parameter-free integration of the form $U_k U_k^\top S_t$.
- MeMSVD achieves similar or superior accuracy with 10–20× fewer FLOPs in the memory integration, constant runtime with increasing window length, and much lower parameter count.
- Incremental updates to the SVD basis can be performed efficiently, at a per-update cost governed by the retained rank $k$ rather than the full window length, enabling near-real-time adaptation to streaming input.
- On AVA and Charades, MeMSVD matches or outperforms standard attention-LFBs (e.g., 43.0\% mAP on Charades with MViTv2-B 16×4, establishing new state-of-the-art performance).
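The incremental-update point can be illustrated with a standard Brand-style SVD update, which folds one new memory column into an existing factorization via a small $(k+1)\times(k+1)$ SVD. This is a generic sketch of the technique, not the exact MeMSVD procedure, and all sizes are illustrative.

```python
import numpy as np

# Brand-style incremental SVD: absorb a new column without redecomposing
# the whole memory matrix. Assumes the residual of the new column is nonzero.
rng = np.random.default_rng(2)
d, n, k = 6, 4, 4
M = rng.standard_normal((d, n))
U, s, Vt = np.linalg.svd(M, full_matrices=False)   # exact rank-4 factorization

c = rng.standard_normal(d)        # new bank feature to absorb
p = U.T @ c                       # component inside the current basis
r = c - U @ p                     # residual orthogonal to the basis
rho = np.linalg.norm(r)

# Small (k+1) x (k+1) core matrix whose SVD yields the updated factors.
K = np.zeros((k + 1, k + 1))
K[:k, :k] = np.diag(s)
K[:k, k] = p
K[k, k] = rho
Uk, s_new, _ = np.linalg.svd(K)
U_new = np.hstack([U, (r / rho)[:, None]]) @ Uk    # updated left basis
# Truncating U_new, s_new back to rank k bounds memory as the stream grows.
```

Because only the small core matrix is decomposed, the update cost depends on $k$ and $d$, not on how many features the bank has accumulated.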
6. Experimental Design and Protocols
Key experimental details for LFB-based models include:
- Datasets: AVA (spatio-temporal action labels per person at 1 FPS across 15-minute videos), EPIC-Kitchens (action recognition from egocentric viewpoint with verb–noun labels), and Charades (video-level multilabel action recognition).
- Performance metrics: mean Average Precision (mAP) over classes, and top-1/top-5 accuracy as appropriate.
- LFB window size and granularity: empirically, optimal temporal support occurs around 60 seconds (a window of 61 one-second clips), with diminishing returns beyond this.
- Padding and batching: all memory windows are zero-padded to a uniform size for efficient batch processing.
- All LFB computations (including FBO and classifier) are trained end-to-end except for the fixed backbone features in the LFB.
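The padding-and-batching protocol above can be sketched as follows; `pad_window` is a hypothetical helper that pads each variable-size window to a common length and returns a validity mask so padded entries can be ignored downstream.

```python
import numpy as np

def pad_window(features, max_n):
    """Pad an (n, d) feature array to (max_n, d); return features and mask."""
    n, d = features.shape
    padded = np.zeros((max_n, d), dtype=features.dtype)
    padded[:n] = features
    mask = np.zeros(max_n, dtype=bool)
    mask[:n] = True
    return padded, mask

# Three windows with 3, 5, and 2 detections, each with d = 8 features.
windows = [np.ones((3, 8)), np.ones((5, 8)), np.ones((2, 8))]
max_n = max(w.shape[0] for w in windows)
batch = np.stack([pad_window(w, max_n)[0] for w in windows])  # (3, 5, 8)
masks = np.stack([pad_window(w, max_n)[1] for w in windows])  # (3, 5)
```

The mask lets an attention operator exclude padded rows, e.g., by setting their logits to a large negative value before the softmax.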
7. Context, Adoption, and Limitations
The LFB has become foundational in video recognition, enabling models to reason over much longer temporal horizons than would be possible with memory- or compute-constrained backbones alone. This design extends the horizon of deep action detectors to minutes of context, supporting nuanced spatio-temporal queries and disambiguating temporally extended actions.
Limitations include the cost of storing and retrieving high-dimensional feature banks for large datasets or long videos, as well as the scalability of attention mechanisms with increasing context length. Compressing the memory (e.g., via low-rank SVD) is effective in alleviating these constraints. A persistent limitation is the reliance on accurate region proposals from person/object detectors, with representational quality of the LFB potentially bottlenecked by detection recall or the diversity of backbone features (Wu et al., 2018, Ntinou et al., 2024).
A plausible implication is that as architectures for video understanding continue to mature, the integration of persistent, scalable feature banks—potentially combined with online or streaming memory mechanisms—will remain a core strategy for leveraging context at scale.