The paper "Two-Stream Transformer Architecture for Long Video Understanding" addresses foundational challenges in applying pure vision-transformer architectures to long-video classification and action recognition. Transformers excel at short video tasks thanks to their ability to capture intricate long-range dependencies; however, the quadratic complexity of self-attention, combined with their weak inductive bias, renders them resource-intensive and inefficient when extended to longer videos.
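To see why this quadratic cost bites for long video, note that self-attention over n tokens builds an n-by-n score matrix, so doubling the frame count quadruples memory and compute. The sketch below illustrates this in PyTorch; the token counts (196 ViT-style patch tokens per frame, 1 fps sampling) are illustrative assumptions, not figures from the paper:

```python
import torch

def attention_scores(tokens: torch.Tensor) -> torch.Tensor:
    """Naive self-attention scores: an (n, n) matrix for n tokens."""
    d = tokens.size(-1)
    scores = tokens @ tokens.transpose(-2, -1) / d ** 0.5
    return scores.softmax(dim=-1)

# Assumed setup: 196 patch tokens per frame (ViT-style).
# Compare a 16-frame clip with a 2-minute clip sampled at 1 fps.
for frames in (16, 120):
    n = frames * 196
    print(f"{frames} frames -> {n} tokens -> {n * n:,} score entries")

x = torch.randn(1, 16 * 196, 64)   # only the small case is cheap to run
print(attention_scores(x).shape)   # torch.Size([1, 3136, 3136])
```

The 120-frame case alone implies a score matrix with over half a billion entries per head, which is why joint space-time attention over long clips quickly exhausts a single GPU.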
Recognizing these limitations, the authors introduce an innovative Spatio-Temporal Attention Network (STAN), employing a two-stream transformer architecture. This new architecture is designed to efficiently model dependencies between static spatial features from individual frames and dynamic temporal context across frames. Here’s a detailed breakdown of their approach:
- Two-Stream Transformer Architecture: The proposed architecture processes spatial and temporal features in separate transformer streams. The spatial stream captures static image features within individual frames, while the temporal stream models the evolving context across frames; a minimal sketch of this factorization follows this list.
- Efficiency and Scalability: STAN can handle videos of up to two minutes in length on a single GPU. This efficiency stems from both the architectural factorization and optimized processing techniques that mitigate the high memory and computational demands typically associated with transformer models.
- Data Efficiency: One of STAN's standout features is its data efficiency. This is critical because training transformers generally requires substantial amounts of data, a bottleneck in many practical applications, particularly for video, where large annotated datasets are costly to collect.
- State-of-the-Art Performance: The authors demonstrate that their approach not only addresses the scalability and efficiency issues but also achieves state-of-the-art (SOTA) performance on several long video understanding tasks. This highlights the practical significance and potential impact of their work in advancing video understanding capabilities.
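The paper's exact layer configuration is not reproduced here, but the two-stream factorization the list describes can be sketched in a few lines of PyTorch. Every name and size below (`TwoStreamSketch`, `d_model`, the 400-class head, the mean-pooling) is an illustrative assumption rather than the authors' code: a spatial transformer attends over the S patch tokens within each frame, the per-frame embeddings are pooled, and a temporal transformer attends over the T frames. Attention cost then falls from O((T·S)²) for joint space-time attention to roughly O(T·S²) + O(T²):

```python
import torch
import torch.nn as nn

class TwoStreamSketch(nn.Module):
    """Illustrative two-stream factorization (not the authors' code).

    Spatial stream: self-attention over the S patch tokens of each frame.
    Temporal stream: self-attention over the T pooled frame embeddings.
    Joint space-time attention costs O((T*S)^2); this costs O(T*S^2 + T^2).
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.spatial = nn.TransformerEncoder(make_layer(), n_layers)
        self.temporal = nn.TransformerEncoder(make_layer(), n_layers)
        self.head = nn.Linear(d_model, 400)  # e.g. 400 action classes (assumed)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, S, D) per-frame patch tokens, e.g. from an image backbone
        B, T, S, D = tokens.shape
        spatial = self.spatial(tokens.reshape(B * T, S, D))  # within-frame attention
        frames = spatial.mean(dim=1).reshape(B, T, D)        # pool each frame
        temporal = self.temporal(frames)                     # across-frame attention
        return self.head(temporal.mean(dim=1))               # clip-level logits

model = TwoStreamSketch()
clip = torch.randn(2, 120, 196, 256)  # 2 clips, 120 frames, 196 tokens/frame
print(model(clip).shape)              # torch.Size([2, 400])
```

Because the spatial stream treats frames independently, per-frame features can in principle be precomputed and cached, which is one natural route to the single-GPU, two-minute-video regime the paper targets, though the authors' actual training recipe may differ.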
In sum, the paper presents a significant advance in video processing by tackling the inherent limitations of conventional transformers with a novel two-stream architecture that delivers both improved efficiency and stronger performance on long video tasks.