Multiview Transformers for Video Recognition
The paper "Multiview Transformers for Video Recognition" presents an approach to video understanding built on transformer architectures. It introduces the Multiview Transformer (MTV), which models a video at several spatiotemporal resolutions, called views, processing each view with its own transformer encoder. The key innovation lies in the lateral connections between these encoders, which fuse information across views and improve the modeling of complex temporal dynamics in video data.
Model Architecture
MTV builds upon the ViViT model by introducing multiview tokenization: each view embeds the video with tubelets of a different temporal duration, forming a multiscale representation. The resulting views are processed by transformer encoders of varying capacities matched to their view sizes, with lateral connections between the encoders to fuse information efficiently. In contrast to previous pyramid-based approaches, which build multiscale features by progressively subsampling, MTV processes all scales in parallel directly from the input.
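As a concrete illustration, here is a minimal sketch of multiview tubelet tokenization in PyTorch. The tubelet lengths, hidden sizes, and patch size are illustrative choices rather than the paper's exact configuration; the point is that each view embeds the same video with a different temporal extent, so coarser views yield fewer tokens.

```python
import torch
import torch.nn as nn

class MultiviewTokenizer(nn.Module):
    """Sketch of multiview tubelet tokenization (illustrative values).

    Each view embeds the video with tubelets of a different temporal
    length: the finest view uses short tubelets (many tokens), coarser
    views use longer tubelets (fewer tokens). Hidden sizes are chosen
    per view; MTV pairs them with encoders of matching capacity.
    """

    def __init__(self, tubelet_lengths=(2, 4, 8), hidden_dims=(768, 384, 192),
                 patch_size=16, in_channels=3):
        super().__init__()
        self.embeds = nn.ModuleList([
            nn.Conv3d(in_channels, dim,
                      kernel_size=(t, patch_size, patch_size),
                      stride=(t, patch_size, patch_size))
            for t, dim in zip(tubelet_lengths, hidden_dims)
        ])

    def forward(self, video):                           # video: (B, C, T, H, W)
        views = []
        for embed in self.embeds:
            tokens = embed(video)                       # (B, D, T', H', W')
            tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_tokens, D)
            views.append(tokens)
        return views                                    # one token sequence per view


video = torch.randn(1, 3, 32, 224, 224)
for i, view in enumerate(MultiviewTokenizer()(video)):
    print(f"view {i}: {tuple(view.shape)}")  # token count shrinks as tubelets grow
```

In MTV, each of these token sequences is then fed to its own transformer encoder, with capacity chosen per view as described above.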
Cross-View Fusion
Several methods for cross-view fusion within the transformer architecture are compared, and Cross-View Attention (CVA), in which tokens of one view attend to the tokens of a neighboring view, proves particularly effective. CVA transfers information between views at different resolutions, yielding an efficient architecture that retains fine-grained temporal detail while processing the views in parallel.
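Below is a minimal sketch of a cross-view attention block, assuming a simple formulation in which the target view's tokens act as queries and a neighboring view's tokens, projected to the target dimension, supply the keys and values. The exact placement of projections and normalization layers, and the direction in which views are fused, are simplified relative to the paper.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Sketch of cross-view attention between two views (simplified).

    Tokens of the target view attend to tokens of a neighboring view.
    Because the views may use different hidden sizes, the other view's
    tokens are first projected into the target view's dimension.
    """

    def __init__(self, dim_target, dim_other, num_heads=8):
        super().__init__()
        self.proj_other = nn.Linear(dim_other, dim_target)
        self.norm_target = nn.LayerNorm(dim_target)
        self.norm_other = nn.LayerNorm(dim_target)
        self.attn = nn.MultiheadAttention(dim_target, num_heads, batch_first=True)

    def forward(self, x_target, x_other):
        # x_target: (B, N_t, dim_target), x_other: (B, N_o, dim_other)
        other = self.norm_other(self.proj_other(x_other))
        query = self.norm_target(x_target)
        fused, _ = self.attn(query, other, other)  # queries from the target view,
                                                   # keys/values from the other view
        return x_target + fused                    # residual connection


fine = torch.randn(1, 3136, 768)     # finest view: many tokens, wide encoder
coarse = torch.randn(1, 784, 192)    # coarsest view: few tokens, narrow encoder
out = CrossViewAttention(dim_target=768, dim_other=192)(fine, coarse)
print(out.shape)  # torch.Size([1, 3136, 768])
```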
Experimental Validation
Extensive experiments are conducted on Kinetics-400, -600, and -700, Moments in Time, Epic-Kitchens-100, and Something-Something V2. MTV achieves state-of-the-art accuracy across these datasets and offers better accuracy/computation trade-offs than existing methods such as ViViT and SlowFast; on Kinetics-400, for instance, it outperforms these baselines while offering substantial gains in computational efficiency.
Results and Implications
MTV consistently improves both accuracy and efficiency, and it scales well from "Small" to "Huge" model variants. Notably, performance improves further with large-scale pretraining on datasets such as JFT and Weak Textual Supervision (WTS).
This research has significant implications for video recognition. MTV's ability to model and reason over multiple temporal resolutions makes it particularly useful for applications involving complex, dynamic scenes, opening avenues for more sophisticated video understanding systems in fields such as autonomous driving, surveillance, and human-computer interaction.
Future Directions
While the results are impressive, the paper notes potential limitations and future research directions, including reducing the reliance on large-scale pretraining and extending the multiview idea to other multiscale transformer architectures such as MViT and Swin. These directions promise further gains in both the efficiency and the applicability of multiview video models in real-world settings.