Video Prediction Transformers without Recurrence or Convolution (2410.04733v3)

Published 7 Oct 2024 in cs.CV

Abstract: Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released at https://github.com/yyyujintang/PredFormer.

Summary

  • The paper demonstrates that PredFormer significantly reduces MSE by 51.3% on Moving MNIST, 33.1% on TaxiBJ, and 11.1% on WeatherBench compared to previous methods.
  • The paper introduces innovative Gated Transformer blocks with full, factorized, and interleaved spatial-temporal attention for improved dynamic modeling.
  • The paper shows that its transformer-based design enhances scalability and efficiency, boosting FPS from 533 to 2364 on TaxiBJ and from 196 to 404 on WeatherBench.

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

Spatial-temporal predictive learning has evolved considerably with the introduction of PredFormer, a pure transformer-based framework designed to address the constraints of existing models. PredFormer leverages the flexibility and scalability of transformers, overcoming the limitations of both recurrent models and recurrent-free CNN-based approaches.

Methodological Innovations

PredFormer is built from Gated Transformer blocks inspired by the Vision Transformer (ViT), and the paper provides an extensive analysis of 3D attention mechanisms in this setting, including full, factorized, and interleaved spatial-temporal attention. These variants allow the framework to model complex spatial and temporal dynamics efficiently without relying on recurrent structures or the inductive biases of CNNs, which are known to hinder scalability and generalization.
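To make the distinction concrete, the sketch below illustrates the factorized variant in minimal NumPy: spatial attention mixes tokens within each frame, then temporal attention mixes each spatial location across frames. This is a conceptual sketch, not the authors' implementation; the single-head attention uses identity Q/K/V projections for brevity, and the gating and MLP components of the actual Gated Transformer blocks are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over x: (n_tokens, d).
    # Identity Q/K/V projections, for illustration only.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def factorized_st_attention(tokens):
    # tokens: (T, S, d) -- T frames, S spatial patches per frame, d channels.
    T, S, _ = tokens.shape
    # Spatial step: patches within each frame attend to one another.
    spatial = np.stack([self_attention(tokens[t]) for t in range(T)])
    # Temporal step: each spatial location attends across all frames.
    temporal = np.stack([self_attention(spatial[:, s]) for s in range(S)], axis=1)
    return temporal

# Toy input: 4 frames, 16 patches per frame, 8 channels.
x = np.random.default_rng(0).normal(size=(4, 16, 8))
y = factorized_st_attention(x)
print(y.shape)  # (4, 16, 8)
```

The efficiency motivation is visible in the attention cost: full 3D attention scores all T·S tokens jointly, giving (T·S)² pairwise interactions, while the factorized form costs only T·S² + S·T² and interleaved designs alternate such steps across blocks.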

Experimental Results

PredFormer establishes new performance benchmarks across several datasets, including Moving MNIST, TaxiBJ, and WeatherBench, with substantial gains in both accuracy and efficiency over predecessors such as SimVP and TAU. On Moving MNIST, PredFormer achieved a notable 51.3% MSE reduction relative to SimVP. On TaxiBJ, it reduced MSE by 33.1% while raising FPS from 533 to 2364, and on WeatherBench it decreased MSE by 11.1% while improving FPS from 196 to 404.

Implications and Future Applications

The potential applications of PredFormer are far-reaching, extending to real-world tasks such as weather forecasting and traffic flow prediction. Its robust performance signals a meaningful shift toward transformer-based models for spatial-temporal prediction, and its adaptability across the varied spatial and temporal resolutions of different datasets sets the stage for further refinement in capturing complex dependencies, a critical aspect of predictive learning tasks.

Conclusion

PredFormer introduces a significant advancement in spatial-temporal predictive learning, providing an efficient and scalable transformer-based solution that sets new benchmarks in accuracy and computational performance. Its design paves the way for future studies in AI that can explore even more expansive applications, driving forward the capabilities of predictive modeling frameworks in diverse, real-world scenarios.
