SimVP: Simpler yet Better Video Prediction
This paper introduces SimVP, a video prediction model that challenges conventional wisdom in the field by prioritizing simplicity without sacrificing performance. Historically, video prediction models have relied on increasingly complex architectures: recurrent neural networks (RNNs), CNNs paired with RNNs, and more recently Vision Transformers (ViTs). These methods often require intricate designs and advanced training strategies to handle the spatial and temporal complexities inherent in video data.
Overview and Contribution
SimVP is built entirely from convolutional neural networks (CNNs) and is trained with a plain mean squared error (MSE) loss. The model consists of a spatial encoder, a temporal translator, and a spatial decoder, all convolutional. The goal is to determine whether such a simple architecture can yield state-of-the-art results across standard video prediction benchmarks. SimVP captures both spatial and temporal features with CNNs alone, without additional tricks or convoluted training schemes.
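The overall structure can be summarized in a few lines of PyTorch. The sketch below is illustrative rather than the authors' implementation: the real SimVP uses strided convolutions with normalization in the encoder and decoder and Inception-like grouped-convolution blocks in the translator, whereas plain Conv2d layers with arbitrary widths stand in here.

```python
import torch
import torch.nn as nn

class SimVPSketch(nn.Module):
    """Minimal encoder-translator-decoder sketch in the spirit of SimVP.

    Hypothetical simplification: the actual model uses strided convolutions
    with normalization in the encoder/decoder and Inception-like grouped
    convolutions in the translator; plain Conv2d blocks stand in here.
    """

    def __init__(self, in_frames=10, channels=1, hid=64):
        super().__init__()
        self.in_frames, self.hid = in_frames, hid
        # Spatial encoder, applied to each frame independently.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hid, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(hid, hid, 3, padding=1), nn.GELU(),
        )
        # Temporal translator: frames are stacked along the channel axis,
        # so ordinary 2D convolutions mix information across time.
        self.translator = nn.Sequential(
            nn.Conv2d(in_frames * hid, in_frames * hid, 3, padding=1), nn.GELU(),
            nn.Conv2d(in_frames * hid, in_frames * hid, 3, padding=1), nn.GELU(),
        )
        # Spatial decoder, upsampling each frame back to the input size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(hid, channels, 3, padding=1),
        )

    def forward(self, x):  # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        z = self.encoder(x.reshape(B * T, C, H, W))
        _, _, h, w = z.shape
        z = self.translator(z.reshape(B, T * self.hid, h, w))
        y = self.decoder(z.reshape(B * T, self.hid, h, w))
        return y.reshape(B, T, C, H, W)
```

With 10 past frames of 64x64 pixels, a single forward pass returns 10 predicted frames of the same size; SimVP trains this mapping end to end with MSE alone (the paper also allows the number of output frames to differ from the inputs, a detail this sketch omits).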
Comparison with Existing Methods
The paper categorizes existing video prediction models into four primary architectural designs:
- RNN-RNN-RNN: Stacks recurrent layers for both spatial and temporal modeling, as in PredRNN and its variants.
- CNN-RNN-CNN: Combines CNN encoders and decoders with RNNs for temporal dynamics, as used in CrevNet and E3D-LSTM.
- CNN-ViT-CNN: Introduces transformers within a CNN framework to model video sequences.
- CNN-CNN-CNN: Relies solely on CNNs for both spatial and temporal modeling, a design that had been largely overlooked in favor of more complex models before SimVP.
SimVP falls into the last category (CNN-CNN-CNN) and aims to match, with plain convolutions, what other models achieve through greater architectural complexity; a minimal training sketch follows below.
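Consistent with that positioning, training reduces to a plain supervised regression loop with an MSE objective: no scheduled sampling, adversarial terms, or curriculum. The loop below assumes the hypothetical SimVPSketch module from the earlier sketch and uses random tensors in place of a real data loader.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in data: batches of (past, future) clips of shape
# (B, T, C, H, W); a real setup would use a DataLoader over Moving MNIST etc.
loader = [(torch.rand(2, 10, 1, 64, 64), torch.rand(2, 10, 1, 64, 64))
          for _ in range(4)]

model = SimVPSketch(in_frames=10, channels=1, hid=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for past, future in loader:
    pred = model(past)                # predict all future frames in one shot
    loss = F.mse_loss(pred, future)   # the only training signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"loss: {loss.item():.4f}")
```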
Results
SimVP demonstrates competitive performance across five benchmark datasets while significantly reducing training cost, an advantage that facilitates scaling to more complex datasets and scenarios. The paper's quantitative results show that SimVP matches or outperforms several state-of-the-art models in MSE while maintaining a favorable computational footprint. For instance, on Moving MNIST, SimVP achieves an MSE of 23.8 and an SSIM of 0.948, putting it on par with strong models such as CrevNet and PhyDNet.
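For context on how such numbers are typically obtained: on Moving MNIST, MSE is conventionally summed over the pixels of each frame and then averaged over frames and samples, while SSIM is averaged per frame. The helpers below are a sketch under that assumption, using scikit-image's structural_similarity; individual papers may differ in their exact protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_mse(pred, true):
    """MSE summed over pixels per frame, then averaged over frames/samples.

    Assumes the common Moving MNIST convention with pixel values in [0, 1];
    treat as illustrative. pred, true: arrays of shape (N, T, H, W).
    """
    return float(((pred - true) ** 2).sum(axis=(2, 3)).mean())

def frame_ssim(pred, true):
    """Mean per-frame SSIM, computed with scikit-image."""
    scores = [
        structural_similarity(t, p, data_range=1.0)
        for ps, ts in zip(pred, true)   # iterate over samples
        for p, t in zip(ps, ts)         # iterate over frames
    ]
    return float(np.mean(scores))

# Toy usage with random data in place of real predictions.
pred = np.random.rand(2, 10, 64, 64)
true = np.random.rand(2, 10, 64, 64)
print(frame_mse(pred, true), frame_ssim(pred, true))
```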
Implications and Future Direction
The simplification of video prediction architectures demonstrated by SimVP has notable implications. First, it questions the necessity of ever more complex models and sets a precedent for weighing performance gains against the complexity that buys them. Researchers may be encouraged to revisit CNN-based methods with renewed interest, focusing on efficiency and scalability rather than architectural novelty alone.
On the theoretical side, the result may spur renewed exploration of CNNs' capacity to capture temporal dependencies, possibly driving the design of CNN variants that extract spatio-temporal features more effectively.
Conclusion
SimVP serves as a potent reminder of the overlooked potential of simple architectures in video prediction. It provides a valuable baseline that may inform future research, emphasizing clarity, ease of use, and a solid foundation for high-quality prediction without excessive modeling complexity. The paper encourages the academic community to reexamine assumptions about how much complexity effective video prediction actually requires, potentially stimulating more diverse exploration in model design.