
Fast Fourier Inception Networks for Occluded Video Prediction (2306.10346v1)

Published 17 Jun 2023 in cs.CV

Abstract: Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous complex motions, such as object overlapping and scene occlusion, which pose great challenges to this task. Previous works either fail to capture long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which comprise two primary components, i.e., the occlusion inpainter and the spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, so that missing (occluded) areas with complex geometric structures can be filled in by the inpainter. The latter employs stacked Fourier transform inception modules to learn temporal evolution via group convolutions and spatial movement via channel-wise Fourier convolutions, capturing both local and global spatiotemporal features. This encourages the generation of more realistic, high-quality future frames. To optimize the model, a recovery loss is added to the objective, i.e., minimizing the mean squared error between the ground-truth frame and the recovered frame. Quantitative and qualitative experiments on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach. Our code is available on GitHub.
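The two ideas the abstract leans on, a Fourier-domain convolution that gives an image-wide receptive field in one layer, and an MSE recovery loss on the inpainted frame, can be illustrated with a short sketch. The following is a minimal PyTorch sketch under our own assumptions (the module name SpectralConv2d, a 1x1 convolution over concatenated real/imaginary channels), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Pointwise convolution applied in the Fourier domain.

    Because every frequency coefficient depends on all pixels, a 1x1
    convolution on the spectrum acts globally on the image, which is the
    core idea behind fast-Fourier-convolution-style inpainting.
    (Illustrative sketch, not the paper's exact block.)
    """
    def __init__(self, channels: int):
        super().__init__()
        # The real FFT of a (C, H, W) map is complex; stacking real and
        # imaginary parts gives 2*C real-valued channels to convolve.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        _, _, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")            # (B, C, H, W//2+1), complex
        freq = torch.cat([freq.real, freq.imag], dim=1)    # (B, 2C, H, W//2+1)
        freq = self.freq_conv(freq)
        real, imag = freq.chunk(2, dim=1)
        # Back to the spatial domain at the original resolution.
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

# Recovery loss as described in the abstract: mean squared error between
# the frame recovered by the inpainter and the ground-truth frame.
recovery_loss = nn.MSELoss()
```

A usage example: given an occluded frame passed through an inpainter built from such spectral blocks, the recovery term would be `recovery_loss(recovered_frame, ground_truth_frame)` added to the prediction objective.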
