Insights from Generative Modeling for Neural Video Compression (2107.13136v2)
Abstract: While recent machine learning research has revealed connections between deep generative models such as VAEs and rate-distortion losses used in learned compression, most of this work has focused on images. In a similar spirit, we view recently proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling. We present these codecs as instances of a generalized stochastic temporal autoregressive transform, and propose new avenues for further improvements inspired by normalizing flows and structured priors. We propose several architectures that yield state-of-the-art video compression performance on high-resolution video and discuss their tradeoffs and ablations. In particular, we propose (i) improved temporal autoregressive transforms, (ii) improved entropy models with structured and temporal dependencies, and (iii) variable bitrate versions of our algorithms. Since our improvements are compatible with a large class of existing models, we provide further evidence that the generative modeling viewpoint can advance the neural video coding field.
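The "generalized stochastic temporal autoregressive transform" mentioned in the abstract can be illustrated with a minimal sketch: the current frame is normalized by a shift and scale predicted from previous frames (in the spirit of autoregressive normalizing flows), and only the normalized residual is passed on to be coded. All function and variable names below are illustrative stand-ins, not the paper's actual architecture; quantization and entropy coding of the residual are omitted.

```python
import numpy as np

def encode_frame(x_t, x_prev, predict_shift, predict_scale):
    """Temporal autoregressive transform (sketch): normalize the current
    frame by a shift/scale predicted from the previous frame. The
    predict_* arguments stand in for learned networks."""
    mu = predict_shift(x_prev)       # predicted frame (shift term)
    sigma = predict_scale(x_prev)    # per-pixel scale, assumed > 0
    y_t = (x_t - mu) / sigma         # normalized residual to be coded
    return y_t

def decode_frame(y_t, x_prev, predict_shift, predict_scale):
    """Invert the transform: x_t = sigma * y_t + mu."""
    mu = predict_shift(x_prev)
    sigma = predict_scale(x_prev)
    return sigma * y_t + mu

# Toy stand-ins for the learned networks: copy the previous frame as the
# prediction, with a constant scale. Real codecs use deep networks here.
shift = lambda x: x
scale = lambda x: np.full_like(x, 2.0)

x_prev = np.random.rand(4, 4)
x_t = x_prev + 0.1 * np.random.rand(4, 4)
y = encode_frame(x_t, x_prev, shift, scale)
x_rec = decode_frame(y, x_prev, shift, scale)
assert np.allclose(x_rec, x_t)  # invertible (no quantization in this sketch)
```

With an identity shift, this reduces to classic residual (predictive) coding; learning both the shift and the scale is what connects these codecs to flow-based generative models.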
- Ruihan Yang
- Yibo Yang
- Joseph Marino
- Stephan Mandt