NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines (2401.10761v1)
Abstract: Recent progress in artificial intelligence has led to an ever-increasing use of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, the compression, storage and transmission of media have traditionally been designed with human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs. These offer high performance on image coding, but they are not yet on par with state-of-the-art conventional video codecs and they lack interoperability. The second direction uses the Versatile Video Coding (VVC) standard, or any other conventional video codec (CVC), together with pre- and post-processing operations that target machine analysis. CVC-based methods benefit from interoperability and broad hardware and software support, but their machine task performance is often lower than desired, particularly at low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieved Bjøntegaard Delta rate (BD-rate) reductions of up to 43.20% and 26.8% over VVC for image and video data, respectively, when evaluated on multiple datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks.
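The BD-rate figures above summarize the average bitrate difference between two rate-performance curves at equal quality. A negative value means bit savings. The function below is a minimal sketch of the classic Bjøntegaard calculation. It fits log-rate as a cubic polynomial of quality and averages the gap over the quality range both curves share. Several details are assumptions, not the paper's evaluation code: the function name `bd_rate`, the cubic-polynomial interpolation, and the use of a task metric such as detection mAP as the "quality" axis. The paper's exact interpolation and metrics may differ.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Hypothetical helper: Bjøntegaard Delta rate of `test` vs. `anchor`, in percent.

    Assumes the classic cubic-polynomial fit of log(rate) as a function of
    quality (e.g. PSNR, or a machine-task metric such as mAP). Each curve
    needs at least 4 rate-quality points. Negative result = bitrate savings.
    """
    log_r_a = np.log(np.asarray(rate_anchor, dtype=float))
    log_r_t = np.log(np.asarray(rate_test, dtype=float))
    q_a = np.asarray(quality_anchor, dtype=float)
    q_t = np.asarray(quality_test, dtype=float)

    # Fit log-rate as a cubic polynomial of quality for each codec.
    p_a = np.polyfit(q_a, log_r_a, 3)
    p_t = np.polyfit(q_t, log_r_t, 3)

    # Compare the curves only over the quality interval both of them cover.
    q_min = max(q_a.min(), q_t.min())
    q_max = min(q_a.max(), q_t.max())
    if q_max <= q_min:
        raise ValueError("rate-quality curves do not overlap in quality")

    # Integrate each fitted polynomial over the shared interval.
    int_a = np.polyint(p_a)
    int_t = np.polyint(p_t)
    area_a = np.polyval(int_a, q_max) - np.polyval(int_a, q_min)
    area_t = np.polyval(int_t, q_max) - np.polyval(int_t, q_min)

    # Average log-rate difference, converted back to a percentage rate change.
    avg_log_diff = (area_t - area_a) / (q_max - q_min)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

For example, suppose a test codec reaches the same quality values as the anchor at exactly half the bitrate. Then `bd_rate` returns approximately -50.0. Under this sketch, a BD-rate of -43.20% therefore means roughly 43% fewer bits, on average, for the same task performance.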
Authors:
- Jukka I. Ahonen
- Nam Le
- Honglei Zhang
- Antti Hallapuro
- Francesco Cricri
- Hamed Rezazadegan Tavakoli
- Miska M. Hannuksela
- Esa Rahtu