MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression (2312.15829v2)

Published 25 Dec 2023 in eess.IV

Abstract: Conditional coding has lately emerged as the mainstream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when an information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than the intra frame. Recognizing that this assumption does not always hold, due to disocclusion or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask that forms a pixel-adaptive hybrid of conditional coding and conditional residual coding. We further introduce a Transformer-based conditional autoencoder and investigate several strategies for conditioning a Transformer-based autoencoder for inter-frame coding, a topic that remains largely under-explored. Additionally, we propose a channel transform module (CTM) that decorrelates the image latents along the channel dimension, with the aim of letting a simple hyperprior approach the compression performance of a channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) over both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT achieves BD-rate results comparable to VTM-17.0 under the low-delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.
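To make the two central ideas concrete, below is a minimal PyTorch sketch of pixel-adaptive masked conditional residual coding and a stand-in for the channel transform module. This is not the authors' implementation: the module names, layer sizes, and the codec interface are all assumptions made for illustration. A soft mask m in [0, 1] is predicted per pixel; m near 1 recovers conditional residual coding, m near 0 recovers plain conditional coding.

```python
# Illustrative sketch only; names and interfaces are hypothetical.
import torch
import torch.nn as nn

class SoftMaskNet(nn.Module):
    """Predicts a per-pixel soft mask m in [0, 1] that blends conditional
    coding (m -> 0) with conditional residual coding (m -> 1)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x_pred: torch.Tensor) -> torch.Tensor:
        return self.net(x_pred)

class ChannelTransform(nn.Module):
    """Stand-in for the paper's CTM: a learned 1x1 convolution that mixes
    latent channels, aiming to decorrelate them so that a simple
    hyperprior can model them well."""
    def __init__(self, latent_channels: int = 192):
        super().__init__()
        self.mix = nn.Conv2d(latent_channels, latent_channels, kernel_size=1)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.mix(y)

def masked_conditional_residual_step(x_t, x_pred, codec, mask_net):
    """One inter-frame coding step: code x_t - m * x_pred conditioned on
    x_pred, then add m * x_pred back on the decoder side."""
    m = mask_net(x_pred)                      # per-pixel blend weight
    coded = codec(x_t - m * x_pred, x_pred)   # conditional autoencoder (assumed interface)
    return coded + m * x_pred                 # reconstructed frame

if __name__ == "__main__":
    x_t = torch.rand(1, 3, 64, 64)       # current frame
    x_pred = torch.rand(1, 3, 64, 64)    # motion-compensated prediction
    identity_codec = lambda x, cond: x   # placeholder for the learned codec
    x_hat = masked_conditional_residual_step(x_t, x_pred, identity_codec, SoftMaskNet())
    print(x_hat.shape)                   # torch.Size([1, 3, 64, 64])
    ctm = ChannelTransform()
    print(ctm(torch.rand(1, 192, 4, 4)).shape)  # torch.Size([1, 192, 4, 4])
```

Note the design intent suggested by the abstract: because the mask is soft and learned end-to-end, the model can fall back to conditional coding exactly where the residual assumption fails (disoccluded regions, poor motion estimates) rather than committing to one coding mode globally.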
