
Extreme Video Compression with Pre-trained Diffusion Models (2402.08934v1)

Published 14 Feb 2024 in eess.IV and cs.CV

Abstract: Diffusion models have achieved remarkable success in generating high quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression leveraging the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neural compressed frames and generates subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is sequentially encoded to achieve a visually pleasing reconstruction, considering perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Frechet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: https://github.com/ElesionKyrie/Extreme-Video-Compression-With-Prediction-Using-Pre-trainded-Diffusion-Models-

Extreme Video Compression Using Pre-trained Diffusion Models

The paper "Extreme Video Compression With Prediction Using Pre-trained Diffusion Models" presents an innovative approach to video compression by leveraging diffusion-based generative models at the decoder. The approach targets ultra-low bit-rate video reconstruction with high perceptual quality. This endeavor is motivated by the significant increase in video data transmission, particularly due to emerging technologies like augmented reality, virtual reality, and the metaverse. The proposed method introduces a significant shift from conventional video compression methods such as H.264 and H.265, which primarily rely on hand-engineered optical flow and motion compensation techniques.

The core strategy exploits the predictive capabilities of pre-trained diffusion models to minimize the bit rate required to transmit a video. Only a subset of frames is compressed with a neural image codec; the remaining frames are predicted at the decoder by a conditional diffusion model given the previously reconstructed frames. Whenever the perceptual quality of a predicted frame drops below a desired threshold, the next frame is encoded and transmitted to restart prediction, as sketched below. This yields bit rates as low as 0.02 bits per pixel (bpp), far below the operating range of traditional codecs.
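To make the control flow concrete, here is a minimal sketch of the sequential encoding loop, assuming placeholder callables for the components the paper describes: a neural image codec (`encode`/`decode`), the conditional diffusion predictor (`predict`), and a perceptual quality score (`quality`). None of these names come from the authors' code; the sketch illustrates the loop, not their implementation.

```python
def compress_video(frames, encode, decode, predict, quality,
                   threshold, context_len=2):
    """Return (frame index, bitstring) pairs for the frames actually sent.

    Frames the decoder-side diffusion model can predict with acceptable
    perceptual quality are skipped; when predicted quality falls below
    `threshold`, the next frame is encoded to restart prediction.
    """
    bitstream = []   # neurally compressed frames that are transmitted
    context = []     # decoder-visible frames conditioning the predictor

    for t, frame in enumerate(frames):
        if len(context) < context_len:
            # Bootstrap: the first few frames are always transmitted.
            code = encode(frame)
            bitstream.append((t, code))
            context.append(decode(code))
            continue

        predicted = predict(context)  # "free" frame: nothing is transmitted
        if quality(predicted, frame) >= threshold:
            context = context[1:] + [predicted]
        else:
            # Prediction has degraded: encode this frame to restart prediction.
            code = encode(frame)
            bitstream.append((t, code))
            context = context[1:] + [decode(code)]

    return bitstream
```

Note that deciding which frames can be skipped requires the encoder to run the same diffusion predictor as the decoder, which is exactly the encoding-side computational cost discussed below. For scale, a 0.02 bpp budget allots a single 1920×1080 frame about 1920 × 1080 × 0.02 ≈ 41,500 bits, i.e. roughly 5 KB.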

The experimental evaluation spans multiple datasets, including Stochastic Moving MNIST (SMMNIST), Cityscapes, and the Ultra Video Group (UVG) dataset. Compared to standard codecs such as H.264 and H.265 in the low-bpp regime, the method achieves superior scores on perceptual metrics such as LPIPS and FVD while maintaining comparable PSNR. On content with significant motion and texture variation in particular, it produces more visually pleasing reconstructions than traditional codecs despite operating at ultra-low bit rates.
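The perceptual metrics themselves are standard and easy to reproduce. As an illustration, LPIPS can be computed with its reference implementation (`pip install lpips`); the random tensors below are stand-ins for a decoded frame and its ground truth, not the paper's evaluation pipeline.

```python
import torch
import lpips

# AlexNet-backbone LPIPS, the variant recommended by the metric's authors.
loss_fn = lpips.LPIPS(net='alex')

# LPIPS expects RGB tensors of shape (N, 3, H, W) scaled to [-1, 1].
reconstructed = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for a decoded frame
ground_truth  = torch.rand(1, 3, 256, 256) * 2 - 1  # stand-in for the original

distance = loss_fn(reconstructed, ground_truth)  # lower = perceptually closer
print(distance.item())
```

FVD is computed analogously at the video level, comparing feature statistics of generated and real clips extracted by a pretrained video network (an I3D backbone in the original FVD formulation).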

From a theoretical standpoint, this approach underscores the potential of generative AI in optimizing data transmission by exploiting spatial and temporal video coherence without explicit motion estimation. Practically, this methodology can substantially reduce the bandwidth requirements for video streaming and broadcasting, making it highly relevant for applications involving immersive media technologies. One limitation, however, is the increased computational demand at the encoding stage due to the additional video prediction processing. Future work could explore more efficient prediction models to offset this complexity.

In conclusion, the paper offers significant insight into how pre-trained diffusion models can revolutionize video compression, aligning with broader trends in AI-driven methodologies for multimedia processing. The potential for further advancements in this domain is considerable, with implications for improved efficiency in broadcasting, streaming services, and storage solutions. The exploration of alternative or improved generative architectures may further enhance the performance, ensuring scalability and adaptability to various practical scenarios.

Authors (6)
  1. Bohan Li (87 papers)
  2. Yiming Liu (53 papers)
  3. Xueyan Niu (15 papers)
  4. Bo Bai (71 papers)
  5. Lei Deng (81 papers)
  6. Deniz Gündüz (144 papers)