Perceptual Image Compression with Cooperative Cross-Modal Side Information (2311.13847v2)
Abstract: The explosion of data has resulted in more and more associated text being transmitted along with images. Inspired by distributed source coding, many works utilize image side information to enhance image compression. However, existing methods generally do not consider using text as side information to enhance perceptual compression of images, even though the benefits of multimodal synergy have been widely demonstrated in research. This raises the following question: how can we effectively transfer text-level semantic dependencies, which are available only at the decoder, to help image compression? In this work, we propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features. This is done by predicting a semantic mask to guide the learned text-adaptive affine transformation at the pixel level. Furthermore, we design a text-conditional generative adversarial network to improve the perceptual quality of reconstructed images. Extensive experiments involving four datasets and ten image quality assessment metrics demonstrate that the proposed approach achieves superior results in terms of rate-perception trade-off and semantic distortion.
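To make the fusion step in the abstract concrete, the following is a minimal NumPy sketch of a mask-guided, text-adaptive affine transformation. It is an illustration under stated assumptions, not the paper's architecture: the function name `text_adaptive_affine`, the weight names `w_mask`, `w_gamma`, `w_beta`, the single linear mask head, and the residual form of the modulation are all hypothetical, and the text embedding is a random stand-in for what a CLIP text encoder would produce.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def text_adaptive_affine(feat, text_emb, w_mask, w_gamma, w_beta):
    """Modulate image features with text-derived scale and shift, gated per pixel.

    feat:     (C, H, W) image feature map
    text_emb: (D,) sentence embedding (assumed to come from a text encoder)
    w_mask:   (C,) weights of a hypothetical linear mask head
    w_gamma:  (C, D) weights mapping text to per-channel scale
    w_beta:   (C, D) weights mapping text to per-channel shift
    """
    # Semantic mask in (0, 1): one value per spatial location, predicted
    # from the image features (stands in for a small convolutional head).
    mask = sigmoid(np.tensordot(w_mask, feat, axes=([0], [0])))  # (H, W)

    # Text-adaptive affine parameters, one scale and shift per channel.
    gamma = w_gamma @ text_emb  # (C,)
    beta = w_beta @ text_emb    # (C,)

    # Pixel-level modulation: the mask decides where the text takes effect.
    # The residual "1 +" form is an assumption chosen so that a zero text
    # embedding leaves the features unchanged.
    out = feat * (1.0 + mask[None] * gamma[:, None, None]) \
        + mask[None] * beta[:, None, None]
    return out, mask

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
text_emb = rng.standard_normal(16)
w_mask = rng.standard_normal(4)
w_gamma = 0.1 * rng.standard_normal((4, 16))
w_beta = 0.1 * rng.standard_normal((4, 16))

fused, mask = text_adaptive_affine(feat, text_emb, w_mask, w_gamma, w_beta)
print(fused.shape, mask.shape)
```

In this sketch the mask depends only on the image features while the scale and shift depend only on the text, so locations where the mask is near zero pass through almost unchanged.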
- Shiyu Qin
- Bin Chen
- Yujun Huang
- Baoyi An
- Tao Dai
- Shu-Tao Xia