Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity (2403.02944v2)
Abstract: Recent advances in text-guided image compression have shown great potential to enhance the perceptual quality of reconstructed images. These methods, however, tend to significantly degrade pixel-wise fidelity, which limits their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly through text-adaptive encoding and training with a joint image-text loss. By doing so, we avoid decoding based on text-guided generative models (known for high generative diversity) and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method can achieve high pixel-level and perceptual quality, with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with room for further improvement when more carefully generated captions are used.
- Hagyeong Lee
- Minkyu Kim
- Jun-Hyuk Kim
- Seungeon Kim
- Dokwan Oh
- Jaeho Lee