Exploring Cross-Modality Image Compression Using Generative Methods
The paper "Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields" presents a novel approach to the longstanding challenge of image compression by leveraging the advancements in artificial intelligence-generated content (AIGC) models, specifically focusing on OpenAI's GPT-4o's image generation capabilities. This research introduces the concept of replacing traditional pixel-level image compression techniques with a generative approach that predominantly uses descriptors like text to recreate images, thereby maintaining structure and semantic integrity at ultra-low bitrates.
Background and Problem Statement
Image compression methods fall largely into two domains, traditional image coding and learned image coding, both of which reduce spatial redundancy through pixel-wise transform coding. These conventional methods face significant limitations at ultra-low bitrates, where they heavily compromise semantic integrity and perceptual quality. Recent advances in AIGC models suggest a paradigm shift: instead of storing or transmitting image data in its entirety, an image could be synthesized from compact descriptors.
Methodology
The paper investigates two compression paradigms: textual coding and multimodal coding (a combination of text and an extremely low-resolution image). In both paradigms, the pixel-level information needed for high-quality reconstruction is not compressed at all; it is synthesized at decode time by GPT-4o's generative capabilities, conditioned on the compact descriptors (a minimal encoder sketch follows below). The key challenge in this approach is ensuring semantic and structural consistency between the original and generated images.
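To make the two paradigms concrete, here is a minimal encoder sketch in Python. The function names, the 32x32 thumbnail size, and the use of zlib for the text payload are illustrative assumptions rather than the paper's implementation; the decoder would hand these descriptors to GPT-4o as generation conditions.

    import zlib
    from PIL import Image

    def textual_encode(caption: str) -> bytes:
        # Textual coding: the bitstream is just a losslessly compressed
        # caption; all pixel detail is left to the generator at decode time.
        return zlib.compress(caption.encode("utf-8"))

    def multimodal_encode(caption: str, image: Image.Image,
                          thumb_size: tuple[int, int] = (32, 32)):
        # Multimodal coding: a caption plus an extremely low-resolution
        # thumbnail that anchors spatial layout and color during generation.
        thumb = image.resize(thumb_size, Image.Resampling.LANCZOS)
        return zlib.compress(caption.encode("utf-8")), thumb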
To address this, the authors introduce a structural raster-scan prompt engineering mechanism for translating images into textual descriptions, which then serve as conditions in GPT-4o's generative process. The mechanism describes an image's elements in a fixed raster-scan order (left to right, top to bottom) so that the resulting text preserves spatial and visual consistency.
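The paper's exact prompt is not reproduced here; the template below is a hypothetical illustration of how such an ordering constraint might be phrased.

    # Hypothetical prompt -- illustrative wording, not the paper's actual
    # prompt. The point is to fix a raster-scan traversal so that the
    # description preserves the spatial layout of the scene.
    RASTER_SCAN_PROMPT = (
        "Describe this image region by region in raster-scan order: "
        "start at the top-left, sweep left to right across each row, "
        "then move down to the next row. For each region, name the "
        "salient objects and state their colors, sizes, and positions "
        "relative to neighboring regions."
    )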
Results
Experiments show that the approach can compress images to bitrates as low as 0.001 bits per pixel (bpp) while preserving perceptual quality and semantic accuracy, surpassing many existing multimodal and generative compression methods at such bitrates. Comparisons with state-of-the-art methods such as MS-ILLM, PICS, and PerCo show the generative method leading across a range of perceptual and consistency metrics.
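To put 0.001 bpp in perspective, a back-of-the-envelope calculation (the 1024x1024 resolution is an illustrative assumption, not a figure from the paper) shows the entire bit budget amounts to roughly 131 bytes, about one compressed sentence, which is why the payload must carry semantics rather than pixels.

    # Illustrative budget check; the resolution is assumed, not taken
    # from the paper.
    width, height = 1024, 1024
    bpp = 0.001
    budget_bits = bpp * width * height   # 1048.576 bits
    budget_bytes = budget_bits / 8       # ~131 bytes
    print(f"Budget at {bpp} bpp: {budget_bytes:.0f} bytes")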
Implications and Future Directions
The research sets a precedent for applying cross-modality generative models to compression, showing that data storage and transmission needs can be reduced substantially without sacrificing perceived image quality. This shift toward generation-based compression leverages the sophisticated multimodal abilities of large models like GPT-4o.
Practically, such techniques could enable more efficient data storage and improve low-bandwidth communication systems, with significant implications for industries that handle high volumes of image data, such as telecommunications and digital media.
The theoretical contribution lies in extending generative models beyond standard data-synthesis tasks to a complex, real-world problem, reframing compression as conditional generation.
Future research could explore improving compatibility with other generative models without retraining, applying similar approaches to video compression, and addressing the latency challenges of real-time generation in dynamic environments. Further developments could integrate additional conditioning modalities to refine the generator's outputs for use cases requiring fine-grained detail, supporting broader adoption across diverse technological frameworks.