Generating Images with Sparse Representations
The present work introduces a novel approach to generating images using sparse representations based on the Discrete Cosine Transform (DCT), departing fundamentally from prior approaches that model images directly at the pixel level. The authors propose a Transformer-based autoregressive model, termed "DCTransformer," that tackles the high dimensionality and complexity challenges of generative image modeling.
Methodological Overview
The model leverages sparse representations by converting images into sequences of DCT triples (DCT channel, spatial position, and quantized coefficient value). This approach parallels traditional image-compression techniques such as JPEG, in which images are decomposed into frequency components for compact storage and efficient manipulation. The novelty lies in applying these compression principles to deep generative models, exploiting the redundancy of natural images to reduce the computational resources required.
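To make the representation concrete, the sketch below converts a grayscale image into such triples using a blockwise 2D DCT followed by uniform quantization, keeping only non-zero coefficients. This is a minimal illustration of the idea, not the paper's exact pipeline: the 8x8 block size, the quantization step, and the raster ordering are assumptions here, and the paper's quantization and sequence ordering may differ.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """2D type-II DCT with orthonormal scaling, applied to one image block."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def to_sparse_triples(image, block_size=8, q_step=16.0):
    """Convert a grayscale image (H, W) into (channel, position, value) triples.

    - channel:  index of the DCT basis function within a block (0..63)
    - position: index of the block in raster order
    - value:    quantized coefficient; zeros are dropped, which is where
                the sparsity (and the sequence-length savings) comes from
    """
    h, w = image.shape
    triples = []
    for by in range(0, h, block_size):
        for bx in range(0, w, block_size):
            block = image[by:by + block_size, bx:bx + block_size]
            coeffs = np.round(dct2(block) / q_step).astype(int)
            pos = (by // block_size) * (w // block_size) + (bx // block_size)
            for ch, v in enumerate(coeffs.flatten()):
                if v != 0:  # keep only non-zero coefficients
                    triples.append((ch, pos, v))
    return triples

# Example: a smooth 32x32 gradient compresses to far fewer than 1024 triples.
img = np.outer(np.linspace(0, 255, 32), np.ones(32))
print(len(to_sparse_triples(img)), "triples for", img.size, "pixels")
```

Smooth regions concentrate their energy in a handful of low-frequency coefficients, so the resulting sequence is much shorter than the pixel count.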
Model Architecture and Training
Central to the method is the DCTransformer, which predicts future sequence elements from previous ones over the sparsified sequence of DCT data. The sequence is modeled autoregressively, predicting channels, positions, and values in succession. The architecture features a chunked training mechanism to process long image sequences efficiently, scaling to higher resolutions without overwhelming memory resources.
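The chunking idea can be sketched as follows. This is a simplified illustration rather than the authors' exact mechanism: each training step targets one fixed-size chunk of the sequence, so per-step cost is bounded regardless of how long the full image sequence is. In the actual model the upstream context is compressed into a fixed-size representation; the sketch below simply keeps the raw prefix.

```python
import random

CHUNK_LEN = 512  # fixed target length, so per-step cost stays constant

def sample_training_chunk(seq, chunk_len=CHUNK_LEN):
    """Pick one fixed-size target chunk from a long triple sequence.

    The model is trained to predict this chunk given everything before it.
    Because chunk_len is fixed, the attention cost of each training step
    is bounded no matter how long the full sequence grows.
    """
    start = random.randrange(0, max(1, len(seq) - chunk_len + 1))
    context = seq[:start]                    # history before the chunk
    target = seq[start:start + chunk_len]    # prediction target
    return context, target
```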
Three distinct Transformer decoders are stacked hierarchically within DCTransformer, each dedicated to predicting one component of the triple: DCT channel, spatial location, and quantized DCT value. Because each training step operates on a fixed-size chunk, memory and computational demands remain constant regardless of total sequence length.
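The factorization behind the three decoders is the chain rule over a triple: p(c, l, v | history) = p(c | h) · p(l | c, h) · p(v | c, l, h), where h summarizes the sequence so far. The schematic below shows this conditioning structure with hypothetical module names and dimensions; in the real architecture each stage is a full Transformer decoder over the chunk, not a single linear head.

```python
import torch
import torch.nn as nn

class TripleDecoder(nn.Module):
    """Schematic three-head decoder for one (channel, position, value) triple.

    Mirrors the chain-rule factorization
        p(c, l, v | history) = p(c | h) * p(l | c, h) * p(v | c, l, h),
    where h is a hidden summary of the sequence so far. Each stage
    conditions on the outcomes of the previous stages via embeddings.
    """
    def __init__(self, d_model=256, n_channels=64, n_positions=1024, n_values=256):
        super().__init__()
        self.channel_head = nn.Linear(d_model, n_channels)
        self.channel_emb = nn.Embedding(n_channels, d_model)
        self.position_head = nn.Linear(d_model, n_positions)
        self.position_emb = nn.Embedding(n_positions, d_model)
        self.value_head = nn.Linear(d_model, n_values)

    def forward(self, h, channel, position):
        # h: (batch, d_model) summary of the sequence history
        channel_logits = self.channel_head(h)
        h_c = h + self.channel_emb(channel)        # condition on sampled channel
        position_logits = self.position_head(h_c)
        h_cl = h_c + self.position_emb(position)   # condition on position too
        value_logits = self.value_head(h_cl)
        return channel_logits, position_logits, value_logits
```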
Experimental Results
The DCTransformer was evaluated against established models such as GANs and VQ-VAEs across several benchmarks. It demonstrated competitive sample diversity and image quality, though it trails GANs on some precision metrics. Notably, it achieves state-of-the-art spatial FID (sFID) scores on several datasets, underscoring its capacity to produce texturally rich and diverse samples.
Moreover, DCTransformer extends to auxiliary tasks such as image super-resolution and colorization, enabled by a configurable sequence ordering that separates luminance from chrominance, as sketched below.
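With a luminance-first ordering, colorization reduces to conditional continuation: the grayscale input fixes the prefix of luminance coefficients, and the model samples only the chrominance suffix. The sketch below assumes a hypothetical one-step sampler, model.sample_next, and an end-of-sequence signal; it illustrates the conditioning pattern rather than the paper's implementation.

```python
def colorize(model, luma_triples, max_steps=10_000):
    """Colorization as conditional continuation, assuming a luminance-first
    sequence ordering: the grayscale input fixes the Y-coefficient prefix,
    and the model samples only the chroma suffix.
    `model.sample_next(seq)` is a hypothetical one-step sampler.
    """
    seq = list(luma_triples)              # fixed prefix: Y coefficients of the input
    for _ in range(max_steps):
        triple = model.sample_next(seq)   # samples one (channel, position, value)
        if triple is None:                # hypothetical end-of-sequence signal
            break
        seq.append(triple)                # appended triples carry Cb/Cr content
    return seq
```

Super-resolution works analogously: conditioning on the low-frequency coefficients of an image lets the model sample the missing high-frequency detail.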
Implications and Future Directions
Modeling image data via frequency-based sparse representations could reshape the design philosophies underpinning generative models. Compressed representations may prove useful well beyond imagery, particularly in domains where data efficiency is paramount, such as audio and video processing.
While promising, the DCTransformer still requires substantial computational resources, especially at high resolutions, a limitation shared across the field. Future investigations may focus on refining sparse representation techniques and further improving model efficiency, balancing sample quality against computational cost.
In conclusion, by aligning deep learning methodologies with established data compression frameworks, this research opens avenues for generating high-quality images in a computationally feasible manner, merging historical data reduction insights with state-of-the-art AI techniques.