- The paper introduces ColTran, a transformer-based approach that achieves a state-of-the-art FID of 19.37 on image colorization.
- The methodology employs a three-step pipeline combining low-resolution autoregressive colorization, parallel upsampling, and spatial upsampling with conditional transformer layers.
- In human evaluations, the highest-rated of three generated colorings is preferred over the ground-truth image in more than 60% of cases, showcasing the method's aesthetic and technical effectiveness.
Overview of the Colorization Transformer Paper
The paper introduces the Colorization Transformer (ColTran), a method leveraging transformers for the task of image colorization. The emphasis is on producing diverse, high-fidelity colorized versions of grayscale images through an innovative application of self-attention mechanisms. ColTran's architecture is presented as a significant improvement over previous methods, both in terms of quantitative metrics and qualitative evaluation by human raters.
Methodology
ColTran operates through a three-step pipeline:
- Low-Resolution Autoregressive Colorization: The process begins with a conditional autoregressive transformer generating a coarse, low-resolution color version of the grayscale image. This stage uses a conditional variant of the Axial Transformer, which applies self-attention along the rows and columns of the image to keep full self-attention computationally tractable.
- Parallel Color Upsampling: After coarse colorization, a fully parallel network upsamples the coarse colors to finer color depth, predicting all pixels simultaneously rather than one at a time.
- Spatial Upsampling: Finally, another fully parallel network increases the resolution, resulting in a high-resolution colored image.
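The three-stage pipeline above can be sketched as a simple composition of functions. The sketch below is purely illustrative: the function names, shapes, and stand-in bodies (zeros, nearest-neighbor resizing) are assumptions for clarity, not the paper's implementation, in which each stage is a trained transformer network.

```python
import numpy as np

def autoregressive_colorizer(gray_lowres):
    """Stage 1 (stand-in): predict a coarse low-resolution colorization.

    In ColTran this is a conditional Axial Transformer that samples coarse
    colors pixel by pixel; here we return a dummy color image of the right
    shape for illustration only.
    """
    h, w = gray_lowres.shape
    return np.zeros((h, w, 3), dtype=np.uint8)

def color_upsampler(gray_lowres, coarse_color):
    """Stage 2 (stand-in): refine coarse colors to full color depth in parallel."""
    return coarse_color.astype(np.uint8)

def spatial_upsampler(gray_highres, lowres_color):
    """Stage 3 (stand-in): upsample the low-res color image to full resolution.

    Nearest-neighbor repetition serves as a placeholder for the parallel
    transformer upsampler.
    """
    scale_h = gray_highres.shape[0] // lowres_color.shape[0]
    scale_w = gray_highres.shape[1] // lowres_color.shape[1]
    return np.repeat(np.repeat(lowres_color, scale_h, axis=0), scale_w, axis=1)

def colorize(gray_highres, low_size=64):
    # Downsample the grayscale input for stage 1 (simple striding as placeholder).
    stride = gray_highres.shape[0] // low_size
    gray_lowres = gray_highres[::stride, ::stride]
    coarse = autoregressive_colorizer(gray_lowres)
    fine = color_upsampler(gray_lowres, coarse)
    return spatial_upsampler(gray_highres, fine)
```

Decoupling the stages this way is the key design choice: only the small low-resolution stage pays the cost of autoregressive sampling, while both upsamplers run in a single parallel pass.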
Conditional transformer layers play a critical role in ColTran, introducing a novel way to inject conditioning information into every layer rather than only at the input. These components incorporate context at multiple levels of the network, enhancing the quality of generated images. Additionally, an auxiliary parallel prediction model is trained alongside the autoregressive colorizer; it acts as a regularizer and further improves performance.
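One concrete way such per-layer conditioning can work is a conditional layer normalization, where the conditioning embedding predicts a per-channel scale and shift. The sketch below is a minimal NumPy illustration of this general idea under assumed shapes and weight names (`w_scale`, `w_shift` are hypothetical); it is not the paper's exact formulation, which also conditions the self-attention components.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard layer normalization over the last (channel) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conditional_layer_norm(x, cond, w_scale, w_shift):
    """Modulate normalized activations with a conditioning embedding.

    x:       activations, shape (..., d)
    cond:    conditioning embedding, shape (d_cond,)
    w_scale: projection to per-channel scale, shape (d_cond, d)
    w_shift: projection to per-channel shift, shape (d_cond, d)
    """
    scale = cond @ w_scale  # predicted per-channel gain
    shift = cond @ w_shift  # predicted per-channel bias
    return layer_norm(x) * scale + shift
```

Because the scale and shift are recomputed from the conditioning signal at every layer, the grayscale context can steer the network throughout the depth of the model instead of fading after the first layer.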
Results
The authors report that the model's colorizations outperform the prior state-of-the-art on FID. Specifically, ColTran achieves an FID of 19.37, surpassing previous best results. Furthermore, a human evaluation study indicates that the highest-rated of three generated colorings is preferred by evaluators over the ground truth in more than 60% of cases, underscoring the method's capability to generate aesthetically pleasing and plausible colorizations.
Implications and Future Directions
ColTran's approach highlights the potential of transformers in tasks beyond natural language processing, extending effectively into image generation. The architecture sets a framework that may inspire further research into transformer-based solutions for other generative tasks. The method's ability to produce diverse outputs underscores its applicability in fields such as media restoration, virtual reality, and creative industries, where flexible colorization options are valued.
The introduction of conditional transformer layers could be explored in other conditional generative frameworks beyond colorization. The successful application of ColTran on ImageNet sets a precedent for its use across diverse datasets and domains, potentially shifting the paradigm for how high-dimensional visual data is processed and generated.
In terms of future development, focus areas could include optimizing the model for inference speed, adapting the architecture for even higher resolution tasks, and exploring its transferability to video colorization tasks, where temporal consistency must be maintained across frames.
Conclusion
The Colorization Transformer stands as an effective image colorization model, leveraging the power of conditional transformers and self-attention. It presents a robust framework not only for advancing colorization techniques but also for exploring transformer applications in other areas of computer vision and generative modeling.