Transformer for Image Quality Assessment

Published 30 Dec 2020 in cs.CV, cs.LG, and eess.IV | (2101.01097v2)

Abstract: Transformer has become the new standard method in NLP, and it also attracts research interests in computer vision area. In this paper we investigate the application of Transformer in Image Quality (TRIQ) assessment. Following the original Transformer encoder employed in Vision Transformer (ViT), we propose an architecture of using a shallow Transformer encoder on the top of a feature map extracted by convolution neural networks (CNN). Adaptive positional embedding is employed in the Transformer encoder to handle images with arbitrary resolutions. Different settings of Transformer architectures have been investigated on publicly available image quality databases. We have found that the proposed TRIQ architecture achieves outstanding performance. The implementation of TRIQ is published on Github (https://github.com/junyongyou/triq).




Summary

The paper proposes TRIQ, a novel architecture that integrates a shallow Transformer encoder with CNN-derived feature maps for enhanced image quality assessment.
It employs adaptive positional embeddings to preserve spatial information in images of varying resolutions without needing resizing.
Experiments on databases like KonIQ-10k and LIVE-wild demonstrate significant improvements in PLCC, SROCC, and RMSE metrics over traditional CNN models.

Transformer-based Image Quality Assessment

This paper applies Transformers, originally devised for NLP tasks, to Image Quality Assessment (IQA) and presents an architecture named TRIQ. It combines the Transformer attention mechanism with Convolutional Neural Networks (CNNs) to improve on IQA models that rely on CNNs alone.

Conceptual Framework

TRIQ places a shallow Transformer encoder on top of feature maps produced by a CNN, following the Transformer encoder design used in Vision Transformer (ViT). Its central proposal is an adaptive positional embedding, which lets the model handle images of varying resolutions without resizing them. Resizing is common practice in IQA, but it can itself degrade the quality being measured.
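One way to picture an adaptive positional embedding is a single table sized for the largest expected token count, from which each image uses only the rows it needs. The sketch below illustrates that idea in NumPy; the table size, the embedding width, and the function name `add_adaptive_positions` are assumptions made for illustration and are not taken from the TRIQ code base.

```python
import numpy as np

def add_adaptive_positions(tokens, pos_table):
    """Add the first len(tokens) rows of a fixed-size positional table to the tokens."""
    n_tokens, dim = tokens.shape
    if n_tokens > pos_table.shape[0]:
        raise ValueError("positional table is too short for this image")
    if dim != pos_table.shape[1]:
        raise ValueError("token width and table width differ")
    # Slicing the table lets one set of parameters serve every resolution.
    return tokens + pos_table[:n_tokens]

rng = np.random.default_rng(0)
MAX_TOKENS, DIM = 193, 32  # hypothetical sizes, chosen only for this example
pos_table = rng.normal(scale=0.02, size=(MAX_TOKENS, DIM))

# Two images of different resolution yield different token counts
# (spatial positions plus one extra token).
small = rng.normal(size=(6 * 8 + 1, DIM))
large = rng.normal(size=(12 * 16 + 1, DIM))

small_out = add_adaptive_positions(small, pos_table)
large_out = add_adaptive_positions(large, pos_table)
print(small_out.shape, large_out.shape)
```

In a trained model the table would be a learned parameter; here it is random so that the example stays self-contained.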

TRIQ differs from earlier IQA models, which typically pair a CNN backbone with a multi-layer perceptron (MLP) head. The paper posits that CNNs' inductive biases, which help extract relevant image features, can be complemented by a Transformer encoder's ability to capture long-range dependencies through self-attention, giving a more holistic view of image quality.
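To make the long-range argument concrete, the following is a minimal single-head scaled dot-product self-attention in NumPy. It is a generic sketch of the mechanism, not the TRIQ encoder: the weights are random, the dimensions are arbitrary, and multi-head splitting, residual connections, and layer normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (N, D) token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every token scores every other token, regardless of spatial distance.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 10, 16  # arbitrary sizes for illustration
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

attended, weights = self_attention(tokens, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each row of `weights` is a distribution over all tokens, so a region in one corner of the image can directly influence the representation of a region in the opposite corner, which a small convolution kernel cannot do in a single layer.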

Methodology

The design of TRIQ involves several key stages:

Feature Extraction: Utilizes ResNet50 as the CNN backbone to produce a feature map, which serves as the input to the Transformer encoder.
Transformer Encoder: The CNN features are projected and pooled, an extra learnable quality embedding (akin to the class token in BERT) is prepended, and positional embeddings are added; the encoder then turns these tokens into a contextual representation of image quality.
Positional Embeddings: Adaptive embeddings are employed to retain spatial information, critical for IQA tasks, especially when dealing with images of varying resolutions.
Aggregation and Prediction: An MLP head, following the Transformer encoder, outputs the perceived image quality, using softmax activation on the final layer to predict a distribution over quality grades.
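The stages above can be sketched end to end in NumPy, leaving out the CNN backbone, the positional embeddings, and the encoder itself. The sketch makes several assumptions that the summary does not confirm: the projection is treated as a per-position linear map, the pooling is 2x2 max pooling, the quality grades are the integers 1 to 5, and all names (`build_tokens`, `expected_score`) are hypothetical.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on an (H, W, C) map; odd edges are cropped."""
    h, w, c = fmap.shape
    h2, w2 = h // 2, w // 2
    cropped = fmap[:h2 * 2, :w2 * 2]
    return cropped.reshape(h2, 2, w2, 2, c).max(axis=(1, 3))

def build_tokens(fmap, w_proj, quality_token):
    """Project, pool, and flatten a feature map, then prepend the quality token."""
    projected = fmap @ w_proj  # per-position linear map, equivalent to a 1x1 convolution
    pooled = max_pool_2x2(projected)
    patches = pooled.reshape(-1, pooled.shape[-1])
    return np.vstack([quality_token[None, :], patches])

def expected_score(logits, grades=(1, 2, 3, 4, 5)):
    """Softmax over quality grades, then the mean grade under that distribution."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    return probs, float(probs @ np.asarray(grades, dtype=float))

rng = np.random.default_rng(0)
fmap = rng.normal(size=(12, 16, 64))   # stand-in for a CNN feature map
w_proj = rng.normal(scale=0.1, size=(64, 32))
quality_token = np.zeros(32)           # a learned parameter in a real model

tokens = build_tokens(fmap, w_proj, quality_token)
probs, score = expected_score(np.zeros(5))
print(tokens.shape, probs, score)
```

In a full model, the logits passed to `expected_score` would come from the MLP head applied to the encoder output at the quality-token position.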

Experimental Evaluation

The authors evaluated TRIQ on several image quality databases, including KonIQ-10k, LIVE-wild, and SPAQ. The reported PLCC, SROCC, and RMSE results compare favorably with other state-of-the-art models such as CaHDC and KonCept512. Specifically, TRIQ achieved a PLCC of 0.923 on KonIQ-10k and 0.848 on SPAQ, which indicates that it adapts well to different image contents and resolutions.

Moreover, TRIQ gave more consistent quality predictions across varying resolutions without resizing the input, whereas the need to resize is a common limitation of previous models. The results suggest that Transformers combined with CNN feature extraction can advance IQA by addressing variation in image content and resolution.

Implications and Future Directions

The findings suggest that the Transformer encoder's ability to capture complex dependencies in image data could lead to more accurate and generalizable IQA models. Attention mechanisms look promising for tasks where spatial variation affects model performance, which indicates that Transformers are useful beyond NLP and image recognition.

Future research might tune the Transformer settings further, try deeper architectures, and incorporate additional data modalities to improve robustness. Examining how attention is distributed during inference could also improve interpretability and show how well the model's focus matches human perception of quality.

This work paves the way for developing IQA models that can dynamically adapt to a broader spectrum of image resolutions and qualities, bridging a critical gap in computer vision applications leveraging attention-based architectures.
