An Overview of "No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency"
This paper addresses the complex problem of No-Reference Image Quality Assessment (NR-IQA), which aims to estimate perceptual image quality without a pristine reference image. The proposed method leverages a hybrid architecture that combines Convolutional Neural Networks (CNNs) with the self-attention mechanism of Transformers to improve the extraction of both local and non-local image features.
Methodology and Contributions
- Hybrid Model Design: The core of the proposed model uses CNNs to capture local image structures while employing Transformers to overcome the locality bias of CNNs by modeling non-local representations. This combination yields a more comprehensive feature extraction process that accommodates both low- and high-level image quality cues (a minimal architecture sketch follows this list).
- Relative Ranking: Recognizing the intrinsic ranking relationships among images within a batch, the paper introduces a relative ranking loss that enforces these relations, guiding the model so that its predictions respect the subjective ordering of image quality even when absolute scores are not predicted perfectly.
- Self-Consistency: The authors observe a degradation in model performance when equivariant transformations, such as horizontal flipping, are applied to inputs. To counter this, they propose a self-supervisory mechanism enforcing consistency between the quality assessments of original and transformed images. This approach aims to bolster model robustness by reducing prediction uncertainty due to such transformations.
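To make the hybrid design concrete, below is a minimal PyTorch-style sketch of one plausible arrangement: a CNN backbone produces a spatial feature map, a Transformer encoder attends over the flattened feature tokens to model non-local dependencies, and a small head regresses a scalar quality score. The module names, dimensions, and the ResNet-50 backbone are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class HybridIQANet(nn.Module):
    """Illustrative CNN + Transformer quality predictor (not the paper's exact architecture)."""

    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # CNN backbone: captures local structure and low-level distortion cues.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial map
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)        # project to token dim

        # Transformer encoder: models non-local interactions between feature tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Regression head: pooled tokens -> scalar quality score.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        feat = self.proj(self.cnn(x))                      # (B, d_model, H, W)
        tokens = feat.flatten(2).transpose(1, 2)           # (B, H*W, d_model)
        tokens = self.transformer(tokens)                  # self-attention over all positions
        return self.head(tokens.mean(dim=1)).squeeze(-1)   # (B,) predicted quality


# Usage: scores = HybridIQANet()(torch.randn(4, 3, 224, 224))
```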
The proposed model achieves state-of-the-art results on seven benchmark IQA datasets covering both synthetic and authentic distortions. Gains are particularly notable on large datasets such as LIVEFB and KADID, reflecting the model's scalability and robustness.
Detailed Insights
- Feature Extraction with CNNs and Transformers: The CNN handles localized feature detection, essential for identifying specific artifacts, while the Transformer captures global contextual information through self-attention across all feature scales. This architecture balances local detail and global context, both crucial for NR-IQA.
- Triplet Loss for Ranking: The paper introduces a triplet loss with adaptive margins derived from human quality scores, penalizing incorrect quality rankings within an image batch. This strategy tightens the alignment between model predictions and human visual assessments (see the loss sketch after this list).
- Self-Consistency as a Validation Strategy: The self-consistency mechanism not only aligns predictions for augmented views but also validates model stability against perceptually irrelevant perturbations, bringing the model's behavior closer to human perception (a consistency sketch also follows this list).
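To illustrate the ranking idea, the following is a minimal sketch of a margin-based hinge loss over a batch's predicted scores, where the margins come from the ground-truth score gaps between the highest- and lowest-quality images. The specific pairing and margin scheme here are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch


def relative_ranking_loss(pred, mos):
    """Illustrative margin-based ranking loss over batch extremes.

    pred: (B,) predicted quality scores
    mos:  (B,) ground-truth mean opinion scores
    Margins come from ground-truth score gaps (an assumption, not
    necessarily the paper's exact margin definition).
    """
    hi, lo = mos.argmax(), mos.argmin()        # best / worst images in the batch
    second_hi = mos.topk(2).indices[-1]        # second-best image
    second_lo = (-mos).topk(2).indices[-1]     # second-worst image

    # Adaptive margins: how far apart the subjective scores are.
    m_hi = (mos[hi] - mos[second_hi]).abs()
    m_lo = (mos[second_lo] - mos[lo]).abs()

    # Hinge terms: the best image should outscore the second-best by m_hi,
    # and the second-worst should outscore the worst by m_lo.
    loss_hi = torch.relu(pred[second_hi] - pred[hi] + m_hi)
    loss_lo = torch.relu(pred[lo] - pred[second_lo] + m_lo)
    return loss_hi + loss_lo
```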
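The self-consistency idea can be sketched in a few lines: the same network scores an image batch and its horizontally flipped counterpart, and a penalty discourages the two sets of predictions from diverging. Using an L1 penalty on the scores is an assumption made here for simplicity; the paper defines its own consistency objective.

```python
import torch
import torch.nn.functional as F


def self_consistency_loss(model, images):
    """Illustrative consistency penalty between an image and its horizontal flip.

    The L1 distance between the two predictions is an assumed stand-in for the
    paper's consistency objective; horizontal flipping matches the equivariant
    transformation discussed above.
    """
    pred = model(images)                                  # scores for the original batch
    pred_flipped = model(torch.flip(images, dims=[-1]))   # scores for flipped inputs
    # A perceptually irrelevant flip should not change the predicted quality.
    return F.l1_loss(pred, pred_flipped)
```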
Implications and Future Directions
The integration of Transformers into the NR-IQA domain opens new possibilities for image quality prediction by enabling long-range feature interactions that CNNs alone struggle to capture. This method could inspire further exploration of hybrid models across varied computer vision tasks.
The inclusion of relative ranking and self-consistency mechanisms highlights an emerging trend towards more holistic training paradigms that incorporate human-like evaluation strategies. Future research might explore expanding such strategies to accommodate additional transformations or multi-domain applications.
As NR-IQA tasks grow in complexity and importance, particularly for applications ranging from social media to autonomous vehicles, methodologies that leverage advanced neural architectures and innovative training strategies will likely become instrumental. This work presents a significant step in that direction, providing both a methodological framework and empirical validation on benchmark datasets.