- The paper introduces a simple yet powerful baseline model that leverages a ResNet-18 backbone with global average pooling for assessing image and video quality.
- It employs a hybrid loss function combining MAE and pair-wise ranking loss to enhance convergence and generalization.
- The model matches or outperforms state-of-the-art methods on diverse datasets, enabling scalable quality evaluation for UGC, PGC, and gaming videos.
Evaluation of a Unified Model for Image and Video Quality Assessment
The paper "A Strong Baseline for Image and Video Quality Assessment" presents the development and evaluation of a straightforward unified model designed for the perceptual quality assessment of images and videos. This research is crucial within the domain of image and video processing, where quality assessment serves as a cornerstone for numerous applications, including video compression, quality monitoring, and video recommendation systems. The proliferation of User-Generated Content (UGC) has prompted a need for effective quality assessment models that can operate efficiently at scale.
Overview of Methodology
The proposed model distinguishes itself through simplicity and efficiency, utilizing a global feature derived from a well-established backbone network, ResNet-18, instead of complex architectures or concatenated multi-branch features. This approach reduces computational demands while maintaining effective performance, evidenced by surpassing state-of-the-art baselines on both public and private datasets. Pre-trained models are released for three scenarios: UGC videos in the wild, Professionally Generated Content (PGC) videos with compression, and gaming videos with compression, facilitating both direct application and fine-tuning.
Technical Contributions
- Network Architecture: The model leverages a lightweight CNN architecture with a Global Average Pooling (GAP) mechanism to extract features, followed by fully connected layers to predict quality scores. Variants of the model cater to both full-reference (FR) and no-reference (NR) quality assessment, with distinct input processing in each scenario.
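The GAP-plus-FC head described above is simple enough to sketch numerically: the backbone's feature map is collapsed by a spatial mean into a single vector, which fully connected layers map to a scalar score. The NumPy sketch below illustrates this data flow on a random stand-in feature map; the layer sizes, weights, and two-layer head are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_average_pool(feat):
    """Collapse a (C, H, W) feature map into a C-dim vector via spatial mean."""
    return feat.mean(axis=(1, 2))

def fc_head(vec, w1, b1, w2, b2):
    """Two fully connected layers (ReLU between) mapping features to one score."""
    h = np.maximum(0.0, w1 @ vec + b1)
    return float(w2 @ h + b2)

# Illustrative sizes: ResNet-18's final conv stage yields 512 channels.
feat = rng.standard_normal((512, 7, 7))          # stand-in for backbone output
w1, b1 = rng.standard_normal((64, 512)) * 0.01, np.zeros(64)
w2, b2 = rng.standard_normal(64) * 0.01, 0.0

score = fc_head(global_average_pool(feat), w1, b1, w2, b2)
print(score)  # a single scalar quality score per image/frame
```

In the NR variant only the distorted input feeds the network; an FR variant would additionally extract features from the reference and compare the two, though the paper's exact FR fusion is not reproduced here.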
- Loss Function: A hybrid loss function combining Mean Absolute Error (MAE) and a pair-wise ranking loss is introduced to enhance model performance. The MAE term anchors absolute accuracy, while the ranking term penalizes predictions whose pairwise ordering contradicts the subjective quality scores, improving both convergence and generalization.
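A plain-Python sketch of such a hybrid objective is given below, assuming a margin-based hinge on every pair ordered inconsistently with the targets; the margin, weighting, and exact pair enumeration are illustrative choices, and the paper's precise formulation may differ.

```python
def mae_loss(preds, targets):
    """Mean absolute error between predicted and subjective scores."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def pairwise_ranking_loss(preds, targets, margin=0.5):
    """Hinge penalty on pairs whose predicted order contradicts the targets."""
    total, count = 0.0, 0
    for i in range(len(preds)):
        for j in range(len(preds)):
            if targets[i] > targets[j]:  # i should be predicted higher than j
                total += max(0.0, margin - (preds[i] - preds[j]))
                count += 1
    return total / count if count else 0.0

def hybrid_loss(preds, targets, rank_weight=1.0, margin=0.5):
    """MAE anchors absolute scores; the ranking term enforces ordering."""
    return mae_loss(preds, targets) + rank_weight * pairwise_ranking_loss(
        preds, targets, margin)
```

When predictions are perfectly ordered with gaps larger than the margin, the ranking term vanishes and only absolute error remains.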
- Training Techniques: Sophisticated training paradigms, including cosine annealing learning rate schedules, resizing with random cropping, and Stochastic Weight Averaging (SWA), are employed to mitigate overfitting and improve model robustness.
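Two of these techniques are easy to make concrete. Cosine annealing decays the learning rate along a half cosine from a maximum to a minimum over the schedule, and Stochastic Weight Averaging replaces the final weights with the mean of snapshots taken late in training. A minimal sketch follows; the schedule length and learning rates are illustrative, not the paper's settings.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Half-cosine decay: lr_max at step 0, lr_min at the final step."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

def swa_average(snapshots):
    """Element-wise mean of weight snapshots (each a flat list of floats)."""
    n = len(snapshots)
    return [sum(ws) / n for ws in zip(*snapshots)]

print(cosine_annealing_lr(0, 100))    # starts near lr_max
print(cosine_annealing_lr(100, 100))  # ends near lr_min
print(swa_average([[1.0, 2.0], [3.0, 4.0]]))
```

SWA tends to land the averaged weights in a flatter region of the loss surface, which is one reason it pairs well with the cyclic or annealed schedules mentioned above.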
Experimental Results
The model was evaluated on well-known datasets (LIVE-VQC, KoNViD-1K, YouTube-UGC) and private datasets specific to real-world scenarios. The results underscore its efficacy, with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-order Correlation Coefficient (SRCC) values rivaling or outperforming existing models, establishing the proposed model as a robust baseline in image/video quality assessment research.
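For readers unfamiliar with these metrics: PLCC measures how linearly predictions track subjective scores, while SRCC measures monotonic agreement and is simply the Pearson correlation of the rank-transformed sequences. A dependency-free sketch (standard definitions, not code from the paper):

```python
def pearson(x, y):
    """PLCC: linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """SRCC: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))
```

Note that a strictly monotonic but nonlinear prediction yields SRCC = 1 while PLCC falls below 1, which is why quality-assessment papers conventionally report both.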
Implications and Future Directions
The model's design aims to balance performance with reduced complexity, making it apt for industrial deployment where computational efficiency and rapid evaluation are paramount. The publicly available models provide a resource for further research, offering a foundation upon which new models can build or against which they can be compared.
In conclusion, this paper presents a significant contribution to the field of image and video quality assessment by establishing a strong and efficient baseline. Potential avenues for future work include extending the model to handle more diverse and complex video qualities, as well as integrating it with dynamic video streaming and content delivery platforms to enhance its real-time applicability. The release of the models encourages widespread adoption and adaptation, fostering advancements in quality assessment methodologies.