Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training (2011.04263v2)

Published 9 Nov 2020 in cs.CV, cs.MM, and eess.IV

Abstract: Video quality assessment (VQA) is an important problem in computer vision. The videos in computer vision applications are usually captured in the wild. We focus on automatically assessing the quality of in-the-wild videos, which is a challenging problem due to the absence of reference videos, the complexity of distortions, and the diversity of video contents. Moreover, the video contents and distortions among existing datasets are quite different, which leads to poor performance of data-driven methods in the cross-dataset evaluation setting. To improve the performance of quality assessment models, we borrow intuitions from human perception, specifically, content dependency and temporal-memory effects of human visual system. To face the cross-dataset evaluation challenge, we explore a mixed datasets training strategy for training a single VQA model with multiple datasets. The proposed unified framework explicitly includes three stages: relative quality assessor, nonlinear mapping, and dataset-specific perceptual scale alignment, to jointly predict relative quality, perceptual quality, and subjective quality. Experiments are conducted on four publicly available datasets for VQA in the wild, i.e., LIVE-VQC, LIVE-Qualcomm, KoNViD-1k, and CVD2014. The experimental results verify the effectiveness of the mixed datasets training strategy and prove the superior performance of the unified model in comparison with the state-of-the-art models. For reproducible research, we make the PyTorch implementation of our method available at https://github.com/lidq92/MDTVSFA.

Citations (125)

Summary

  • The paper introduces a unified framework for video quality assessment by leveraging mixed datasets training and aligning dataset-specific perceptual scales.
  • It employs a three-stage process—relative quality assessment, nonlinear mapping with a 4-parameter logistic function, and perceptual scale alignment—to mirror human visual perception.
  • Experimental results on key datasets show significant improvements in SROCC and PLCC compared to state-of-the-art models, highlighting enhanced prediction accuracy.

Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training: An Overview

This paper presents an approach to video quality assessment (VQA) in the wild, built on a unified framework that uses mixed datasets training to handle the varied, uncontrolled conditions under which such videos are captured. The authors address a significant gap in computer vision: how to evaluate video quality when reference videos are unavailable, distortions are complex, and video content is highly diverse.

The method incorporates principles from human perception, specifically content dependency and the temporal-memory effects of the human visual system. This matters because videos captured in real-world ('in-the-wild') conditions often exhibit a wide array of unpredictable distortions, such as motion blur, exposure problems, and noise.
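As a concrete illustration of the temporal-memory idea, the sketch below runs a recurrent unit over per-frame content features so that each instant's quality estimate can depend on what was seen before. This is a hedged, minimal sketch: the feature dimension, the GRU, and the mean pooling are assumptions for exposition, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalQualityHead(nn.Module):
    """Hedged sketch: a GRU aggregates per-frame content features
    (e.g., from a pretrained CNN) so each frame's quality estimate can
    depend on preceding frames, mimicking temporal-memory effects."""
    def __init__(self, feat_dim=4096, hidden_dim=32):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, frame_feats):           # (batch, time, feat_dim)
        h, _ = self.gru(frame_feats)          # per-frame hidden states
        frame_q = self.fc(h).squeeze(-1)      # (batch, time) frame scores
        return frame_q.mean(dim=1)            # pool to one score per video

# Example: 8 videos, 240 frames each, 4096-dim features per frame
scores = TemporalQualityHead()(torch.randn(8, 240, 4096))
```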

Framework Overview

The authors introduce a unified VQA framework comprising three distinct stages: relative quality assessment, nonlinear mapping, and dataset-specific perceptual scale alignment. Together, these stages enable the mixed datasets training strategy and improve model robustness across datasets.

  1. Relative Quality Assessment: This stage predicts relative quality, i.e., how videos rank against one another in perceived quality. This matters because humans are better at comparing visual quality than at scoring it on an absolute scale.
  2. Nonlinear Mapping: To account for the nonlinearity of human perception, this stage maps relative quality to perceptual quality with a 4-parameter logistic function, modeling the nonlinear response of viewers to varying quality levels that is commonly observed in perceptual evaluations.
  3. Dataset-Specific Perceptual Scale Alignment: Since subjective quality scores are not on a uniform scale across datasets, this stage aligns the predicted perceptual quality with each dataset's subjective scores, making the model's output comparable across datasets with different score ranges. A sketch of all three stages follows this list.
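To make these stages concrete, below is a minimal PyTorch sketch of the three components. It is an illustration under stated assumptions, not the authors' released code: the parameter names, the pairwise hinge formulation of the ranking loss, and the per-dataset linear rescaling are simplifications for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the three stages; parameter names and exact forms are
# illustrative assumptions, not the authors' released implementation.

class FourParamLogistic(nn.Module):
    """Stage 2: monotonic 4-parameter logistic mapping relative
    quality r to perceptual quality q."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))  # upper asymptote
        self.b = nn.Parameter(torch.tensor(0.0))  # lower asymptote
        self.c = nn.Parameter(torch.tensor(0.0))  # midpoint
        self.d = nn.Parameter(torch.tensor(1.0))  # slope, kept positive

    def forward(self, r):
        slope = self.d.abs().clamp(min=1e-6)
        return (self.a - self.b) / (1 + torch.exp(-(r - self.c) / slope)) + self.b

class DatasetScaleAlignment(nn.Module):
    """Stage 3: per-dataset linear rescaling of perceptual quality
    onto each dataset's subjective score range."""
    def __init__(self, num_datasets):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_datasets))
        self.beta = nn.Parameter(torch.zeros(num_datasets))

    def forward(self, q, dataset_idx):
        return self.alpha[dataset_idx] * q + self.beta[dataset_idx]

def pairwise_ranking_loss(pred, mos):
    """Stage 1 objective: penalize pairs of videos whose predicted
    order disagrees with the order of subjective scores (MOS)."""
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)  # predicted differences
    dm = mos.unsqueeze(0) - mos.unsqueeze(1)    # ground-truth differences
    return F.relu(-dp * torch.sign(dm)).mean()
```

In mixed datasets training, each mini-batch carries its videos' dataset indices: the ranking loss shapes the shared relative assessor, while the logistic mapping and the per-dataset (alpha, beta) pairs absorb scale differences, so a single model can be trained on, and evaluated across, all datasets.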

Experimental Results

The proposed model was evaluated on four public datasets: LIVE-VQC, LIVE-Qualcomm, KoNViD-1k, and CVD2014. The results demonstrate superior performance over existing state-of-the-art models, most notably in cross-dataset evaluation, where prior methods have lacked robustness. In particular, the model improves both Spearman's rank-order correlation coefficient (SROCC) and Pearson's linear correlation coefficient (PLCC), indicating better prediction monotonicity and accuracy, respectively.
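Both metrics are standard and straightforward to compute with SciPy; a small self-contained example with illustrative numbers (not results reported in the paper):

```python
from scipy.stats import spearmanr, pearsonr

# Illustrative scores only; not results from the paper.
predicted  = [3.1, 2.4, 4.6, 1.8, 3.9]
subjective = [3.0, 2.7, 4.5, 1.5, 4.1]

srocc, _ = spearmanr(predicted, subjective)  # rank agreement (monotonicity)
plcc, _  = pearsonr(predicted, subjective)   # linear agreement (accuracy)
print(f"SROCC={srocc:.3f}, PLCC={plcc:.3f}")
```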

Implications and Future Directions

Practically, this research lays the groundwork for more reliable quality assessment in applications such as video streaming, surveillance, and content creation, where input conditions are not controlled. Theoretically, it demonstrates a way to integrate perceptual attributes of human vision more deeply into machine learning models.

Looking forward, the research opens several avenues for development. Future work could explore integrating additional perceptual phenomena and enhancing model efficiency through lightweight network architectures. Moreover, there is potential for applying this unified framework to other domains in computer vision and beyond, where diverse datasets need a coherent evaluative approach.

The authors have provided a PyTorch implementation of their method for reproducible research, underscoring their commitment to advancing the field through open collaboration.