Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision (2505.03631v2)

Published 6 May 2025 in cs.CV

Abstract: Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, the reliance on manually annotated datasets -- a process that is labor-intensive, costly, and difficult to scale up -- has hindered further optimization of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic distortion simulations. Furthermore, we introduce a novel iterative self-improvement training strategy, where the trained model acts as an improved annotator to iteratively refine the annotation quality of training data. By training on a dataset 10× larger than the existing VQA benchmarks, our model: (1) achieves zero-shot performance on in-domain VQA benchmarks that matches or surpasses supervised models; (2) demonstrates superior out-of-distribution (OOD) generalization across diverse video content and distortions; and (3) sets a new state-of-the-art when fine-tuned on human-labeled datasets. Extensive experimental results validate the effectiveness of our self-supervised approach in training generalized VQA models. The datasets and code will be publicly released to facilitate future research.

Summary

The paper "Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision" introduces an approach to video quality assessment (VQA) that sidesteps the traditional bottleneck of manually annotating large-scale datasets. The researchers present a self-supervised learning framework that assesses video quality through pairwise comparisons, leveraging both synthesized and real-world data with automatically generated labels. The method aims to make VQA model training substantially more scalable and broadly applicable.

Methodology and Framework

The framework employs a learning-to-rank paradigm to train a Large Multimodal Model (LMM) on a substantial dataset comprising 700,000 video pairs. This dataset is constructed by sampling videos from various social media platforms, encompassing a diverse array of content categories and distortions. Two distinct strategies are utilized for automatic annotation: quality pseudo-labeling via existing VQA models and relative quality ranking based on synthetic distortion simulations, categorized into spatial, temporal, and streaming distortions.
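
To make the training signal concrete, the sketch below illustrates one of the two annotation routes in minimal form: a distorted counterpart of a video is synthesized (a simple blur stands in for the paper's spatial, temporal, and streaming distortion simulations), the pair is labeled automatically with the undistorted clip ranked higher, and a margin ranking loss trains a scorer on that ordering. The toy scorer, the blur, and the margin value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of ranking-based self-supervision for VQA.
# Assumptions (not from the paper): a tiny CNN scorer stands in for the
# LMM, and a frame-wise blur stands in for the spatial/temporal/streaming
# distortion simulations used to build automatically labeled pairs.

class ToyScorer(nn.Module):
    """Maps a video clip (B, T, C, H, W) to one scalar quality score per clip."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8, 1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape
        feats = F.relu(self.conv(clip.view(b * t, c, h, w)))
        pooled = feats.mean(dim=(2, 3)).view(b, t, -1).mean(dim=1)  # pool over space and time
        return self.fc(pooled).squeeze(-1)

def synthesize_distortion(clip: torch.Tensor) -> torch.Tensor:
    """Stand-in 'spatial distortion': blur each frame with an averaging filter."""
    b, t, c, h, w = clip.shape
    frames = clip.view(b * t, c, h, w)
    blurred = F.avg_pool2d(frames, kernel_size=5, stride=1, padding=2)
    return blurred.view(b, t, c, h, w)

model = ToyScorer()
rank_loss = nn.MarginRankingLoss(margin=0.1)  # margin value is illustrative

# Build an automatically labeled pair: the undistorted clip ranks higher.
clean = torch.rand(2, 4, 3, 32, 32)  # (batch, frames, C, H, W)
distorted = synthesize_distortion(clean)
target = torch.ones(2)               # +1: the first input should score higher

loss = rank_loss(model(clean), model(distorted), target)
loss.backward()
```

In the full framework the scorer is a large multimodal model, and the second annotation route instead labels pairs using pseudo-scores from existing VQA models; the margin loss above is one common way to realize a learning-to-rank objective, not necessarily the paper's exact formulation.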

The iterative self-improvement training strategy is a critical component of the framework. In each iteration, the trained model acts as an improved annotator, refining the labels of its own training pairs. This continuous self-enhancement significantly improves out-of-distribution (OOD) generalization across diverse video content types and distortions, sidestepping a limitation of traditional supervised VQA models, which often overfit due to a lack of training diversity.
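
Read as pseudocode, the loop alternates between training on the current pair labels and letting the freshly trained model relabel pairs where it is confident. The Python sketch below is one interpretation under stated assumptions: the confidence margin, the relabeling rule, and the toy trainer are illustrative and not taken from the paper.

```python
# Interpretation of the iterative self-improvement loop; the relabeling
# rule and confidence threshold below are assumptions, not paper details.
from typing import Callable, List, Tuple

Pair = Tuple[str, str, int]  # (video_a, video_b, label: +1 if a > b else -1)

CONFIDENCE_MARGIN = 0.2  # assumed confidence threshold

def refine_labels(score: Callable[[str], float], pairs: List[Pair]) -> List[Pair]:
    """Use the current model's scores to refine pair labels."""
    refined = []
    for a, b, label in pairs:
        gap = score(a) - score(b)
        if abs(gap) >= CONFIDENCE_MARGIN:
            # Trust the model's ordering when it is confident.
            refined.append((a, b, 1 if gap > 0 else -1))
        else:
            # Otherwise keep the original automatic annotation.
            refined.append((a, b, label))
    return refined

def iterative_self_improvement(
    train: Callable[[List[Pair]], Callable[[str], float]],
    pairs: List[Pair],
    rounds: int = 3,
) -> Callable[[str], float]:
    """Alternate between pairwise training and model-driven relabeling."""
    score = train(pairs)
    for _ in range(rounds - 1):
        pairs = refine_labels(score, pairs)  # the model acts as annotator
        score = train(pairs)                 # retrain on refined labels
    return score

# Toy usage: a "trainer" that just counts pairwise wins per video.
def toy_train(pairs: List[Pair]) -> Callable[[str], float]:
    wins: dict = {}
    for a, b, label in pairs:
        winner = a if label > 0 else b
        wins[winner] = wins.get(winner, 0) + 1.0
    return lambda v: wins.get(v, 0.0)

pairs = [("v1", "v2", 1), ("v2", "v3", 1), ("v1", "v3", -1)]
model_score = iterative_self_improvement(toy_train, pairs)
print(model_score("v1"), model_score("v2"), model_score("v3"))
```

In this sketch, low-confidence pairs keep their original automatic labels, so each round refines rather than replaces the initial supervision.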

Quantitative Results

The paper demonstrates that the self-supervised model's zero-shot performance on in-domain VQA benchmarks matches or surpasses that of existing supervised models, and that it achieves state-of-the-art results when fine-tuned on human-labeled datasets. On OOD benchmarks it outperforms top-tier supervised VQA methods, with the iterative self-improvement strategy contributing significant gains in assessing high frame-rate distortions and compression artifacts.

Implications and Future Directions

The implications of this research are twofold. Practically, it presents an avenue for developing scalable VQA models without expensive, labor-intensive human annotation. Theoretically, it reinforces the potential of self-supervised learning frameworks to enhance model resilience and adaptability to novel data distributions and distortion types, an essential criterion for future AI-driven applications in dynamic real-world environments.

Looking ahead, the paper suggests exploring automated video pair annotations using domain-specific VQA models, enhancing prompt engineering in LMMs, and leveraging text-to-video generation to further expand the dataset's diversity. Extending the framework to accommodate additional modalities such as images and audio could lead to the creation of generalized quality assessment models applicable across a broader spectrum of media formats.

In summary, this paper lays a foundational framework that could significantly influence ongoing developments in video processing workflows, offering a sustainable pathway towards optimizing end-user Quality of Experience (QoE). The planned public release of the datasets and code should foster further exploration and innovation in automated VQA.
