PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild (2405.17765v1)
Abstract: Video quality assessment (VQA) is a challenging problem because many factors can affect the perceptual quality of a video, e.g., content attractiveness, distortion type, motion pattern, and level. Moreover, annotating the mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets and poses a significant obstacle for deep learning-based methods. In this paper, we propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks, allowing VQA to benefit from different aspects of that knowledge. Specifically, we extract features of videos from different pretrained models with frozen weights and integrate them to generate a unified representation. Since these models possess diverse fields of knowledge and are often trained with labels irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility (ICID) loss to impose constraints on the features extracted by the multiple pretrained models. The intra-consistency constraint ensures that features extracted by different pretrained models lie in the same unified quality-aware latent space, while the inter-divisibility constraint introduces pseudo clusters based on the annotations of samples and tries to separate features of samples from different clusters. Furthermore, with a constantly growing number of pretrained models, it is crucial to determine which models to use and how to use them. To address this problem, we propose an efficient scheme to select suitable candidates: models with better clustering performance on VQA datasets are chosen. Extensive experiments demonstrate the effectiveness of the proposed method.
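To make the two ICID constraints concrete, here is a minimal NumPy sketch of how such a loss could be computed. This is an illustrative reconstruction from the abstract, not the paper's implementation: the feature dictionary, the cosine-similarity formulation of intra-consistency, and the MOS-binning with a hinge margin for inter-divisibility are all assumptions for exposition.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors (epsilon avoids division by zero).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def icid_loss(feats, mos, n_bins=5, margin=0.5):
    """Illustrative sketch of an Intra-Consistency and Inter-Divisibility loss.

    feats: dict mapping a pretrained-model name to an (N, D) array of
           per-video features (one row per video, frozen backbone outputs).
    mos:   (N,) array of MOS annotations; pseudo clusters are MOS bins.
    """
    models = list(feats.keys())
    n_videos = len(mos)

    # Intra-consistency: pull features of the SAME video, extracted by
    # DIFFERENT pretrained models, toward one shared quality-aware space.
    intra, n_pairs = 0.0, 0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            for n in range(n_videos):
                intra += 1.0 - cosine(feats[models[i]][n], feats[models[j]][n])
                n_pairs += 1
    intra /= max(n_pairs, 1)

    # Inter-divisibility: form pseudo clusters by binning MOS labels, then
    # push apart features of samples that fall into different clusters.
    edges = np.linspace(mos.min(), mos.max(), n_bins + 1)[1:-1]
    bins = np.digitize(mos, edges)
    inter, n_neg = 0.0, 0
    for m in models:
        f = feats[m]
        for a in range(n_videos):
            for b in range(a + 1, n_videos):
                if bins[a] != bins[b]:
                    # Hinge: penalize cross-cluster pairs that are too similar.
                    inter += max(0.0, cosine(f[a], f[b]) - margin)
                    n_neg += 1
    inter /= max(n_neg, 1)

    return intra + inter
```

Both terms are non-negative by construction (1 minus cosine similarity, and a hinge), so the loss is minimized when per-video features agree across backbones while samples from different MOS bins are dissimilar; a real implementation would compute this on GPU tensors within each training batch.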