Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective (2303.14968v1)

Published 27 Mar 2023 in cs.CV and cs.LG

Abstract: We aim at advancing blind image quality assessment (BIQA), which predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks, in a way that the model parameter sharing and the loss weighting are determined automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) using a textual template, and compute the joint probability from the cosine similarities of the visual-textual embeddings. Predictions of each task can be inferred from the joint distribution, and optimized by carefully designed loss functions. Through comprehensive experiments on learning three tasks - BIQA, scene classification, and distortion type identification, we verify that the proposed BIQA method 1) benefits from the scene classification and distortion type identification tasks and outperforms the state-of-the-art on multiple IQA datasets, 2) is more robust in the group maximum differentiation competition, and 3) realigns the quality annotations from different IQA datasets more effectively. The source code is available at https://github.com/zwx8981/LIQE.

Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective

The paper presents a novel approach to blind image quality assessment (BIQA) leveraging the synergy between vision and language. The authors propose a multitask learning paradigm to incorporate auxiliary tasks—scene classification and distortion type identification—into the BIQA framework. This multitask strategy is designed to enhance image quality prediction by utilizing complementary information from related tasks.

At the core of the proposed method is the integration of vision-language models, specifically CLIP, to establish a joint probability framework. The model is trained to predict multiple tasks by formulating candidate label combinations as textual descriptions processed by the language encoder. This approach elegantly encapsulates task-specific information in a shared embedding space, facilitating effective multitask learning.
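The label-combination idea can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming the open-source `clip` package and the five standard quality adjectives (bad/poor/fair/good/perfect): it builds one prompt per (scene, distortion, quality) combination, turns the cosine similarities into a joint distribution with a softmax, and marginalizes to obtain each task's prediction plus a scalar quality estimate. The scene and distortion vocabularies, the template wording, and the temperature are illustrative assumptions rather than the exact choices made in LIQE, and only the frozen-CLIP forward pass is shown, not the multitask training.

```python
# Minimal sketch of vision-language correspondence for multitask prediction.
# Assumes the openai "clip" package (pip install git+https://github.com/openai/CLIP.git).
# Label sets and prompt template are illustrative, not the exact ones from LIQE.
import itertools
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

scenes = ["animal", "human", "landscape"]            # assumed subset of scene labels
distortions = ["noise", "blur", "jpeg compression"]  # assumed subset of distortion types
levels = ["bad", "poor", "fair", "good", "perfect"]  # five quality levels

# One textual prompt per candidate (scene, distortion, quality) combination.
combos = list(itertools.product(scenes, distortions, levels))
prompts = [f"a photo of a {s} with {d} artifacts, which is of {q} quality"
           for s, d, q in combos]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text_tokens)

# Cosine similarities over all combinations -> joint distribution.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
logits = 100.0 * img_emb @ txt_emb.t()              # 100 approximates CLIP's logit scale
joint = logits.softmax(dim=-1).view(len(scenes), len(distortions), len(levels))

# Marginalize the joint distribution to obtain per-task predictions.
p_scene = joint.sum(dim=(1, 2))        # scene classification
p_distortion = joint.sum(dim=(0, 2))   # distortion type identification
p_quality = joint.sum(dim=(0, 1))      # distribution over the five quality levels

# Scalar quality estimate as the expectation over the quality levels (1..5).
quality_score = (p_quality * torch.arange(1.0, 6.0, device=device)).sum()
print(float(quality_score))
```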

Significant experimental results substantiate the merit of the approach. Across multiple IQA datasets, the model—denoted as LIQE—either outperforms or matches the state-of-the-art, achieving superior results particularly in cross-distortion scenarios. The authors demonstrate that LIQE's joint optimization not only benefits BIQA but also aligns mean opinion scores (MOS) from diverse datasets, promoting consistency across different perceptual scales.

Numerically, LIQE achieves strong SRCC and PLCC values, evidencing its robustness and predictive precision. For instance, on challenging datasets such as KADID-10k and KonIQ-10k, LIQE attains median SRCC results of 0.930 and 0.919, respectively, demonstrating its efficacy in handling both synthetic and realistic distortions. The authors also discuss the role of joint vision-language embeddings, showing how they help capture latent features useful for high-level quality assessment.

The paper also reports ablation studies demonstrating the significance of the multitask setup and of dynamic loss weighting. The analysis suggests that auxiliary tasks such as distortion type identification provide beneficial context that reinforces BIQA performance.
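To illustrate what automatic loss weighting can look like in practice, the sketch below implements homoscedastic uncertainty weighting (Kendall et al., 2018), a widely used recipe for combining multitask losses without hand-tuned coefficients. This is offered as an assumption-laden illustration of the general idea, not as the specific weighting scheme used in LIQE; the class name and placeholder losses are hypothetical.

```python
# Illustrative automatic loss weighting via homoscedastic task uncertainty
# (Kendall et al., 2018). Not necessarily the scheme used in LIQE.
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Combine per-task losses with learnable log-variances."""

    def __init__(self, num_tasks: int):
        super().__init__()
        # One learnable log-variance per task, initialized to zero.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses: list) -> torch.Tensor:
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            # Noisier tasks get down-weighted; the log-variance term regularizes.
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage: weight the BIQA, scene classification, and distortion losses.
weighting = UncertaintyWeighting(num_tasks=3)
loss_quality = torch.tensor(0.8)     # placeholder per-task losses
loss_scene = torch.tensor(1.2)
loss_distortion = torch.tensor(0.5)
total_loss = weighting([loss_quality, loss_scene, loss_distortion])
```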

The implications of this work are multifaceted: theoretically, it sets a precedent for utilizing multitask learning frameworks in BIQA by showcasing the seamless fusion of vision and language; practically, it introduces a compelling model, LIQE, capable of adapting to various distortions with enhanced robustness.

As AI continues to progress, the paper provides a substantive foundation for future research that could extend the set of auxiliary tasks or integrate more sophisticated language models. The use of vision-language correspondences hints at future developments in which machines could understand and critique visual information in a manner closer to human perception.

In conclusion, this research offers a comprehensive exploration of BIQA through multitask learning, highlighting the complementary strengths of leveraging vision-language models to augment the quality assessment process. The innovative methodology and promising results mark a significant contribution to computational vision and inspire further exploration into the potential of language in enhancing vision-based tasks.

Authors (5)
  1. Weixia Zhang (19 papers)
  2. Guangtao Zhai (230 papers)
  3. Ying Wei (80 papers)
  4. Xiaokang Yang (207 papers)
  5. Kede Ma (57 papers)
Citations (119)