Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels (2312.17090v1)

Published 28 Dec 2023 in cs.CV, cs.CL, and cs.LG

Abstract: The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.

Evaluating Visual Content: Text-Defined Levels in Multi-Modality Models

In the field of visual content evaluation, large multi-modality models (LMMs) have attracted substantial attention for their ability to bridge visual and natural language understanding. The paper "Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels" introduces an approach that improves the interpretability of machine-generated scores and their alignment with human preferences by training on discrete text-defined rating levels.

Methodology and Implementation

The central innovation of the paper is a shift from numerical scores to text-defined rating levels (excellent, good, fair, poor, and bad) as the training target for LMMs on image and video quality assessment tasks. This mirrors how human raters behave in subjective studies: they do not provide precise scores but instead assign items to qualitative levels. Leveraging this observation, Q-Align trains with text-defined level ratings rather than numbers, which matches the text-centric comprehension and generation that LMMs are built for, as sketched below.
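To make the idea concrete, the following minimal sketch shows what a level-based instruction-tuning sample could look like. The prompt wording, dictionary layout, and the build_training_pair helper are illustrative assumptions for this summary, not the exact format used in the released Q-Align code.

```python
# Hypothetical construction of a level-based training sample.
# The prompt text and dict layout are assumptions for illustration;
# the released Q-Align code may format conversations differently.
def build_training_pair(image_path: str, level: str) -> dict:
    return {
        "image": image_path,
        "conversation": [
            {"role": "user",
             "content": "Can you evaluate the quality of the image?"},
            {"role": "assistant",
             "content": f"The quality of the image is {level}."},
        ],
    }

print(build_training_pair("example.jpg", "good"))
```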

During training, the mean opinion scores (MOS) of existing datasets are converted into the five text levels by partitioning each dataset's score range into equally wide (equidistant) intervals. At inference, the model's logits over the five level tokens are passed through a softmax, and the predicted score is the probability-weighted average of the numeric values assigned to the levels. This closely mirrors how subjective studies collect level ratings from humans and then average them into scores, as illustrated in the sketch below.
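A minimal numerical sketch of both steps, assuming five equally wide MOS bins and the conventional values 5 through 1 for excellent through bad; the exact boundary handling and token bookkeeping in the released implementation may differ.

```python
import numpy as np

# The five text-defined levels, ordered from best to worst.
LEVELS = ["excellent", "good", "fair", "poor", "bad"]
LEVEL_SCORES = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mos_to_level(mos, mos_min, mos_max):
    """Training side: map a mean opinion score onto one of five equally
    wide intervals spanning the dataset's MOS range (equidistant partition)."""
    normalized = (mos - mos_min) / (mos_max - mos_min) * 5
    bin_index = min(int(normalized), 4)   # 0 = worst bin, 4 = best bin
    return LEVELS[4 - bin_index]          # reverse so bin 4 -> "excellent"

def levels_to_score(level_logits):
    """Inference side: softmax the logits of the five level tokens and take
    the probability-weighted average of their numeric values."""
    probs = np.exp(level_logits - np.max(level_logits))
    probs /= probs.sum()
    return float(sum(p * LEVEL_SCORES[l] for p, l in zip(probs, LEVELS)))

# Illustrative values only.
print(mos_to_level(72.0, mos_min=0.0, mos_max=100.0))            # -> "good"
print(levels_to_score(np.array([1.2, 2.5, 0.3, -1.0, -2.0])))    # ~ 4.0
```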

Experimental Evaluation

Across image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) datasets, Q-Align advances over state-of-the-art methods. It reaches competitive performance with a fraction of the annotated data and markedly improves generalization to unseen, out-of-distribution datasets. The paper reports results on prominent benchmarks such as KonIQ, SPAQ, and LSVQ, showing superior performance in both within-dataset and cross-dataset settings.

Implications and Future Directions

The implications of the discrete-level-based syllabus extend beyond scoring accuracy to improved generalization and data efficiency. The approach marks a shift in visual scoring toward emulating human judgment processes rather than mechanical score regression. Moreover, the successful unification of IQA, IAA, and VQA under a single LMM, termed OneAlign, points toward multitask models that need less task-specific tuning while remaining robust across diverse input domains.

Additionally, the paper highlights that the method can freely combine disparate datasets for training without performance degradation, an advantage over most current models, which must typically be trained within narrowly constrained dataset settings. This points toward versatile, general-purpose models that can handle a wide range of visual content evaluation scenarios with improved accuracy and consistency.

Conclusion

This research presents Q-Align as a viable approach for training LMMs on visual scoring tasks using discrete text-defined levels, a method that delivers improved performance and generalization on several visual scoring benchmarks. Looking forward, combining text-defined level-based learning with LLMs may become a cornerstone of visual quality assessment systems, offering adaptability and interpretability as built-in model features. Such advances open new avenues for human-aligned machine evaluation across an ever-expanding spectrum of visual content.

Authors (14)
  1. Haoning Wu
  2. Zicheng Zhang
  3. Weixia Zhang
  4. Chaofeng Chen
  5. Liang Liao
  6. Chunyi Li
  7. Yixuan Gao
  8. Annan Wang
  9. Erli Zhang
  10. Wenxiu Sun
  11. Qiong Yan
  12. Xiongkuo Min
  13. Guangtao Zhai
  14. Weisi Lin