
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision (2309.14181v3)

Published 25 Sep 2023 in cs.CV, cs.AI, and cs.MM

Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs in answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset, consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.
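
The softmax-based scoring strategy mentioned in the abstract can be illustrated with a minimal sketch: prompt the MLLM for a quality judgement, read its next-token logits for a pair of opposing answer tokens, and convert them into a scalar score via softmax. The specific prompt, the "good"/"poor" token pair, the use of PyTorch, and the function name below are illustrative assumptions, not the paper's exact specification.

```python
import torch

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map an MLLM's next-token logits for a 'good'/'poor' judgement
    to a quality score in [0, 1] via softmax (illustrative sketch)."""
    logits = torch.tensor([logit_good, logit_poor])
    probs = torch.softmax(logits, dim=0)
    return probs[0].item()  # probability mass assigned to 'good'

# Hypothetical usage: prompt the model with e.g. "Rate the quality of this
# image." and read the logits of the candidate answer tokens; the resulting
# scores can then be correlated with human opinion scores on IQA datasets.
score = softmax_quality_score(logit_good=2.3, logit_poor=0.7)
print(f"predicted quality score: {score:.3f}")  # ~0.832
```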

Authors (11)
  1. Haoning Wu (68 papers)
  2. Zicheng Zhang (124 papers)
  3. Erli Zhang (11 papers)
  4. Chaofeng Chen (41 papers)
  5. Liang Liao (36 papers)
  6. Annan Wang (12 papers)
  7. Chunyi Li (66 papers)
  8. Wenxiu Sun (59 papers)
  9. Qiong Yan (39 papers)
  10. Guangtao Zhai (230 papers)
  11. Weisi Lin (118 papers)
Citations (105)