- The paper introduces VisIT-Bench, a benchmark that evaluates instruction-following vision-language models on 70 real-world-inspired instruction families comprising 592 test queries.
- It combines human and GPT-4-based automated assessments to reveal notable competency gaps; LLaMA-Adapter-v2, for example, wins only 27.4% of comparisons against human-verified reference responses.
- The benchmark’s flexible framework supports model submissions and continuous improvements, advancing real-world AI evaluation and multimodal research.
VisIT-Bench: A Vision-Language Instruction Benchmark for Real-World Applications
The paper introduces VisIT-Bench, a benchmark designed to evaluate instruction-following vision-LLMs under realistic conditions. The research addresses a longstanding challenge in artificial intelligence: developing general-purpose assistants capable of executing diverse, previously unseen tasks in collaboration with humans. Unlike conventional benchmarks that measure fixed, task-specific performance, VisIT-Bench offers a dynamic testing ground spanning 70 instruction families that mirror real-world use.
The benchmark includes 592 challenging test queries, each paired with an instruction-conditioned caption: a dense description of the image content relevant to the instruction, which lets text-only judges such as GPT-4 participate in evaluation alongside human raters. The benchmark extends beyond standard evaluations like VQAv2 and COCO, covering tasks that range from basic recognition to complex reasoning, creative generation, and game playing. A key feature of VisIT-Bench is community participation: practitioners can submit model responses through the project's website, and the data, code, and leaderboard are publicly available.
VisIT-Bench was built by curating data around the projected capabilities of instruction-tuned vision-LLMs, yielding roughly ten instances per instruction family and 1,159 public images in total. Human annotators collected and verified the responses, producing reference outputs of higher quality than standard automated systems. The annotation process also showed that instruction-conditioned captions significantly improved task comprehension and completion.
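To make the dataset's composition concrete, here is a minimal sketch of how one VisIT-Bench instance could be represented in Python. The class and field names are illustrative assumptions, not the dataset's actual schema; they simply mirror the components described above (instruction family, images, instruction, instruction-conditioned caption, and human-verified reference).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VisITBenchInstance:
    """Illustrative record for one benchmark query; field names are assumed, not the official schema."""
    instruction_family: str               # e.g. "home renovation" or "visual reasoning"
    image_urls: List[str]                 # one or more public images referenced by the query
    instruction: str                      # the natural-language task posed to the model
    instruction_conditioned_caption: str  # dense, task-relevant image description used for text-only evaluation
    human_verified_response: str          # reference output collected and checked by annotators

# A toy instance, purely for illustration:
example = VisITBenchInstance(
    instruction_family="home renovation",
    image_urls=["https://example.com/kitchen.jpg"],
    instruction="Suggest three budget-friendly upgrades for this kitchen.",
    instruction_conditioned_caption=(
        "A small kitchen with dated oak cabinets, laminate counters, and a single window."
    ),
    human_verified_response=(
        "1) Paint the oak cabinets white. 2) Swap the cabinet hardware. 3) Add a peel-and-stick backsplash."
    ),
)
```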
Through empirical analysis, the paper documents substantial competency gaps using both human evaluations and automated judging. Notably, LLaMA-Adapter-v2 achieved a win rate of only 27.4% against the human-verified reference responses, underscoring how far current instruction-following models fall short of human-quality output. The automated GPT-4-based evaluation agreed with human preferences in 94% of the cases where annotators were unanimous, supporting the reliability of the benchmark's automatic metric.
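As a rough illustration of how a reference-based win rate like the one above could be computed, the sketch below pairs each model response with the human-verified reference and asks a judge (GPT-4 in the paper's setup) to pick the better answer. The `Judge` callable and helper names are assumptions made for illustration, not the paper's actual evaluation code.

```python
from typing import Callable, List, Tuple

# judge(instruction, caption, response_a, response_b) -> "A" or "B"
# In the paper's setup the judge is GPT-4 reasoning over the instruction-conditioned
# caption; it is abstracted here as a callable so the sketch stays self-contained.
Judge = Callable[[str, str, str, str], str]

def win_rate(
    examples: List[Tuple[str, str, str, str]],  # (instruction, caption, model_response, reference_response)
    judge: Judge,
) -> float:
    """Fraction of examples where the judge prefers the model output over the human-verified reference."""
    if not examples:
        return 0.0
    wins = 0
    for instruction, caption, model_resp, reference in examples:
        # The model response is always slot "A" here for simplicity; a real
        # evaluation would randomize the order to control for position bias.
        if judge(instruction, caption, model_resp, reference) == "A":
            wins += 1
    return wins / len(examples)

# Usage with a trivial stand-in judge that always prefers the reference:
if __name__ == "__main__":
    dummy_judge: Judge = lambda instr, cap, a, b: "B"
    queries = [("Describe the image.", "A red bicycle.", "A bike.", "A red bicycle leaning on a wall.")]
    print(win_rate(queries, dummy_judge))  # prints 0.0
```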
The implications of VisIT-Bench are twofold. Practically, it provides insight into model performance in scenarios that resemble real human engagement, enabling a nuanced understanding of model strengths and weaknesses across diverse tasks. Theoretically, it strengthens the vision-language research paradigm, encouraging innovations that close the gap between human and AI capabilities. Given its open-ended nature, VisIT-Bench serves as a platform for continuously evaluating, refining, and advancing multimodal models, promoting transparency and collaborative progress within the AI community.
Looking ahead, the paper suggests expanding the task categories, increasing the number of instances per family, adding modalities such as audio and video, and exploring multi-turn dialogues to enrich interaction models further. It also acknowledges current constraints, notably the focus on single-turn image-text tasks and the exclusion of other interaction forms. Even with these limitations, VisIT-Bench is a substantial contribution to the evolving landscape of AI evaluation benchmarks: it offers a robust framework for aligning AI models with real-world applications and drives forward the quest for versatile, high-functioning AI systems.