"BLINK: Multimodal LLMs Can See but Not Perceive"
Introduction to the Paper's Contributions and Findings
This research provides a critical examination of multimodal LLMs through the lens of visual perception. It argues that while these models can process visual inputs, their ability to perceive and understand those inputs in a nuanced way is limited. This is demonstrated through a carefully designed benchmark named BLINK, which encompasses 14 classic visual perception tasks that humans can solve almost instantly but that remain challenging for current models.
Experimental Design and Benchmark Description
BLINK aims to bridge the gap between classic computer vision tasks and the capabilities of current multimodal LLMs. The benchmark is distinctive in that its tasks extend beyond mere recognition or description and demand deeper perceptual understanding, such as relative depth estimation, reflectance comparison, and visual correspondence, among others. Every task is reformatted into multiple-choice questions, totalling roughly 3.9K questions over 7.3K images.
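To make the setup concrete, below is a minimal sketch of how a BLINK-style item could be represented and rendered into a multiple-choice prompt. The class, field names, and helper (`BlinkSample`, `build_prompt`) are illustrative assumptions for this summary, not the benchmark's actual data schema or API, and the example item is invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlinkSample:
    """One BLINK-style multiple-choice item (field names are illustrative, not the real schema)."""
    task: str            # e.g. "Relative_Depth" or "Visual_Correspondence" (hypothetical labels)
    images: List[str]    # paths/URLs of the image(s) the question refers to
    question: str        # natural-language question, possibly referring to visual markers
    choices: List[str]   # textual answer options
    answer: str          # gold label, e.g. "(B)"

def build_prompt(sample: BlinkSample) -> str:
    """Render the textual half of the prompt; the images are passed to the model separately."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(sample.choices))
    return (
        f"{sample.question}\n"
        f"Select the best answer from the following choices:\n"
        f"{options}\n"
        f"Answer with the option letter only."
    )

# Hypothetical item, not an actual benchmark question:
sample = BlinkSample(
    task="Relative_Depth",
    images=["kitchen.jpg"],
    question="Which of the two marked points is closer to the camera?",
    choices=["Point A", "Point B"],
    answer="(A)",
)
print(build_prompt(sample))
```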
Key Features of BLINK
- Diverse Visual Prompts: Unlike most benchmarks that rely solely on textual prompts, BLINK incorporates various forms of visual prompts to assess models' understanding of specific image regions (a sketch of this idea follows the list).
- Comprehensive Visual Perception: Tasks cover a spectrum from low-level pattern matching to high-level conceptual reasoning, requiring a nuanced understanding of visual information.
- Rapid Human Solvability: The tasks are designed to be solvable by humans within seconds, probing the near-instant perception that current multimodal LLMs are hypothesized to lack.
- Mixed Answer Formats: Questions offer a mix of image-based and text-based answer choices, challenging models to navigate and interpret multimodal information effectively.
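As an illustration of the visual-prompt idea above, here is a small sketch of how a labeled marker could be overlaid on an image region before asking a question about it. This is an assumption about how such prompts might be produced with Pillow, not the paper's actual annotation pipeline; the file path and coordinates are hypothetical.

```python
from PIL import Image, ImageDraw

def draw_point_marker(image_path: str, xy: tuple, label: str,
                      radius: int = 12, color: str = "red") -> Image.Image:
    """Overlay a labeled circle on an image region, in the spirit of the
    visual prompts BLINK uses to point a model at a specific location."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    x, y = xy
    draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                 outline=color, width=4)
    draw.text((x + radius + 4, y - radius), label, fill=color)
    return img

# Hypothetical usage: mark point "A" before asking a question about it.
marked = draw_point_marker("kitchen.jpg", (320, 240), "A")
marked.save("kitchen_marked.png")
```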
Evaluation and Findings
The paper evaluates 17 multimodal LLMs ranging in size from 7 billion to 34 billion parameters. The results show a stark contrast between human performance, which averages 95.70% accuracy across tasks, and the best-performing model (GPT-4V), which reaches only 51.26%, a gap of 44.44 percentage points. Specialist computer vision models, by contrast, substantially outperform the multimodal LLMs on many of these tasks, indicating that the limitation lies in the generalist models' perception rather than in the tasks themselves.
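For reference, scoring such a benchmark reduces to per-task multiple-choice accuracy. The sketch below shows one way this could be computed; the record fields are assumed for illustration, and only the headline figures (95.70% human average, 51.26% best model) come from the paper.

```python
from collections import defaultdict

def accuracy_by_task(records):
    """records: iterable of dicts with 'task', 'prediction', and 'answer' keys
    (assumed field names). Returns (per-task accuracy, overall accuracy) in percent."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["prediction"] == r["answer"])
    per_task = {t: 100.0 * correct[t] / total[t] for t in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_task, overall

# Headline figures reported in the paper, reproduced here for illustration:
human_avg, best_model = 95.70, 51.26
print(f"Gap to humans: {human_avg - best_model:.2f} percentage points")  # 44.44
```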
Implications and Future Research Directions
The findings point to a critical need to strengthen the perceptual capabilities of multimodal LLMs. The paper suggests that hybrid approaches which integrate the strengths of specialist vision models into multimodal LLM architectures could be a promising path forward. It also emphasizes the need for rigorous benchmarks like BLINK that test beyond simple recognition and probe deeper cognitive and perceptual understanding in AI systems.
The research challenges the community to rethink how multimodal LLMs are assessed and encourages further work on models that can not only 'see' but genuinely 'perceive' and make sense of visual information, much as humans do.
Conclusion
BLINK serves as an essential step toward understanding, and ultimately closing, the perceptual gap between human visual capabilities and current AI systems. It provides a robust framework for future work aimed at deepening the perceptual abilities of AI and guiding the development of multimodal systems that genuinely understand what they see.