
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models (2502.09696v2)

Published 13 Feb 2025 in cs.CV

Abstract: Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.

Summary

  • The paper introduces ZeroBench, a visual reasoning benchmark with 100 curated questions where state-of-the-art multimodal models achieve 0% accuracy.
  • The paper demonstrates a novel evaluation approach that uses multi-step visual questions and error analysis to uncover limitations in counting and spatial reasoning.
  • The paper compares 20 models, revealing that increased reasoning tokens do not improve performance, thus highlighting systemic visual interpretation errors.

The paper introduces ZeroBench, a visual reasoning benchmark designed to be exceptionally challenging for contemporary Large Multimodal Models (LMMs). The authors posit that existing visual benchmarks are rapidly being saturated, necessitating the creation of more difficult evaluations. ZeroBench distinguishes itself by being lightweight, comprising only 100 manually curated questions, and by being impossible, by design, for current frontier state-of-the-art (SotA) models, all of which score 0.0%.

The core contributions of the paper include:

  • The introduction of the ZeroBench benchmark, consisting of 100 hand-crafted questions and 334 subquestions to evaluate visual reasoning in LMMs.
  • An evaluation of 20 models on ZeroBench, demonstrating a 0.0% accuracy on the main questions across all models.
  • A detailed error analysis identifying common failure modes related to visual interpretation.

The introduction highlights the rapid progress on existing visual benchmarks and the corresponding need for more challenging evaluations. The authors note that many LMMs exhibit flaws in visual interpretation and reasoning, particularly in low-level tasks such as counting and spatial cognition. They argue that as headroom on existing benchmarks decreases, the benchmarks become less informative, motivating the need for "hard evals." The authors also highlight a trend toward models that spend more time "thinking," leveraging test-time compute scaling and encouraging step-by-step reasoning.

Related works are discussed in the context of visual benchmarks and difficult evaluations. The paper references benchmarks in specific domains, such as scientific figure interpretation, visual coding, geospatial sensing, and medicine, as well as application-agnostic benchmarks that evaluate visual capabilities. The authors highlight the limitations of existing benchmarks, noting that many focus on single-step reasoning in a multiple-choice setting, whereas ZeroBench emphasizes multi-step visual reasoning requiring precise answers.

The paper details the construction of the ZeroBench dataset, emphasizing the challenge of creating "impossible" questions. Questions were hand-crafted by a group of question creators, who were instructed to include a difficult visual component, require multi-step reasoning, and make each question maximally challenging. The curation pipeline involved feedback, an initial evaluation using models such as o1 pro and QVQ, a review to ensure answerability and difficulty, and adversarial filtering to remove any question answered correctly by one of the baseline models. The paper also provides statistics on ZeroBench, including the number of questions and subquestions, the proportion of single-image and multi-image questions, and the distribution of question length and image size.
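Concretely, the adversarial filtering step can be pictured as the sketch below: a candidate question survives only if none of the baseline models answers it correctly. The function signature, the model-wrapper interface, the exact-match answer check, and the number of attempts per model are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of an adversarial filtering step, under the assumptions
# stated above (not the authors' exact implementation).

from typing import Callable, Dict, List


def adversarial_filter(
    candidates: List[Dict],                          # each: {"question": ..., "image": ..., "answer": str}
    baseline_models: List[Callable[[Dict], str]],    # hypothetical wrappers that return an answer string
    attempts_per_model: int = 1,
) -> List[Dict]:
    """Keep only the questions that no baseline model answers correctly."""
    survivors = []
    for question in candidates:
        solved = False
        for model in baseline_models:
            for _ in range(attempts_per_model):
                prediction = model(question)
                # Exact-match check against the ground-truth answer; ZeroBench
                # requires precise answers, so simple normalisation suffices here.
                if prediction.strip().lower() == question["answer"].strip().lower():
                    solved = True
                    break
            if solved:
                break
        if not solved:
            survivors.append(question)
    return survivors
```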

The experimental setup section outlines the models, prompting strategies, hyperparameters, and evaluation metrics used. The benchmarked models include proprietary models such as o1 and Gemini 2 Flash Thinking, as well as open-weight models like Llama 3.2 90B and Qwen2-VL-72B-Instruct. The prompting strategy uses a simple conversational prompt with a zero-shot Chain-of-Thought (CoT) phrase. The evaluation metrics include accuracy, mean accuracy on the subquestions, pass@k, and k/k reliability. The authors also detail the inference procedure, which uses both greedy decoding and stochastic sampling.
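For concreteness, a minimal sketch of the pass@k and k/k reliability metrics is shown below, assuming exactly k stochastic responses are sampled per question and each response has already been judged correct or incorrect; the authors' exact estimator and judging procedure may differ.

```python
# Minimal sketch of pass@k and k/k reliability for the case where exactly k
# responses are sampled per question (an assumption for illustration).

from typing import List


def pass_at_k(correct: List[List[bool]]) -> float:
    """Fraction of questions with at least one correct response among the k samples."""
    return sum(any(samples) for samples in correct) / len(correct)


def k_of_k_reliability(correct: List[List[bool]]) -> float:
    """Fraction of questions answered correctly in all k samples."""
    return sum(all(samples) for samples in correct) / len(correct)


# Example: 3 questions, 5 stochastic samples each.
results = [
    [False, False, True, False, False],   # solved once -> counts for pass@5 only
    [False] * 5,                          # never solved
    [True] * 5,                           # solved every time -> counts for both metrics
]
print(pass_at_k(results))           # ~0.667
print(k_of_k_reliability(results))  # ~0.333
```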

The experimental results section presents the performance of the evaluated models on ZeroBench. The key finding is that all models score 0% pass@1 on the main questions, confirming the benchmark's difficulty. However, non-zero performance is observed in the pass@5 setting, indicating that some questions are within reach for some models. The subquestions are found to differentiate model performance, with Claude Sonnet 3.5 v2 achieving the highest score. The authors also compare proprietary and open-weight models, noting a performance gap on the subquestions.

Further analysis includes completion tokens and error analysis. The paper records the number of output tokens used by each model, finding that reasoning models use significantly more tokens without a corresponding improvement in performance. Error analysis reveals that errors are skewed towards visual interpretation rather than logical reasoning, indicating that ZeroBench is effective as a visual reasoning benchmark. The paper discusses recurrent visual interpretation errors, such as incorrectly counting objects and difficulties understanding spatial relations.

The paper concludes by discussing the future outlook for ZeroBench, including evaluating new models, predicting the timeline for progress on the benchmark, and creating difficult questions. The authors suggest that a breakthrough allowing for higher resolution inputs could lead to gains on ZeroBench.
