LazyBench: Diagnosing Multimodal Model Laziness
- LazyBench is a multimodal benchmark designed to diagnose model laziness by contrasting detailed open-ended image descriptions with simple VQA tasks.
- It employs a controlled dataset with four question types per image and uses binary human evaluation to compute metrics like LazyRate and FixRate.
- The benchmark reveals that state-of-the-art MLLMs excel in descriptive tasks but underperform in simple queries, with chain-of-thought prompting mitigating about 40% of lazy failures.
LazyBench is a manually curated multimodal benchmark designed to diagnose and quantify “model laziness” in advanced multimodal LLMs (MLLMs). Model laziness refers to an empirical phenomenon in which MLLMs excel at open-ended image descriptions but perform poorly on simple visual question-answering (VQA) tasks—such as Yes/No, multiple-choice, or short-answer questions—even when the required information is present in the image and accessible to the model. LazyBench provides a controlled dataset spanning four question types per image, enabling fine-grained analysis of this behavioral discrepancy and its prevalence across state-of-the-art models (Zhao et al., 2024).
1. Formalization of Model Laziness
Model laziness is defined per item in LazyBench. Each item $i$ carries two binary correctness indicators:
- Simple-task indicator: $s_i = 1$ if the model answers the Yes/No, multiple-choice, or short-answer question correctly; otherwise $s_i = 0$.
- Description-task indicator: $d_i = 1$ if the model’s open-ended description is judged correct; otherwise $d_i = 0$.
A lazy case occurs when the model fails the simple task while succeeding at the description, i.e., $s_i = 0$ and $d_i = 1$.
The aggregate “lazy rate” over a simple-task subset is the fraction of simple-task failures that could, in principle, be remedied using the model’s descriptive capability:

$$\mathrm{LazyRate} = \frac{\lvert \{ i : s_i = 0,\ d_i = 1 \} \rvert}{\lvert \{ i : s_i = 0 \} \rvert}$$
This metric isolates the cases where model attention to detail, rather than visual capacity, is the key limiting factor.
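The LazyRate definition above can be sketched in a few lines. The per-item indicator lists `simple` and `desc` are our own illustrative encoding (0/1 correctness flags), not the benchmark’s release format:

```python
def lazy_rate(simple, desc):
    """Fraction of simple-task failures (s_i = 0) whose description succeeded (d_i = 1).

    simple, desc: parallel lists of 0/1 correctness indicators per item.
    """
    failures = [(s, d) for s, d in zip(simple, desc) if s == 0]
    if not failures:
        return 0.0  # no simple-task failures, so no lazy cases by definition
    lazy = sum(1 for _, d in failures if d == 1)
    return lazy / len(failures)

# Toy example: 4 items, 2 simple-task failures, 1 of them lazy.
print(lazy_rate([1, 0, 0, 1], [1, 1, 0, 1]))  # 0.5
```

Note that items where both tasks fail are genuine capability failures and do not count as lazy; the denominator deliberately restricts to simple-task failures only.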
2. Benchmark Construction and Properties
LazyBench comprises:
- 101 distinct images, sampled from ImageNet and MMVP.
- 4 questions per image:
- Yes/No (ground-truth answer is always “No”; used for subject verification)
- Multiple-choice (3 shuffled options per question)
- Short-answer (single-token response to subject-centric query)
- Description (open-ended, referencing the ground-truth statement)
Image selection utilizes CLIP embeddings, retaining only image pairs with high cosine similarity yet clear, human-perceptible differences. Question curation ensures subject targeting and filters out items already solvable by GPT-4V to maintain difficulty. All correctness annotations rely on binary human evaluation. The dataset’s per-question structure links each prompt to a validated ground-truth statement, maximizing semantic comparability.
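The CLIP-based pair filtering can be sketched as follows. The embeddings are assumed to be precomputed, and the `threshold` value is purely illustrative, since the paper’s exact cutoff is not reproduced here:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_pair(emb_a, emb_b, threshold=0.9):
    """Retain an image pair only if its CLIP embeddings are highly similar.

    threshold is illustrative; human review still confirms a perceptible
    difference before the pair enters the benchmark.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Pairs passing this automatic filter would then undergo the human check for clear, perceptible differences described above.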
Dataset Statistics
| Entity | Count |
|---|---|
| Images | 101 |
| Yes/No | 101 |
| Multiple Choice | 101 |
| Short Answer | 101 |
| Description | 101 |
| Total Questions | 404 |
3. Evaluation Protocol and Metrics
All models are evaluated with temperature = 0 to ensure deterministic generation. Metrics assessed for each question type include:
- Accuracy: fraction of exactly correct answers (simple tasks and descriptions).
- LazyRate: as above, computed for Yes/No, multiple-choice, and short-answer failures.
For open-ended descriptions, correctness is determined by binary human judgment—specifically, whether the generated description is equivalent to or contains the ground-truth statement.
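As a minimal sketch of how the binary human judgments aggregate into per-type accuracy (the record layout and field names are our own, not the benchmark’s schema):

```python
from collections import defaultdict

def accuracy_by_type(records):
    """records: iterable of (question_type, correct) pairs, correct in {0, 1}.

    Returns a dict mapping each question type to its accuracy.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for qtype, correct in records:
        totals[qtype] += 1
        hits[qtype] += correct
    return {q: hits[q] / totals[q] for q in totals}

print(accuracy_by_type([("yesno", 1), ("yesno", 0), ("desc", 1)]))
# {'yesno': 0.5, 'desc': 1.0}
```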
4. Empirical Results and Interpretation
Key comparative results across four leading MLLMs—GPT-4o, GPT-4V, Gemini-1.5-pro, Claude 3—are summarized:
| Model | Yes/No Acc | Yes/No LazyRate | MC Acc | MC LazyRate | SA Acc | SA LazyRate | Desc Acc |
|---|---|---|---|---|---|---|---|
| GPT-4o | 60.40 | 75.00 | 78.22 | 37.50 | 69.37 | 58.06 | 84.16 |
| GPT-4V | 28.72 | 70.83 | 54.45 | 37.50 | 55.33 | 48.89 | 69.77 |
| Gemini-1.5-pro | 50.50 | 70.00 | 62.38 | 46.00 | 58.42 | 50.00 | 76.24 |
| Claude 3 | 34.65 | 62.12 | 54.45 | 42.42 | 48.51 | 38.09 | 59.34 |
All models demonstrate their best performance in open-ended description and their worst in binary Yes/No queries. Notably, stronger closed-source models such as GPT-4o and Gemini-1.5-pro not only yield higher raw accuracy but also exhibit the highest lazy rates (e.g., GPT-4o: 75% lazy on Yes/No), indicating that enhanced general capability does not attenuate superficial strategies for simple tasks; it may in fact accentuate them.
A plausible implication is that single-token output tasks (e.g., “Yes”, “A”) prompt a rapid, low-attention computational pathway, whereas multi-token descriptions demand iterative cross-modal attention, resulting in more thorough visual reasoning.
5. Mitigation via Chain of Thought (CoT) Prompting
To address model laziness, a CoT prompting strategy is introduced: the model first generates a description, then answers the simple question based on the extracted details. For example:
- “Please describe the motorcycle racer's outfit on his upper body.”
- “Based on your description, answer: Is he wearing a long-sleeved suit? Yes or No.”
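The two-turn protocol might be wired up as follows; `ask` is a hypothetical stand-in for any chat-style MLLM API, and the message schema is illustrative only:

```python
def cot_answer(ask, image, describe_prompt, question):
    """Two-turn CoT: describe first, then answer the simple question.

    ask: callable taking a list of message dicts and returning the model's
    text reply (a hypothetical stand-in for an MLLM chat API).
    """
    # Turn 1: elicit a focused description of the relevant region/attribute.
    description = ask([{"role": "user", "image": image, "text": describe_prompt}])
    # Turn 2: answer the simple question conditioned on that description.
    followup = f"Based on your description, answer: {question}"
    return ask([
        {"role": "user", "image": image, "text": describe_prompt},
        {"role": "assistant", "text": description},
        {"role": "user", "text": followup},
    ])
```

The key design point is that the model commits to its own description before the single-token answer, so the final turn can ground itself in already-extracted visual details.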
Key metrics:
- FixRate: proportion of original lazy cases corrected by CoT.
- Accuracy↑: improved accuracy on simple tasks.
| Model | Yes/No FixRate | Yes/No Acc↑ | MC FixRate | MC Acc↑ | Desc Acc |
|---|---|---|---|---|---|
| GPT-4o | 37.50% | 71.29 | 43.48% | 84.16 | 84.16 |
| GPT-4V | 41.67% | 52.48 | 47.92% | 66.34 | 69.28 |
| Gemini-1.5-pro | 44.00% | 64.36 | 26.82% | 67.33 | 76.24 |
| Claude 3 | 40.91% | 52.48 | 42.11% | 58.42 | 59.34 |
| LLaVA-1.5-13B | 36.36% | 50.50 | 54.55% | 53.47 | 48.51 |
Chain-of-thought prompting remediates approximately 40% of lazy failures. The largest gains appear in Yes/No accuracy (e.g., GPT-4o rising from 60.40% to 71.29%), and improvements in multiple-choice correctness often match or outpace description accuracy.
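A minimal sketch of the FixRate computation, assuming we track the indices of lazy cases under direct prompting and each item’s correctness after CoT (the names and data layout are ours, not the paper’s):

```python
def fix_rate(lazy_indices, cot_correct):
    """Fraction of originally lazy cases answered correctly after CoT.

    lazy_indices: item indices that were lazy under direct prompting.
    cot_correct: mapping from item index to 0/1 correctness after CoT.
    """
    if not lazy_indices:
        return 0.0
    return sum(cot_correct[i] for i in lazy_indices) / len(lazy_indices)

# Toy example: 3 lazy cases, 2 fixed by CoT.
print(fix_rate([0, 2, 3], {0: 1, 2: 0, 3: 1}))
```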
6. Broader Analysis and Implications
Analysis on the VQA-v2 dataset using the “Don’t Be Lazy (Doby)” framework reveals that, for LLaVA-1.5-13B, 41.15% of Yes/No failures are lazy cases. This suggests that over 40% of simple-question errors on a standard VQA benchmark may stem not from vision limitations but from insufficient utilization of available information.
Option-bias ablations—including irrelevant and converse (reversed) questions—show high accuracy (>90%) and strong visual discrimination, confirming that model laziness is not merely a result of default token bias. The reverse-laziness rate (“RevRate”) is also substantially lower than the lazy rate (e.g., GPT-4o: 37.5% vs. 75.0%), ruling out random guessing as the primary cause.
The root-cause hypothesis posits that single-token outputs induce “one-shot glance” strategies, in contrast to attentive processing required for multi-token generation. Investigating attentional mechanisms underlying model laziness constitutes a significant open problem.
7. Conclusions and Future Directions
LazyBench provides a systematic means to formalize, quantify, and analyze model laziness in MLLMs. With 101 images and 404 annotated questions spanning four parallel task types, LazyBench demonstrates a consistent and significant gap: models are substantially better at image descriptions than at simple VQA. All evaluated leading MLLMs exhibit lazy rates above 60% on simple tasks, with stronger models often being relatively lazier. Preliminary chain of thought prompting mitigates around 40% of lazy cases. Future directions include probing internal attention patterns for the origins of laziness, expanding the LazyBench suite to encompass additional behavioral phenomena, and developing robust frameworks for effective laziness mitigation (Zhao et al., 2024).