LazyBench: Diagnosing Multimodal Model Laziness
- LazyBench is a multimodal benchmark designed to diagnose model laziness by contrasting detailed open-ended image descriptions with simple VQA tasks.
- It employs a controlled dataset with four question types per image and uses binary human evaluation to compute metrics like LazyRate and FixRate.
- The benchmark reveals that state-of-the-art MLLMs excel in descriptive tasks but underperform in simple queries, with chain-of-thought prompting mitigating about 40% of lazy failures.
LazyBench is a manually curated multimodal benchmark designed to diagnose and quantify “model laziness” in advanced multimodal LLMs (MLLMs). Model laziness refers to an empirical phenomenon in which MLLMs excel at open-ended image descriptions but perform poorly on simple visual question-answering (VQA) tasks—such as Yes/No, multiple-choice, or short-answer questions—even when the required information is present in the image and accessible to the model. LazyBench provides a controlled dataset spanning four question types per image, enabling fine-grained analysis of this behavioral discrepancy and its prevalence across state-of-the-art models (Zhao et al., 2024).
1. Formalization of Model Laziness
Model laziness is defined per item in LazyBench. Each item $i$ carries two binary correctness indicators:
- Simple-task indicator: $s_i = 1$ if the model answers the Yes/No, multiple-choice, or short-answer question correctly; otherwise $s_i = 0$.
- Description-task indicator: $d_i = 1$ if the model’s open-ended description is judged correct; otherwise $d_i = 0$.
A lazy case occurs when the model fails the simple task while succeeding at the description, i.e., $s_i = 0$ and $d_i = 1$.
The aggregate “lazy rate” over a simple-task subset is the fraction of simple-task failures that could, in principle, be remedied using the model’s descriptive capability:

$$\mathrm{LazyRate} = \frac{\lvert \{ i : s_i = 0,\ d_i = 1 \} \rvert}{\lvert \{ i : s_i = 0 \} \rvert}$$
This metric isolates the cases where model attention to detail, rather than visual capacity, is the key limiting factor.
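The LazyRate definition above can be sketched in a few lines. The per-item indicator lists `simple` and `desc` are our own illustrative encoding (0/1 correctness flags), not the benchmark’s release format:

```python
def lazy_rate(simple, desc):
    """Fraction of simple-task failures (s_i = 0) whose description succeeded (d_i = 1).

    simple, desc: parallel lists of 0/1 correctness indicators per item.
    """
    failures = [(s, d) for s, d in zip(simple, desc) if s == 0]
    if not failures:
        return 0.0  # no simple-task failures, so no lazy cases by definition
    lazy = sum(1 for _, d in failures if d == 1)
    return lazy / len(failures)

# Toy example: 4 items, 2 simple-task failures, 1 of them lazy.
print(lazy_rate([1, 0, 0, 1], [1, 1, 0, 1]))  # 0.5
```

Note that items where both tasks fail are genuine capability failures and do not count as lazy; the denominator deliberately restricts to simple-task failures only.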
2. Benchmark Construction and Properties
LazyBench comprises:
- 101 distinct images, sampled from ImageNet and MMVP.
- 4 questions per image:
- Yes/No (ground-truth answer is always “No”; used for subject verification)
- Multiple-choice (3 shuffled options per question)
- Short-answer (single-token response to subject-centric query)
- Description (open-ended, referencing the ground-truth statement)
Image selection utilizes CLIP embeddings, retaining only image pairs with high cosine similarity yet clear, human-perceptible differences. Question curation ensures subject targeting and filters out items already solvable by GPT-4V to maintain difficulty. All correctness annotations rely on binary human evaluation. The dataset’s per-question structure links each prompt to a validated ground-truth statement, maximizing semantic comparability.
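The CLIP-based pair filtering can be sketched as follows. The embeddings are assumed to be precomputed, and the `threshold` value is purely illustrative, since the paper’s exact cutoff is not reproduced here:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_pair(emb_a, emb_b, threshold=0.9):
    """Retain an image pair only if its CLIP embeddings are highly similar.

    threshold is illustrative; human review still confirms a perceptible
    difference before the pair enters the benchmark.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Pairs passing this automatic filter would then undergo the human check for clear, perceptible differences described above.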
Dataset Statistics
| Entity | Count |
|---|---|
| Images | 101 |
| Yes/No | 101 |
| Multiple Choice | 101 |
| Short Answer | 101 |
| Description | 101 |
| Total Questions | 404 |
3. Evaluation Protocol and Metrics
All models are evaluated with temperature = 0 to ensure deterministic generation. Metrics assessed for each question type include:
- Accuracy: fraction of exactly correct answers (simple tasks and descriptions).
- LazyRate: as above, computed for Yes/No, multiple-choice, and short-answer failures.
For open-ended descriptions, correctness is determined by binary human judgment—specifically, whether the generated description is equivalent to or contains the ground-truth statement.
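As a minimal sketch of how the binary human judgments aggregate into per-type accuracy (the record layout and field names are our own, not the benchmark’s schema):

```python
from collections import defaultdict

def accuracy_by_type(records):
    """records: iterable of (question_type, correct) pairs, correct in {0, 1}.

    Returns a dict mapping each question type to its accuracy.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for qtype, correct in records:
        totals[qtype] += 1
        hits[qtype] += correct
    return {q: hits[q] / totals[q] for q in totals}

print(accuracy_by_type([("yesno", 1), ("yesno", 0), ("desc", 1)]))
# {'yesno': 0.5, 'desc': 1.0}
```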
4. Empirical Results and Interpretation
Key comparative results across four leading MLLMs—GPT-4o, GPT-4V, Gemini-1.5-pro, Claude 3—are summarized:
| Model | Yes/No Acc | Yes/No LazyRate | MC Acc | MC LazyRate | SA Acc | SA LazyRate | Desc Acc |
|---|---|---|---|---|---|---|---|
| GPT-4o | 60.40 | 75.00 | 78.22 | 37.50 | 69.37 | 58.06 | 84.16 |
| GPT-4V | 28.72 | 70.83 | 54.45 | 37.50 | 55.33 | 48.89 | 69.77 |
| Gemini-1.5-pro | 50.50 | 70.00 | 62.38 | 46.00 | 58.42 | 50.00 | 76.24 |
| Claude 3 | 34.65 | 62.12 | 54.45 | 42.42 | 48.51 | 38.09 | 59.34 |
All models demonstrate their best performance in open-ended description and their worst in binary Yes/No queries. Notably, stronger closed-source models such as GPT-4o and Gemini-1.5-pro not only yield higher raw accuracy but also exhibit the highest lazy rates (e.g., GPT-4o: 75% lazy on Yes/No), indicating that enhanced general capability does not attenuate superficial strategies for simple tasks; it may in fact accentuate them.
A plausible implication is that single-token output tasks (e.g., “Yes”, “A”) prompt a rapid, low-attention computational pathway, whereas multi-token descriptions demand iterative cross-modal attention, resulting in more thorough visual reasoning.
5. Mitigation via Chain of Thought (CoT) Prompting
To address model laziness, a CoT prompting strategy is introduced: the model first generates a description, then answers the simple question based on the extracted details. For example:
- “Please describe the motorcycle racer's outfit on his upper body.”
- “Based on your description, answer: Is he wearing a long-sleeved suit? Yes or No.”
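The two-turn protocol might be wired up as follows; `ask` is a hypothetical stand-in for any chat-style MLLM API, and the message schema is illustrative only:

```python
def cot_answer(ask, image, describe_prompt, question):
    """Two-turn CoT: describe first, then answer the simple question.

    ask: callable taking a list of message dicts and returning the model's
    text reply (a hypothetical stand-in for an MLLM chat API).
    """
    # Turn 1: elicit a focused description of the relevant region/attribute.
    description = ask([{"role": "user", "image": image, "text": describe_prompt}])
    # Turn 2: answer the simple question conditioned on that description.
    followup = f"Based on your description, answer: {question}"
    return ask([
        {"role": "user", "image": image, "text": describe_prompt},
        {"role": "assistant", "text": description},
        {"role": "user", "text": followup},
    ])
```

The key design point is that the model commits to its own description before the single-token answer, so the final turn can ground itself in already-extracted visual details.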
Key metrics:
- FixRate: proportion of original lazy cases corrected by CoT.
- Accuracy↑: improved accuracy on simple tasks.
| Model | Yes/No FixRate | Yes/No Acc↑ | MC FixRate | MC Acc↑ | Desc Acc |
|---|---|---|---|---|---|
| GPT-4o | 37.50% | 71.29 | 43.48% | 84.16 | 84.16 |
| GPT-4V | 41.67% | 52.48 | 47.92% | 66.34 | 69.28 |
| Gemini-1.5-pro | 44.00% | 64.36 | 26.82% | 67.33 | 76.24 |
| Claude 3 | 40.91% | 52.48 | 42.11% | 58.42 | 59.34 |
| LLaVA-1.5-13B | 36.36% | 50.50 | 54.55% | 53.47 | 48.51 |
Chain-of-thought prompting remediates approximately 40% of lazy failures. The largest gains appear in Yes/No accuracy (e.g., GPT-4o rising from 60.40% to 71.29%), and improvements in multiple-choice correctness often match or outpace description accuracy.
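A minimal sketch of the FixRate computation, assuming we track the indices of lazy cases under direct prompting and each item’s correctness after CoT (the names and data layout are ours, not the paper’s):

```python
def fix_rate(lazy_indices, cot_correct):
    """Fraction of originally lazy cases answered correctly after CoT.

    lazy_indices: item indices that were lazy under direct prompting.
    cot_correct: mapping from item index to 0/1 correctness after CoT.
    """
    if not lazy_indices:
        return 0.0
    return sum(cot_correct[i] for i in lazy_indices) / len(lazy_indices)

# Toy example: 3 lazy cases, 2 fixed by CoT.
print(fix_rate([0, 2, 3], {0: 1, 2: 0, 3: 1}))
```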
6. Broader Analysis and Implications
Analysis on the VQA-v2 dataset using the “Don’t Be Lazy (Doby)” framework reveals that, for LLaVA-1.5-13B, 41.15% of Yes/No failures are lazy cases. This suggests that over 40% of simple-question errors on a standard VQA benchmark may stem not from vision limitations but from insufficient utilization of available information.
Option-bias ablations—including irrelevant and converse (reversed) questions—show high accuracy (>90%) and strong visual discrimination, confirming that model laziness is not merely a result of default token bias. The reverse-laziness rate (“RevRate”) is also substantially lower than the lazy rate (e.g., GPT-4o: 37.5% vs. 75.0%), ruling out random guessing as the primary cause.
The root-cause hypothesis posits that single-token outputs induce “one-shot glance” strategies, in contrast to attentive processing required for multi-token generation. Investigating attentional mechanisms underlying model laziness constitutes a significant open problem.
7. Conclusions and Future Directions
LazyBench provides a systematic means to formalize, quantify, and analyze model laziness in MLLMs. With 101 images and 404 annotated questions spanning four parallel task types, LazyBench demonstrates a consistent and significant gap: models are substantially better at image descriptions than at simple VQA. All evaluated leading MLLMs exhibit lazy rates above 60% on simple tasks, with stronger models often being relatively lazier. Preliminary chain of thought prompting mitigates around 40% of lazy cases. Future directions include probing internal attention patterns for the origins of laziness, expanding the LazyBench suite to encompass additional behavioral phenomena, and developing robust frameworks for effective laziness mitigation (Zhao et al., 2024).