AgroBench: Vision-Language Model Benchmark in Agriculture (2507.20519v1)
Abstract: Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLMs across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at https://dahlian00.github.io/AgroBenchPage/ .
Summary
- The paper introduces AgroBench, a comprehensive benchmark featuring seven agricultural tasks with expert annotations to evaluate VLM performance.
- It compares closed-source and open-source models, revealing significant gaps in fine-grained tasks such as weed identification.
- The study highlights the need for domain-specific pretraining and enhanced visual attention in VLMs for automated crop monitoring and management.
AgroBench: A Comprehensive Vision-Language Model Benchmark for Agriculture
Motivation and Context
The development of robust, generalizable vision-language models (VLMs) for agriculture is critical for automating complex tasks such as disease identification, pest management, and crop monitoring. Existing agricultural datasets are limited in scope, often focusing on narrow tasks or relying on synthetic annotations. AgroBench addresses these limitations by providing a large-scale, expert-annotated benchmark spanning seven key agricultural tasks, with a breadth of categories unmatched in prior work. The benchmark is designed to rigorously evaluate both open-source and closed-source VLMs in realistic, expert-driven scenarios.
Figure 1: Examples of labeled images for DID, PID, and WID tasks. AgroBench includes 682 crop-disease pairs, 134 pest categories, and 108 weed categories, prioritizing real farm settings.
Benchmark Design and Task Coverage
AgroBench comprises seven tasks, each reflecting a critical aspect of agricultural engineering and real-world farming:
- Disease Identification (DID): Fine-grained classification of crop diseases, with 1,502 QA pairs across 682 crop-disease combinations.
- Pest Identification (PID): Recognition of 134 pest species, including multiple growth stages.
- Weed Identification (WID): Detection and classification of 108 weed species, with bounding box localization.
- Crop Management (CMN): Decision-making for optimal crop practices based on visual cues.
- Disease Management (DMN): Recommendation of intervention strategies for diseased crops, considering severity and context.
- Machine Usage QA (MQA): Selection and operation of agricultural machinery for specific tasks.
- Traditional Management (TM): Identification and explanation of sustainable, traditional farming methods.
Each task is formulated as a multiple-choice question, with distractors carefully selected to challenge both visual and domain reasoning. All annotations are performed by agronomist experts, ensuring high-quality, domain-accurate supervision.
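For concreteness, a single benchmark item in this format could be represented roughly as follows; the field names, option count, and file path are illustrative assumptions, not AgroBench's published schema.

```python
# Hypothetical multiple-choice QA record (field names are assumptions,
# not AgroBench's actual data format).
qa_item = {
    "task": "DID",                           # one of: DID, PID, WID, CMN, DMN, MQA, TM
    "image": "images/tomato_leaf_0041.jpg",  # illustrative path
    "question": "Which disease is affecting the crop shown in the image?",
    "options": {
        "A": "Early blight",
        "B": "Late blight",
        "C": "Septoria leaf spot",
        "D": "Bacterial spot",
    },
    "answer": "B",                           # expert-annotated ground truth
}
```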
Figure 2: Seven benchmark tasks in AgroBench, illustrating the diversity of topics and category coverage. Task accuracy is averaged to mitigate QA count imbalance.

Figure 3: Examples of QA pairs for CMN and DMN tasks, demonstrating nuanced differences in management recommendations based on crop type and disease severity.
Dataset Construction and Annotation Protocol
AgroBench is constructed from a curated pool of ~50,000 images, with a strong emphasis on real-world farm conditions. Images are sourced from public datasets and expert-licensed collections, with laboratory images used only when field data is unavailable. The final dataset contains 4,342 QA pairs, each requiring visual input for correct resolution.
Annotation is performed exclusively by domain experts, with strict guidelines prohibiting the use of LLM-generated knowledge for QA creation. This process ensures that the benchmark reflects authentic agricultural expertise and avoids the propagation of synthetic or internet-derived errors.
Evaluation of Vision-Language Models
AgroBench is used to evaluate a suite of state-of-the-art VLMs, including both closed-source (GPT-4o, Gemini 1.5-Pro/Flash) and open-source (QwenVLM, LLaVA-Next, CogVLM, EMU2Chat) models. Human performance is also measured as a reference, using a cohort of agriculture graduates.
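A minimal evaluation loop consistent with this setup is sketched below; `query_vlm` is a hypothetical stand-in for whichever closed- or open-source model is being queried, and per-task accuracies are macro-averaged as noted for Figure 2.

```python
from collections import defaultdict

def evaluate(dataset, query_vlm):
    """Score a VLM on AgroBench-style multiple-choice items.

    `dataset` is an iterable of QA records like the one sketched above;
    `query_vlm(image_path, question, options)` is a hypothetical callable
    that returns a single option letter such as "A".
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in dataset:
        pred = query_vlm(item["image"], item["question"], item["options"])
        total[item["task"]] += 1
        if pred.strip().upper().startswith(item["answer"]):
            correct[item["task"]] += 1

    per_task = {t: correct[t] / total[t] for t in total}
    # Macro-average over tasks so tasks with many QA pairs do not dominate.
    macro_avg = sum(per_task.values()) / len(per_task)
    return per_task, macro_avg
```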
Main Results
- Closed-source VLMs (e.g., GPT-4o, Gemini 1.5-Pro) consistently outperform open-source models across all tasks, with GPT-4o achieving the highest overall accuracy.
- Open-source VLMs (notably QwenVLM-72B) approach closed-source performance in some tasks but lag significantly in fine-grained identification, especially weed identification.
- Weed Identification (WID) is the most challenging task; most open-source models perform at or near random, and even the best closed-source model (Gemini 1.5-Pro) achieves only 55.17% accuracy.
- Disease Management (DMN) and Crop Management (CMN) show higher model accuracy, indicating that VLMs are more adept at contextual reasoning than at fine-grained visual discrimination.
Ablation Studies and Analysis
Visual vs. Textual Input
Ablation experiments with text-only input confirm that visual information is essential for most tasks. However, models can sometimes infer plausible answers for management tasks based on common-sense or statistical priors, even when visual cues are absent.
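One way to run this ablation is to pose the identical question with and without the image attached, as in the hedged sketch below; the helper functions are assumptions, not the paper's exact protocol.

```python
def build_prompt(question, options):
    # Render the multiple-choice question as plain text.
    lines = [question] + [f"{k}. {v}" for k, v in options.items()]
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)

def run_ablation(item, query_with_image, query_text_only):
    """Ask the same question with and without the image attached.

    Both query functions are hypothetical callables for the same model;
    `query_with_image` receives the image path plus the prompt, while
    `query_text_only` receives the prompt alone and must rely on priors.
    """
    prompt = build_prompt(item["question"], item["options"])
    return query_with_image(item["image"], prompt), query_text_only(prompt)
```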
Chain of Thought (CoT) Reasoning
Incorporating CoT exemplars yields marginal improvements in accuracy, particularly for tasks requiring stepwise reasoning (PID, WID, CMN, TM). However, performance saturates quickly, and the benefit is not uniform across all tasks.
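A CoT-style prompt for these tasks could be assembled along the following lines; the exemplar text and instruction wording are illustrative, not the prompts actually used in the paper.

```python
# Illustrative one-shot CoT exemplar; not the paper's actual prompt text.
COT_EXEMPLAR = (
    "Question: Which weed species is shown in the image?\n"
    "Options: A. Common lambsquarters  B. Redroot pigweed  C. Crabgrass  D. Purslane\n"
    "Reasoning: The plant has narrow, grass-like leaves and spreading stems,\n"
    "which rules out the broadleaf options A, B, and D.\n"
    "Answer: C"
)

def build_cot_prompt(question, options):
    # Prepend the exemplar and ask the model to reason before answering.
    options_str = "  ".join(f"{k}. {v}" for k, v in options.items())
    return f"{COT_EXEMPLAR}\n\nQuestion: {question}\nOptions: {options_str}\nReasoning:"
```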
Figure 4: Results of seven benchmark tasks with Chain of Thought (CoT). CoT provides modest gains, especially in reasoning-intensive tasks.
Error Taxonomy
A detailed manual error analysis of GPT-4o outputs reveals the following distribution (a tallying sketch follows the list):
- Lack of Knowledge (51.92%): The dominant error mode, where the model lacks fine-grained agricultural knowledge (e.g., symptomatology, pest morphology, management protocols).
- Perceptual Error (32.69%): Failures in visual attention or recognition, such as missing small objects or misinterpreting scene context.
- Reasoning Error (7.60%): Incorrect logical comparison or stepwise deduction, less prevalent than in general-domain benchmarks.
- Other Errors (7.79%): Includes shortcut reasoning, double answers, misinterpretation, and refusal to answer.
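Percentages of this kind can be computed from per-example labels with a simple tally, as in the sketch below; the label strings and counts are hypothetical and do not reproduce the paper's exact annotation data.

```python
from collections import Counter

# Hypothetical per-example error labels from manual review; counts are
# illustrative only.
error_labels = (
    ["lack_of_knowledge"] * 54
    + ["perceptual_error"] * 34
    + ["reasoning_error"] * 8
    + ["other"] * 8
)

counts = Counter(error_labels)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {100 * n / total:.2f}%")
```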
Figure 5: Error analysis on seven benchmark tasks with GPT-4o, categorizing the main failure modes.
Figure 6: Error examples of GPT-4o, illustrating Lack of Knowledge and Perceptual Error cases.
Implications and Future Directions
AgroBench exposes significant gaps in current VLMs' ability to perform fine-grained agricultural tasks, particularly in weed and disease identification. The dominance of knowledge-based errors suggests that web-scale pretraining is insufficient for specialized domains with long-tail categories and subtle visual distinctions. Perceptual errors further indicate the need for improved visual encoders and attention mechanisms tailored to agricultural imagery.
Practically, AgroBench provides a rigorous testbed for developing and benchmarking VLMs intended for deployment in precision agriculture, automated crop monitoring, and decision support systems. The expert-annotated, multi-task structure enables targeted model improvement and error analysis.
Theoretically, the benchmark highlights the limitations of current VLM architectures in transferring general visual-linguistic knowledge to specialized, high-stakes domains. Future research should focus on:
- Domain-adaptive pretraining and fine-tuning with expert-curated data.
- Integration of structured agricultural knowledge bases.
- Enhanced visual grounding and attention for small or occluded objects.
- Task-specific prompting and reasoning strategies.
Conclusion
AgroBench establishes a new standard for evaluating vision-language models in agriculture, with unprecedented category coverage and expert-driven QA design. Current VLMs, while strong in contextual reasoning, exhibit substantial deficiencies in fine-grained identification and domain-specific knowledge. AgroBench will serve as a critical resource for advancing VLM research and deployment in sustainable, automated agriculture.
Follow-up Questions
- How do closed-source and open-source vision-language models compare in terms of agricultural task performance?
- What methodologies were used for expert annotation in constructing the AgroBench dataset?
- Which agricultural tasks in AgroBench present the most significant challenges for current VLMs?
- How can domain-adaptive pretraining improve the performance of vision-language models in agro-biological contexts?
Related Papers
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (2024)
- Benchmarking Vision Language Models for Cultural Understanding (2024)
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models (2025)
- A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis (2025)
- iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species (2025)
- AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark (2025)
- A Vision-Language Foundation Model for Leaf Disease Identification (2025)
- Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind (2025)
- WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management (2025)
- Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology (2025)