AgriBench-Omni-2K: Tri-modal Agricultural Benchmark
- AgriBench-Omni-2K is the first comprehensive tri-modal benchmark that integrates speech, vision, and text for agricultural question answering.
- It supports four evaluation tasks with both synthetic and real speech data across six languages to ensure reproducible and robust model assessment.
- Standardized metrics, unified protocols, and a dedicated toolkit facilitate transparent performance evaluation and drive advances in multimodal agricultural AI.
AgriBench-Omni-2K is the first comprehensive tri-modal (speech, vision, text) benchmark for agricultural intelligence, offering standardized evaluation protocols, reproducible tooling, and multilingual coverage across six languages. It was introduced as a core component of the AgriGPT-Omni framework to address the lack of unified benchmarks in agricultural multimodal modeling, especially those involving spoken questions, imagery, and diverse languages (Yang et al., 11 Dec 2025). AgriBench-Omni-2K structures four key tasks in question-answering and multiple-choice formats, supports extensive synthetic and real (human-recorded) speech data, and enforces stringent protocols for dataset overlap and scoring. Its design enables rigorous and reproducible evaluation of speech-vision-text models in agricultural contexts.
1. Task Definitions and Modalities
AgriBench-Omni-2K formalizes four benchmark tasks, all framed as agricultural question answering or multiple-choice selection, with explicit tri-modal input–output configurations:
- Audio-only QA (Speech→Text):
- Input: Single spoken agricultural question.
- Output: Free-form text answer in the corresponding language.
- Objective: Assess end-to-end speech comprehension and domain text generation.
- Audio+Text Multiple-Choice (Speech + Text→Text):
- Input: Spoken question and textual answer options.
- Output: Correct choice as index or text.
- Objective: Evaluate speech recognition and semantic mapping to a closed answer set.
- Multimodal QA (Speech + Image→Text):
- Input: Spoken question and a related image (e.g., plant or disease photograph).
- Output: Open-ended text answer, requiring cross-modal reasoning.
- Objective: Test cross-modal grounding (e.g., visual disease identification).
- Multimodal Multiple-Choice (Speech + Image + Text→Text):
- Input: Spoken question, image, and textual options.
- Output: Correct option selection.
- Objective: Full tri-modal alignment and contextually informed decision making.
Each task operationalizes the tri-modal setting, requiring models to fuse or align speech, text, and, where applicable, visual signals.
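The modality layout of the four tasks can be captured in a small configuration structure. The sketch below is illustrative only; the class and field names are ours, not part of the benchmark's release:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """One AgriBench-Omni-2K task and its input/output modalities (illustrative)."""
    name: str
    inputs: tuple
    output: str

# The four benchmark tasks and their tri-modal input-output configurations.
TASKS = [
    TaskSpec("audio_only_qa", ("speech",),                         "free-form text"),
    TaskSpec("audio_text_mc", ("speech", "text options"),          "option index or text"),
    TaskSpec("multimodal_qa", ("speech", "image"),                 "free-form text"),
    TaskSpec("multimodal_mc", ("speech", "image", "text options"), "option index or text"),
]

for task in TASKS:
    print(f"{task.name}: {' + '.join(task.inputs)} -> {task.output}")
```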
2. Language Diversity and Data Composition
AgriBench-Omni-2K provides systematic multilingual coverage over six languages: Mandarin Chinese ("zh-Cn"), Sichuan dialect ("zh-Sc"), Cantonese ("zh-Yue"), English ("en"), Japanese ("ja"), and Korean ("ko"). For each task and language, the dataset comprises 100 synthetic samples plus a smaller set of human-recorded clips, structured as follows:
| Task Type | Synthetic Samples / Language | Real Human-Recorded Clips / Language |
|---|---|---|
| Audio-only QA | 100 | ≈25 |
| Audio+Text Multiple-Choice | 100 | ≈25 |
| Multimodal QA | 100 | ≈25 |
| Multimodal Multiple-Choice | 100 | ≈25 |
- Synthetic: 2,400 total samples (4 tasks × 100 samples × 6 languages), generated via TTS.
- Real speech: 586 curated recordings (mean ≈98 per language, all tasks pooled), with noise and accent variation.
- Images: 100 unique images reused in multimodal tasks, yielding 1,200 image-question pairs across six languages.
- Textual artifacts: 100 base questions (originating in English, then translated), 400 multiple-choice option sets, and human-vetted ground-truth answers.
This modular arrangement enables controlled analysis across language and mode, while robustly supporting low-resource dialects.
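The composition figures above can be cross-checked with simple arithmetic; the sketch below treats the 586 real recordings as a reported total rather than a derived quantity:

```python
# Sanity check of the composition figures above.
TASKS = 4
LANGUAGES = 6
SYNTHETIC_PER_TASK_PER_LANGUAGE = 100

synthetic_total = TASKS * LANGUAGES * SYNTHETIC_PER_TASK_PER_LANGUAGE  # 2,400 TTS samples
image_question_pairs = 2 * LANGUAGES * 100                             # 1,200 (two multimodal tasks)
real_speech_total = 586                                                # reported total of curated recordings

print(synthetic_total, image_question_pairs)   # 2400 1200
print(round(real_speech_total / LANGUAGES))    # ~98 recordings per language
```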
3. Evaluation Protocols and Dataset Splits
AgriBench-Omni-2K functions exclusively as an evaluation suite, with no training split:
- Test-only suite: All tasks use a single held-out test set.
- Zero-overlap assurance: Any evaluation question whose ROUGE-L similarity to training data exceeds a set threshold is pruned; GPT-4 further filters semantic paraphrases.
- Synthetic and real partitioning: Real-speech samples are disjoint from synthetic or training sets; human-recorded samples are utilized solely for robustness testing.
- No cross-validation: Static task definitions and splits, increasing reproducibility across research efforts.
These constraints prevent information leakage, support rigorous benchmarking, and facilitate consistent cross-model comparison.
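As a rough illustration of the zero-overlap check, the sketch below applies a ROUGE-L threshold using the rouge-score package; the benchmark's exact cutoff is not reproduced here (the 0.7 value is an assumption), and the GPT-4 paraphrase-filtering stage is omitted:

```python
# Sketch of ROUGE-L-based overlap pruning; the 0.7 threshold is an assumption,
# and the GPT-4 paraphrase-filtering stage is not shown.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_contaminated(eval_question, training_questions, threshold=0.7):
    """Flag an evaluation question whose ROUGE-L F1 against any training question exceeds the threshold."""
    return any(
        scorer.score(train_q, eval_question)["rougeL"].fmeasure >= threshold
        for train_q in training_questions
    )

training_pool = ["When should rice be fertilized during the growing season?"]
print(is_contaminated("When should rice be fertilized?", training_pool))  # True with these strings -> pruned
```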
4. Standardized Evaluation Metrics
AgriBench-Omni-2K adopts open-source, unified scoring with protocols and mathematical definitions specified for each task type:
Open-ended QA (Audio-only, Multimodal QA):
- BLEU-4:
$$\mathrm{BLEU\text{-}4} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right)$$
with $p_n$ as the $n$-gram precision, $w_n = \tfrac{1}{4}$, and BP the brevity penalty.
- METEOR:
$$\mathrm{METEOR} = F_{\mathrm{mean}} \cdot (1 - \mathrm{Penalty}), \qquad F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9P}$$
where $P$ and $R$ are unigram precision/recall (after synonym and stem matching).
- ROUGE-L-F1:
$$\mathrm{ROUGE\text{-}L\text{-}F1} = \frac{2\,P_{\mathrm{lcs}}\,R_{\mathrm{lcs}}}{P_{\mathrm{lcs}} + R_{\mathrm{lcs}}}$$
with $P_{\mathrm{lcs}} = \mathrm{LCS}(X,Y)/|Y|$ and $R_{\mathrm{lcs}} = \mathrm{LCS}(X,Y)/|X|$ computed over the longest common subsequence of reference $X$ and prediction $Y$.
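A minimal scoring sketch for these open-ended metrics, using the libraries named in the toolkit (SacreBLEU, NLTK, rouge-score); this is not the benchmark's official evaluate.py, and whitespace tokenization is an assumption:

```python
# Scoring sketch for open-ended QA using SacreBLEU, NLTK, and rouge-score;
# not the benchmark's official evaluate.py.
import nltk
import sacrebleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet for synonym matching

def score_open_ended(prediction: str, reference: str) -> dict:
    bleu4 = sacrebleu.sentence_bleu(prediction, [reference]).score / 100.0  # 4-gram BLEU by default
    meteor = meteor_score([reference.split()], prediction.split())
    rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
    return {"BLEU-4": bleu4, "METEOR": meteor, "ROUGE-L-F1": rougeL}

print(score_open_ended("Apply fertilizer at the tillering stage.",
                       "Rice should be fertilized at the tillering stage."))
```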
Multiple-Choice Tasks (Audio+Text MC, Multimodal MC):
- Accuracy:
$$\mathrm{Accuracy} = \frac{\#\{\text{correct predictions}\}}{\#\{\text{evaluated questions}\}}$$
Alternative notation: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$, where $\hat{y}_i$ is the predicted option and $y_i$ the ground-truth option for question $i$.
Optional Human Evaluation:
- Pairwise Win Rate:
$$\mathrm{WinRate} = \frac{\#\{\text{pairwise wins}\}}{\#\{\text{pairwise comparisons}\}}$$
Speech Transcription Robustness:
- WER (English):
$$\mathrm{WER} = \frac{S + D + I}{N}$$
- CER (non-English): computed analogously at the character level, where $S$, $D$, $I$ are substitutions, deletions, insertions and $N$ is the reference word/character count.
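Both robustness metrics reduce to edit-distance ratios; the sketch below uses the jiwer package, which is our choice for illustration and is not named in the benchmark's toolkit:

```python
# WER/CER as edit-distance ratios; jiwer is used here for illustration only.
import jiwer

reference  = "when should rice be fertilized"
hypothesis = "when should rice be fertilised"

print("WER:", jiwer.wer(reference, hypothesis))  # word-level, used for English
print("CER:", jiwer.cer(reference, hypothesis))  # character-level, used for zh / ja / ko
```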
All metrics are reported per language, then macro-averaged to provide unified comparison. Standardized scripts and APIs ensure results are directly comparable across research groups.
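Macro-averaging weights each language equally; a minimal sketch with illustrative (not reported) per-language numbers:

```python
# Per-language scores are macro-averaged with equal weight per language
# (values below are illustrative placeholders, not reported results).
per_language_accuracy = {"zh-Cn": 0.82, "zh-Sc": 0.74, "zh-Yue": 0.76,
                         "en": 0.88, "ja": 0.80, "ko": 0.79}

macro_average = sum(per_language_accuracy.values()) / len(per_language_accuracy)
print(f"macro-averaged accuracy: {macro_average:.3f}")
```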
5. Tooling and Reproducibility
AgriBench-Omni-2K is distributed with a reproducibility-oriented toolkit:
- Docker container: `agri-omni-eval:latest`, with all required libraries (PyTorch, NLTK, SacreBLEU, rouge-score).
- Evaluation script: `evaluate.py`, callable via command-line arguments specifying predictions, references, task, and language.
- JSON I/O format: Standardized field structure for audio, images, questions, options, and ground-truths (a loader sketch follows at the end of this section):

```
{
  "id": "<unique id>",
  "audio_path": "<.wav>",
  "image_path": "<.jpg>",                        // optional
  "question": "<text transcription>",
  "options": ["opt1", "opt2", "opt3", "opt4"],   // for MC only
  "ground_truth": "<text or option index>"
}
```
- Randomization seeds: All runs use seed 42; the default beam size is five for generative decoding.
- Human evaluation GUI: Rubric-based judging interface, shuffling options to mitigate bias and support pairwise ranking.
This infrastructure ensures that results are replicable, directly comparable, and agnostic to platform idiosyncrasies or dependency drift.
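To make the record format concrete, here is a minimal loader/validator sketch; the validation rules and the example file name are our own, not part of the released toolkit:

```python
# Minimal loader/validator for the record format above (our own sketch).
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "audio_path", "question", "ground_truth"}

def load_records(path: str) -> list:
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"record {rec.get('id')} is missing fields: {missing}")
        if "options" in rec:  # multiple-choice records carry four textual options
            assert isinstance(rec["options"], list) and len(rec["options"]) == 4
    return records

# records = load_records("multimodal_mc_en.json")  # hypothetical file name
```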
6. Task Examples
Representative cases from each benchmark task are shown below, with English transcripts for clarity:
- Audio-only QA:
  - Input: "When should rice be fertilized?"
  - Output: "At the tillering stage."
- Audio+Text Multiple-Choice:
  - Input: "Which disease causes leaf curl?"
  - Options: (a) Rice blast, (b) Rust, (c) Sheath blight, (d) Powdery mildew
  - Correct: (c) Sheath blight
- Multimodal QA:
  - Image: Yellow-spotted leaf
  - Audio: "What symptom is shown?"
  - Output: "Leaf spot disease due to fungal infection."
- Multimodal Multiple-Choice:
  - Image: Wilting tomato plant
  - Audio: "Why is the plant drooping?"
  - Options: (a) Overwatering, (b) Nitrogen deficiency, (c) Verticillium wilt, (d) Aphid infestation
  - Correct: (c) Verticillium wilt
These exemplars illustrate the necessity for coherent multimodal, multilingual understanding and grounding in agricultural contexts.
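Serialized in the JSON record format from Section 5, the Audio-only QA exemplar would look roughly as follows (the id and audio_path values are hypothetical placeholders):

```python
# The Audio-only QA exemplar above, serialized in the Section 5 record format
# (id and audio_path are hypothetical placeholders).
example_record = {
    "id": "audio_qa_en_0001",
    "audio_path": "audio/en/audio_qa_en_0001.wav",
    "question": "When should rice be fertilized?",
    "ground_truth": "At the tillering stage.",
}
print(example_record["question"], "->", example_record["ground_truth"])
```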
7. Significance, Limitations, and Applications
AgriBench-Omni-2K provides the agricultural domain with its first unified benchmark for evaluating models capable of integrated speech, vision, and text reasoning across major Asian and Western languages (Yang et al., 11 Dec 2025). Its focus on question answering and decision making, combined with robust multilingual and modal variations, supports the development of inclusive agricultural intelligence and sustainable AI for low-resource regions. The absence of a training split, rigorous overlap filtering, and integrated evaluation tools set a high standard for reproducibility.
A plausible implication is that future research adopting AgriBench-Omni-2K can systematically probe multimodal, multilingual alignment and generalization, while real-speech robustness data enables deeper analysis of deployment for field applications. The protocol's restriction to evaluation tasks only may limit direct benchmarking of transfer-learning scenarios; however, its controlled environment ensures that reported model advances are not artifacts of contamination or inconsistent scoring.
By releasing all code, data, and evaluation infrastructure, AgriBench-Omni-2K positions itself as a central resource for comparative research in agricultural AI, particularly in the context of speech-vision-text integration (Yang et al., 11 Dec 2025).