
A Vision-Language Foundation Model for Leaf Disease Identification (2505.07019v1)

Published 11 May 2025 in cs.CV

Abstract: Leaf disease identification plays a pivotal role in smart agriculture. However, many existing studies still struggle to integrate image and textual modalities to compensate for each other's limitations. Furthermore, many of these approaches rely on pretraining with constrained datasets such as ImageNet, which lack domain-specific information. We propose SCOLD (Soft-target COntrastive learning for Leaf Disease identification), a context-aware vision-language foundation model tailored to address these challenges for agricultural tasks. SCOLD is developed using a diverse corpus of plant leaf images and corresponding symptom descriptions, comprising over 186,000 image-caption pairs aligned with 97 unique concepts. Through task-agnostic pretraining, SCOLD leverages contextual soft targets to mitigate overconfidence in contrastive learning by smoothing labels, thereby improving model generalization and robustness on fine-grained classification tasks. Experimental results demonstrate that SCOLD outperforms existing vision-language models such as OpenAI-CLIP-L, BioCLIP, and SigLIP2 across several benchmarks, including zero-shot and few-shot classification, image-text retrieval, and image classification, while maintaining a competitive parameter footprint. Ablation studies further highlight SCOLD's effectiveness in contrast to its counterparts. The proposed approach significantly advances the agricultural vision-language foundation model, offering strong performance with minimal or no supervised fine-tuning. This work lays a solid groundwork for future research on models trained with long-form and simplified contexts, tasks involving class ambiguity, and multi-modal systems for intelligent plant disease diagnostics. The code for this study is available at https://huggingface.co/enalis/scold

Summary

Vision-Language Foundation Model for Leaf Disease Identification

The paper introduces SCOLD (Soft-target Contrastive Learning for Leaf Disease identification), a vision-language foundation model tailored for leaf disease identification in smart agriculture. The model is designed to integrate image and textual modalities, addressing the challenges inherent in leaf disease identification tasks. SCOLD uses a context-aware learning mechanism to mitigate the overconfidence of conventional contrastive learning, thereby improving generalization.

Overview and Methodology

SCOLD uses a dual-stream architecture, comprising a visual encoder (Swin-T) and a textual encoder (RoBERTa), that maps image and text inputs into a shared, high-dimensional embedding space. Pre-trained on the LeafNet dataset, a collection of over 186,000 image-caption pairs spanning 97 unique concepts, SCOLD outperforms existing methods without requiring extensive supervised fine-tuning. Notably, SCOLD introduces a Context-Aware Soft Target (CST) mechanism that refines the contrastive learning objective to account for semantic similarities within the data distribution, producing more nuanced embeddings.
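
To make the CST idea concrete, the following is a minimal sketch of a CLIP-style contrastive loss with context-aware soft targets, assuming a batch of paired, L2-normalized image and text embeddings and a precomputed concept-similarity matrix. The function name, the smoothing coefficient, and the way concept similarities are normalized are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_target_contrastive_loss(image_emb, text_emb, concept_sim,
                                 temperature=0.07, alpha=0.1):
    """Hypothetical sketch of a dual-encoder contrastive loss with soft targets.

    image_emb, text_emb: (B, D) L2-normalized embeddings from the two encoders.
    concept_sim: (B, B) semantic similarity between the concepts of each pair,
                 used to smooth the one-hot contrastive targets (assumption).
    """
    logits = image_emb @ text_emb.t() / temperature             # (B, B) similarity matrix
    hard_targets = torch.eye(len(logits), device=logits.device) # one-hot matched pairs
    # Blend identity targets with row-normalized concept similarities (label smoothing).
    targets_i2t = (1 - alpha) * hard_targets + alpha * concept_sim.softmax(dim=-1)
    targets_t2i = (1 - alpha) * hard_targets + alpha * concept_sim.t().softmax(dim=-1)
    loss_i2t = F.cross_entropy(logits, targets_i2t)             # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets_t2i)         # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

Smoothing the one-hot targets with concept similarities is what discourages the loss from forcing a hard, overconfident separation between semantically related disease classes.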

Experimental Validation

Empirical evaluations show SCOLD's superior performance across several benchmarks compared to popular alternatives such as CLIP, SigLIP2, and BioCLIP. In zero-shot classification across ten diverse out-of-distribution (OOD) datasets, SCOLD achieved the highest overall accuracy of 34.80%, underscoring its ability to generalize to unseen domains. SCOLD also performed robustly in few-shot classification with minimal training data, maintaining competitive accuracy across various shot configurations.
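
For context, zero-shot classification with a dual-encoder model of this kind typically reduces to comparing an image embedding against embeddings of per-class text prompts. The sketch below illustrates that pattern under assumed encoder interfaces; SCOLD's actual API may differ (see https://huggingface.co/enalis/scold).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_prompts, image_encoder, text_encoder):
    """Illustrative zero-shot classification with a dual-encoder model.

    class_prompts: list of textual descriptions, one per candidate disease class.
    image_encoder / text_encoder: assumed callables returning (1, D) and (C, D)
    embedding tensors, respectively.
    """
    img = F.normalize(image_encoder(image), dim=-1)         # (1, D) image embedding
    txt = F.normalize(text_encoder(class_prompts), dim=-1)  # (C, D) prompt embeddings
    scores = img @ txt.t()                                   # cosine similarity per class
    return scores.argmax(dim=-1)                             # index of best-matching class
```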

In image-text retrieval assessments, SCOLD displayed superior recall metrics, with R@1, R@5, and R@10 values exceeding those of baseline models. Comprehensive ablation studies further corroborate the efficacy of SCOLD's architecture, highlighting the contextual enrichment provided by its long-context prompting and CST labeling strategies.
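
Recall@K in this setting counts how often the matching caption appears among the top-K retrieved texts for each image. A small sketch of the metric, assuming paired, L2-normalized embeddings where row i of each matrix corresponds to the same image-caption pair:

```python
import torch

def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """Image-to-text Recall@K over a paired evaluation set (illustrative)."""
    sims = image_emb @ text_emb.t()                    # (N, N) similarity matrix
    ranks = sims.argsort(dim=-1, descending=True)      # ranked caption indices per image
    correct = torch.arange(len(sims), device=sims.device).unsqueeze(1)
    return {k: (ranks[:, :k] == correct).any(dim=-1).float().mean().item() for k in ks}
```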

Implications and Future Directions

SCOLD represents a significant advancement in the application of vision-language models to agriculture, notably for leaf disease diagnosis. Its deployment paves the way for intelligent diagnostic and monitoring systems that combine visual and textual modalities, facilitating sustainable farming practices.

The LeafNet dataset provides a solid backbone for SCOLD's pre-training, offering diverse plant disease information crucial for intelligent systems. Future research could extend SCOLD to object detection and segmentation, harnessing its enhanced feature representations to tackle broader agricultural challenges.

Through methodological innovations such as context-aware soft targets and enriched conceptual contexts, SCOLD sets the stage for further exploration into multimodal systems, promising improvements in crop protection and precision agriculture. This work lays the groundwork for subsequent studies aiming to develop more holistic AI-driven solutions in agriculture, substantiating the model's practical utility and theoretical foundation.
