BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Published 25 Jun 2024 in cs.CV | (2406.17720v2)

Abstract: We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.


Authors (15)

Citations (1)


Summary

The paper introduces Arboretum, the largest multimodal dataset with 134.6 million images and 326,888 species for advanced AI biodiversity models.
It employs a streamlined pipeline with rich annotations and detailed taxonomic hierarchies to enhance model generalization and species recognition.
Evaluation of the ArborCLIP models demonstrates strong performance, with a top-1 accuracy of 91.1% on the Arboretum-Balanced benchmark.

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

The paper "Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity" presents the Arboretum dataset, the largest publicly accessible collection of captioned images designed to advance AI applications in biodiversity. Curated from the iNaturalist community science platform and meticulously vetted by domain experts, Arboretum surpasses existing datasets by an order of magnitude with its 134.6 million images covering 326,888 species. The dataset includes diverse multimodal data from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it an invaluable resource for multimodal vision-language AI models for biodiversity assessment and agricultural research.

Characteristics and Utility of Arboretum

The dataset's images are annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. Such comprehensive annotations facilitate AI models' ability to learn relationships across diverse taxonomic categories, thereby improving their generalization capabilities in various scientific contexts. Each image’s metadata integrates these annotations seamlessly, ensuring that the dataset is readily usable for AI applications without additional processing.
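As a rough illustration of how such annotations could be turned into a text caption for vision-language training, here is a minimal sketch. The `Record` class, its field names, and the caption template are assumptions made for this example; the summary does not specify the dataset's actual metadata schema or caption format.

```python
from dataclasses import dataclass

# Standard Linnaean ranks, from broadest to most specific.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


@dataclass
class Record:
    """Hypothetical per-image annotation (not the dataset's real schema)."""
    taxonomy: dict          # maps rank name -> taxon name
    common_name: str = ""   # may be missing for some species


def scientific_name(rec: Record) -> str:
    # Binomial name: genus plus species epithet.
    return f"{rec.taxonomy['genus']} {rec.taxonomy['species']}"


def build_caption(rec: Record) -> str:
    # Join whichever ranks are present, in hierarchy order.
    lineage = " ".join(rec.taxonomy[r] for r in RANKS if r in rec.taxonomy)
    parts = [f"a photo of {scientific_name(rec)}"]
    if rec.common_name:
        parts.append(f"commonly known as {rec.common_name}")
    parts.append(f"taxonomy: {lineage}")
    return ", ".join(parts)


monarch = Record(
    taxonomy={
        "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
        "order": "Lepidoptera", "family": "Nymphalidae",
        "genus": "Danaus", "species": "plexippus",
    },
    common_name="monarch butterfly",
)
caption = build_caption(monarch)
```

Putting the full lineage in the caption is one way to expose relationships between related taxa to the text encoder; other templates are equally plausible.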

Arboretum stands out due to several key features:

Scale and Diversity: Arboretum contains 134.6 million images, significantly more than other state-of-the-art datasets like TreeOfLife-10M, which includes 10.4 million images. It encompasses a remarkable range of 326,888 species, providing unparalleled species diversity.
High-Quality Annotations: Each image is carefully annotated not only with common and scientific names but also detailed taxonomic hierarchies. This allows for more nuanced training of AI models, enabling them to understand and process fine-grained taxonomic information.
Tooling Pipeline: The authors provide a streamlined tooling pipeline for curating the Arboretum dataset. This pipeline enables users to filter, visualize, and manage data subsets effectively, making the dataset highly accessible and easy to use for further research.
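The kind of subset filtering described above can be sketched as follows. This is not the authors' tooling: the row layout (plain dictionaries keyed by rank) and the helper names are hypothetical. The seven group names come from the paper's list; note that they sit at different ranks (Plantae and Fungi are kingdoms, Mollusca is a phylum, and the rest are classes), so the sketch checks several ranks.

```python
from collections import Counter

# The seven groups named in the summary.
TARGET_GROUPS = {
    "Aves", "Arachnida", "Insecta", "Plantae", "Fungi", "Mollusca", "Reptilia",
}


def group_of(row: dict):
    """Return the target group a row belongs to, or None if it has none."""
    for rank in ("kingdom", "phylum", "class"):
        if row.get(rank) in TARGET_GROUPS:
            return row[rank]
    return None


def filter_rows(rows: list) -> list:
    """Keep only rows that fall in one of the target groups."""
    return [r for r in rows if group_of(r) is not None]


def counts_by_group(rows: list) -> Counter:
    """Count rows per target group, for a quick look at the distribution."""
    return Counter(group_of(r) for r in rows)


# Toy metadata rows (hypothetical schema).
rows = [
    {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves"},
    {"kingdom": "Animalia", "phylum": "Chordata", "class": "Mammalia"},
    {"kingdom": "Plantae", "phylum": "Tracheophyta", "class": "Magnoliopsida"},
    {"kingdom": "Animalia", "phylum": "Mollusca", "class": "Gastropoda"},
]
subset = filter_rows(rows)
summary = counts_by_group(subset)
```

In this toy input the mammal row is dropped, since Mammalia is not among the seven groups.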

Development and Benchmarking of ArborCLIP

To demonstrate the dataset's utility, the authors trained ArborCLIP, a suite of vision-language models, on a 40-million-image subset of Arboretum. ArborCLIP models performed strongly across various benchmarks, particularly in zero-shot classification, which highlights their generalization capabilities. The authors report a top-1 accuracy of 91.1% on the Arboretum-Balanced benchmark.
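For readers unfamiliar with the metric, the sketch below shows how zero-shot top-1 accuracy is typically computed for a CLIP-style model: each image embedding is compared with the text embedding of every candidate label, and the most similar label is taken as the prediction. The two-dimensional vectors are made-up stand-ins for real encoder outputs, and the function names are assumptions for this example, not ArborCLIP's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))


def top1_accuracy(image_embs, true_labels, label_embs):
    """Fraction of images whose best-matching label is the true label."""
    correct = sum(
        predict(emb, label_embs) == truth
        for emb, truth in zip(image_embs, true_labels)
    )
    return correct / len(true_labels)


# Toy embeddings standing in for text-encoder and image-encoder outputs.
label_embs = {
    "Danaus plexippus": [1.0, 0.0],
    "Apis mellifera": [0.0, 1.0],
}
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
true_labels = ["Danaus plexippus", "Apis mellifera", "Apis mellifera"]

accuracy = top1_accuracy(image_embs, true_labels, label_embs)
```

Here the third image is closer to the wrong label, so two of the three predictions are correct. No labeled training images of the candidate species are needed at evaluation time, which is what makes the setting zero-shot.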

New Benchmarks and Experimental Evaluations

Several new benchmark datasets were introduced:

Arboretum-Balanced: Designed to provide a consistent basis for comparing model performance by maintaining a balanced species distribution.
Arboretum-Unseen: Focuses on evaluating model generalization to species unseen during training.
Arboretum-LifeStages: Assesses models' ability to recognize species across various developmental stages, mainly in insect species.
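The ideas behind the first two benchmarks can be illustrated with a short sketch: drawing the same number of images for every species, and keeping only evaluation images whose species never appears in training. The construction details here (the per-species quota, dropping species with too few images, the `species` key) are assumptions for illustration; the summary does not describe how the authors actually built these benchmarks.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_species, seed=0):
    """Draw exactly `per_species` rows for each species that has enough rows.

    Species with fewer rows than the quota are skipped (an assumption of
    this sketch, chosen so every included species is equally represented).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_species = defaultdict(list)
    for row in rows:
        by_species[row["species"]].append(row)
    sample = []
    for species in sorted(by_species):
        items = by_species[species]
        if len(items) >= per_species:
            sample.extend(rng.sample(items, per_species))
    return sample


def unseen_species(eval_rows, train_rows):
    """Keep evaluation rows whose species does not occur in the training rows."""
    seen = {row["species"] for row in train_rows}
    return [row for row in eval_rows if row["species"] not in seen]


# Toy data: species A has 3 images, B has 2, C has 1.
pool = (
    [{"species": "A", "id": i} for i in range(3)]
    + [{"species": "B", "id": i} for i in range(2)]
    + [{"species": "C", "id": 0}]
)
balanced = balanced_sample(pool, per_species=2)

train = [{"species": "A"}, {"species": "B"}]
held_out = unseen_species(pool, train)
```

With a quota of two, species C is left out of the balanced sample, while it is the only species that survives the unseen filter.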

ArborCLIP's performance was comprehensively evaluated against multiple existing benchmark datasets, including BioCLIP-Rare, Fungi, DeepWeeds, Confounding Species, and Insects-2. These evaluations demonstrated that ArborCLIP models achieve state-of-the-art performance in several settings, indicating the significant potential of the Arboretum dataset in advancing AI applications in biodiversity.

Implications and Future Directions

The release of Arboretum is poised to drive significant advancements in the development of AI models for biodiversity-related applications. By enabling the creation of precise and generalizable models, it stands to impact various fields such as:

Pest Control and Crop Monitoring: Enhanced species recognition can lead to more effective pest management strategies and monitoring of crop health, crucial for ensuring food security.
Biodiversity Assessment and Conservation: Robust AI models can assist in large-scale biodiversity assessments, aiding conservation efforts and the monitoring of ecological changes.
Environmental Impact Studies: Improved species identification and tracking can support studies on the impact of climate change and habitat loss on biodiversity.

Arboretum addresses many limitations of previous datasets, such as geographical biases, incomplete taxonomic information, and scalability issues. The dataset's availability as an "AI-ready" resource ensures that researchers can readily apply it to various ecological and agricultural challenges.

Conclusion

The introduction of the Arboretum dataset marks a significant advancement in the resources available for AI research in biodiversity and agriculture. Its scale, diversity, and high-quality annotations set it apart from existing datasets, providing a robust foundation for developing advanced AI models. The successful training and evaluation of ArborCLIP models underscore the dataset's potential to drive future innovations. By making Arboretum publicly available with an accessible tooling pipeline, the authors have provided an invaluable resource that is set to catalyze advancements in AI applications across biodiversity, agriculture, and environmental conservation.
