A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset

Published 19 Jul 2023 in cs.CV, cs.AI, and cs.LG | (2307.10455v3)

Abstract: In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment; however, the dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community. Driven by the biological nature inherent to the dataset, a characteristic long-tailed class-imbalance distribution is exhibited. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity. This paper introduces the dataset and explores the classification task through the implementation and analysis of a baseline classifier.

Summary

The paper presents a novel dataset linking over one million expert-classified insect images with DNA barcodes to bolster biodiversity research.
It establishes baseline classifiers with ResNet-50 and Vision Transformer models on data marked by class imbalance and hierarchical labels, reaching up to 99.69% accuracy for order classification.
The work paves the way for future multi-modal learning studies and enhanced ecological monitoring, with significant implications for species identification and conservation efforts.

An Overview of the BIOSCAN-1M Insect Dataset for Biodiversity Assessment

The paper contributes to biodiversity research by introducing the BIOSCAN-1M Insect Dataset. This dataset is part of a large-scale effort by the International Barcode of Life (iBOL) Consortium to catalog insect biodiversity worldwide. It encompasses over a million expert-classified images of insects, each with associated genetic information, including DNA barcode sequences. Together these form a taxonomic reference that can serve scientific, ecological, and machine learning purposes.

Dataset Characteristics and Structure

The BIOSCAN-1M Insect Dataset is curated to support image-based taxonomic assessment powered by computer vision models. The primary objective is to provide a rich resource to train and evaluate models capable of classifying species based on imagery and genetic information. Each specimen in the dataset is annotated with taxonomic labels and linked to DNA sequences, providing a robust multi-modal dataset.

A notable characteristic of the dataset is the long-tailed distribution of classes, which mirrors the biological diversity and abundance of species. The data are organized hierarchically, following Linnaean taxonomy, and cover multiple levels of granularity from order to species. However, not all samples carry labels at the finest granularity; a significant proportion are classified only at broader taxonomic levels, such as family.
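To make the partially labelled hierarchy concrete, the following is a minimal sketch of how one might record taxonomic labels and find the finest rank available for each specimen. The `RANKS` list is an illustrative subset, the field names are assumptions, and the records are hypothetical examples rather than entries taken from the dataset.

```python
from collections import Counter
from typing import Optional

# Illustrative subset of ranks, ordered from coarsest to finest (assumed names).
RANKS = ["order", "family", "genus", "species"]

def finest_labelled_rank(record: dict) -> Optional[str]:
    """Return the finest rank that carries a non-empty label, or None."""
    finest = None
    for rank in RANKS:
        if record.get(rank):
            finest = rank
    return finest

# Hypothetical records: many specimens stop at a coarse rank.
records = [
    {"order": "Diptera", "family": "Cecidomyiidae", "genus": None, "species": None},
    {"order": "Coleoptera", "family": None, "genus": None, "species": None},
    {"order": "Diptera", "family": "Chironomidae", "genus": "Chironomus", "species": None},
]

# Count how many records are resolved to each rank.
coverage = Counter(finest_labelled_rank(r) for r in records)
print(dict(coverage))
```

A tally like `coverage` shows how many samples remain usable for a classifier trained at a given taxonomic level.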

Challenges and Opportunities in Machine Learning

The paper addresses several core machine learning challenges that arise from the dataset's characteristics. The class imbalance and hierarchical classification problems are particularly pertinent. Class imbalance is significant, given the dataset's power-law distribution, posing challenges for conventional classifiers, which may perform poorly on classes with fewer samples. Hierarchical classification further complicates modeling efforts, requiring algorithms to consider nested label structures.
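One common way to counteract class imbalance is to weight each class inversely to its frequency in the training loss. The sketch below illustrates that general heuristic; it is not described in this summary as the authors' method, and the label counts are made up for illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1,
    so that every class contributes equally to a weighted loss in total.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}

# Hypothetical, heavily skewed label list.
labels = ["Diptera"] * 8 + ["Hymenoptera"] + ["Coleoptera"]
weights = inverse_frequency_weights(labels)
for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

Such weights can be passed to a weighted cross-entropy loss in most deep learning frameworks. Resampling and loss functions designed for long-tailed data are alternatives.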

The authors provide initial baseline models using deep learning approaches such as ResNet-50 and Vision Transformers (ViT). These baselines serve to demonstrate the potential of the dataset in providing a basis for robust classification tasks and highlight areas for future algorithmic improvement, particularly in handling imbalances and leveraging hierarchical labels.

Empirical Results and Baselines

In the experiments, two main classification tasks were tackled: classifying insects to order, and classifying specimens of the order Diptera to family. The models achieved high accuracy, up to 99.69% for order classification, illustrating the quality of the dataset and the effectiveness of modern deep learning architectures. The evaluation also includes confusion matrices and per-class accuracy reports, which show how performance varies with the level of taxonomic granularity and with class proportions.
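Per-class reporting matters because overall accuracy on long-tailed data is dominated by the largest classes. The sketch below uses made-up predictions, not results from the paper, to compare overall accuracy with per-class accuracy and their unweighted mean.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {cls: correct[cls] / total[cls] for cls in total}

# Hypothetical labels: one frequent class and one rare class.
y_true = ["Diptera"] * 8 + ["Coleoptera"] * 2
y_pred = ["Diptera"] * 8 + ["Diptera", "Coleoptera"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_accuracy(y_true, y_pred)
macro = sum(by_class.values()) / len(by_class)

print(f"overall accuracy: {overall:.2f}")
print(f"per-class accuracy: {by_class}")
print(f"macro-averaged accuracy: {macro:.2f}")
```

In this toy case the overall accuracy is 0.90 while the rare class is classified correctly only half the time, which the macro average of 0.75 makes visible.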

Implications for Biodiversity Research and Future Directions

The BIOSCAN-1M dataset paves the way for integrating deep learning into biodiversity surveys and ecological research, potentially transforming how species are cataloged globally. Practical applications include improved understanding of species distribution and tracking ecological changes over time, which are crucial for conservation efforts. By providing a foundation for automatic species identification, the dataset could enable biodiversity assessments at a scale that would be difficult to achieve manually.

The dataset also invites further exploration in machine learning methodologies, particularly in unsupervised learning, domain adaptation, and multi-modal learning by utilizing the genetic information alongside images. The integration of DNA sequences into classification tasks poses an intriguing challenge that could yield substantial advances in bioinformatics applications.
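As one illustration of how a nucleotide barcode could enter a model alongside an image, the sketch below converts a sequence into a fixed-length vector of k-mer counts. This is a generic sequence encoding chosen for illustration, not a method taken from the paper, and the example sequence is invented.

```python
from itertools import product

def kmer_counts(sequence, k=2):
    """Count occurrences of every length-k substring over the alphabet ACGT.

    Returns a list of 4**k counts in lexicographic k-mer order. Windows that
    contain other symbols (for example the ambiguity code N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    return [counts[m] for m in kmers]

# Invented toy sequence; real barcodes are far longer.
vector = kmer_counts("ACGTAC", k=2)
print(len(vector), sum(vector))
```

The resulting vector has the same length for every specimen, so it could be concatenated with image features or fed to a separate branch of a multi-modal model.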

In summary, the BIOSCAN-1M Insect Dataset represents a pivotal step towards enabling extensive biodiversity assessments through AI. It offers rich data for advancing machine learning techniques and contributes to the preservation and understanding of ecological diversity. As iBOL continues to expand this endeavor, the potential for transformative breakthroughs in both biological sciences and AI grows, signaling a promising intersection of fields.
