CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale (2405.17537v3)
Abstract: Measuring biodiversity is crucial for understanding ecosystem health. Prior work has developed machine learning models for taxonomic classification of photographic images and of DNA separately; here we introduce a multimodal approach that combines both, using CLIP-style contrastive learning to align images, DNA barcodes, and text-based representations of taxonomic labels in a unified embedding space. This enables accurate classification of both known and unseen insect species without task-specific fine-tuning, and is the first application of contrastive learning to fuse DNA and image data. Our method surpasses previous single-modality approaches by over 8% accuracy on zero-shot learning tasks, demonstrating its effectiveness for biodiversity studies.
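The abstract describes aligning three modalities (image, DNA barcode, taxonomic text) with a CLIP-style contrastive objective. A minimal sketch of one plausible formulation is below: a symmetric InfoNCE loss summed over the three modality pairs. The function names, the pairwise-sum formulation, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Rows of `a` and `b` at the same index are positive (matched) pairs;
    all other rows in the batch serve as negatives.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalise
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # (N, N) cosine similarities
    idx = np.arange(len(a))                            # positives on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))       # both directions

def trimodal_loss(img_emb, dna_emb, txt_emb, temperature=0.07):
    """Sum pairwise contrastive losses over image, DNA, and text embeddings."""
    return (info_nce(img_emb, dna_emb, temperature)
            + info_nce(img_emb, txt_emb, temperature)
            + info_nce(dna_emb, txt_emb, temperature))
```

Minimising this pulls the three embeddings of the same specimen together while pushing apart embeddings of different specimens, which is what allows zero-shot classification by nearest-neighbour lookup in the shared space.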