Insights into AVES: A Self-Supervised Model for Bioacoustics
The paper "AVES: Animal Vocalization Encoder based on Self-Supervision" by Masato Hagiwara addresses a core challenge in bioacoustics: the scarcity of annotated datasets necessary for training large neural network models via supervised learning. This problem is particularly acute due to the specialized knowledge required and the significant resource investment needed for annotating bioacoustic data. The proposed solution is AVES, a transformer-based model employing self-supervised learning to encode animal vocalizations effectively without the need for labeled data. AVES stands as a robust alternative to traditional supervised models, including customized CNNs, which have been the prevalent choice so far.
Methodological Foundations and Innovations
AVES leverages HuBERT, a self-supervised approach originally designed for human speech, and adapts it to learn representations of animal audio. HuBERT operates on raw waveforms and performs acoustic unit discovery by k-means clustering of MFCC features, producing pseudo-labels for pretraining. AVES follows the same recipe: a CNN encoder converts the raw waveform into continuous frame-level representations, and a transformer encoder is trained to predict the pseudo-labels of masked frames. This adaptation enables AVES to capture complex representations of animal vocalizations, validated through animal sound classification and detection tasks.
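To make the pretraining objective concrete, here is a minimal sketch of HuBERT-style pseudo-label generation, not the paper's exact pipeline: it clusters MFCC frames with k-means and uses the cluster ids as the discrete targets the transformer learns to predict for masked frames. The 16 kHz sample rate and 100 clusters are assumptions for illustration, not values taken from the paper.

```python
# Illustrative sketch: HuBERT-style pseudo-label generation by clustering
# MFCC features of unlabeled audio with k-means.
import torch
import torchaudio
from sklearn.cluster import KMeans

SAMPLE_RATE = 16000  # assumption: 16 kHz mono audio
N_CLUSTERS = 100     # assumption: number of discrete "acoustic units"

mfcc = torchaudio.transforms.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=13)

def frame_features(waveform: torch.Tensor) -> torch.Tensor:
    """Compute per-frame MFCC features: (1, time) -> (frames, 13)."""
    return mfcc(waveform).squeeze(0).transpose(0, 1)

def make_pseudo_labels(waveforms: list) -> list:
    """Assign each frame of each clip to a cluster; the cluster ids act as
    pseudo-labels that the transformer is trained to predict for masked frames."""
    feats = [frame_features(w) for w in waveforms]
    all_frames = torch.cat(feats).numpy()
    kmeans = KMeans(n_clusters=N_CLUSTERS).fit(all_frames)
    return [torch.from_numpy(kmeans.predict(f.numpy())) for f in feats]
```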
The work also probes the efficacy of self-supervised representation learning in a domain traditionally dominated by CNNs, mirroring recent trends in NLP and computer vision where self-supervised models have seen substantial success. The results indicate that self-supervised pretrained models can be fine-tuned competitively for bioacoustic tasks such as species classification and vocalization detection.
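As an illustration of how such an encoder might be fine-tuned, the sketch below attaches a linear classification head to a HuBERT-style encoder and mean-pools the frame embeddings before classification. torchaudio's hubert_base is used purely as a stand-in for the released AVES weights, and the class count and pooling strategy are assumptions rather than details from the paper.

```python
# Minimal fine-tuning sketch, assuming a HuBERT-style encoder as a stand-in
# for the released AVES checkpoint.
import torch
import torchaudio

NUM_SPECIES = 10  # assumption: number of target classes in the downstream task

class BioacousticClassifier(torch.nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Stand-in encoder; in practice one would load the pretrained AVES weights.
        self.encoder = torchaudio.models.hubert_base()
        self.head = torch.nn.Linear(768, num_classes)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        features, _ = self.encoder(waveforms)  # (batch, frames, 768)
        pooled = features.mean(dim=1)          # average over time
        return self.head(pooled)               # class logits

model = BioacousticClassifier(NUM_SPECIES)
logits = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0, 3]))
loss.backward()
```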
Experimental Validation and Results
The empirical evidence in the paper is compelling: AVES outperforms state-of-the-art baselines, including supervised models trained on annotated data. The model was evaluated on a comprehensive suite of datasets from the BEANS benchmark, covering a wide range of species and vocalization types, and achieved high accuracy and mean average precision across this diverse set of tasks, surpassing all evaluated baselines.
A comparative analysis was conducted on several pretraining configurations. The findings underscore the importance of curating relevant pretraining data, such as animal-specific subsets, for improving model performance. The AVES-bio configuration, trained on a focused dataset of bioacoustic recordings, delivered the strongest results. Furthermore, the authors suggest that sheer data volume does not necessarily translate into better performance unless the data reflects the target task domain.
Implications and Future Prospects
The implications of AVES extend both theoretically and practically across machine learning and bioacoustics. The research illustrates the potential of self-supervised learning to advance fields where annotated data is scarce, and it opens pathways for broader applications in other domains affected by label scarcity.
As noted in the conclusion, there are prospects for further enhancing AVES by scaling data and model size and by incorporating more advanced regularization techniques, directions that align with ongoing advances in model optimization and suggest fertile ground for continued research. Open-sourcing the model also makes it easier for the community to build upon and adapt it for further research and practical applications.
Conclusion
The presented research on AVES provides a substantial contribution to the domain of bioacoustics, blending self-supervised learning with domain-specific applications. It reflects a significant step towards leveraging unlabeled data to generate highly effective audio representations, setting the stage for a new paradigm in analyzing and understanding animal vocalizations. As the field evolves, such frameworks promise to play a pivotal role in translating complex bioacoustic data into valuable insights, driven by the power of self-supervision.