Explain why bird-trained models transfer better than general-audio models in bioacoustics
Identify the causal factors responsible for the superior cross-domain transfer performance of networks pretrained on large bird bioacoustic datasets (e.g., BirdNET, Perch) compared to networks pretrained on general audio datasets (e.g., YAMNet, VGGish), and determine whether this advantage arises from shared bioacoustic signal properties or from the higher acoustic complexity and variety of bird vocalizations.
References
The performance of existing pretrained networks supports similar findings by Ghani et al. (2023), who reported that networks pretrained on large and diverse bird bioacoustic datasets generalize better to other bioacoustic domains than those pretrained on more general audio. The reasons behind this remain an open research question: the advantage could be due to properties shared by signals across bioacoustic domains, or to the high innate acoustic complexity and variety of bird vocalizations compared with the general audio in AudioSet.