Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds

Published 4 Sep 2025 in cs.LG, cs.AI, cs.CL, and cs.SD | (2509.04166v1)

Abstract: Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models' properties with linear probing on time-averaged representations. We then extend the approach to account for the effect of time-wise information with other downstream architectures. Finally, we study the implication of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.


Summary

The paper demonstrates that pre-trained speech models (HuBERT, WavLM, XEUS) can effectively encode and classify animal vocalizations in bioacoustic tasks.
It evaluates the models with linear probing and adds time-weighted averaging to exploit contextual information in lengthy animal sound recordings.
The study highlights that robustness to noise and frequency variations is essential for adapting speech models to diverse bioacoustic applications.


This paper investigates how well self-supervised speech models transfer to bioacoustic detection and classification tasks. It focuses on three models, HuBERT, WavLM, and XEUS, and on how they can be adapted to process animal sounds across species and taxa.

Introduction

Self-supervised learning (SSL) has significantly improved speech processing: models trained on large unlabeled datasets achieve notable gains on speech tasks. This paper examines whether these advances extend to animal vocalizations, for which large labeled datasets are scarce. The authors aim to understand how well pre-trained speech representations transfer across domains to bioacoustic tasks.

Methods

The authors used three pre-trained SSL speech models, HuBERT, WavLM, and XEUS, to assess how well they generate rich representations of animal sounds. The models were evaluated on 11 bioacoustic tasks spanning various species. Linear probes were trained on the models' representations, and time-weighted averaging (T-WA) of those representations was used to retain contextual information, which may be crucial for long sound samples.
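The pooling step can be illustrated with a small sketch. This summary does not specify how the T-WA weights are obtained, so the softmax over per-frame scores below is an assumption, as are the function names `time_average` and `time_weighted_average` and the random array that stands in for one layer's frame-level output.

```python
import numpy as np

def time_average(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) sequence into one (dim,) vector with a plain mean."""
    return frames.mean(axis=0)

def time_weighted_average(frames: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Weighted mean over time.

    `scores` has shape (num_frames,). A softmax turns them into non-negative
    weights that sum to 1. This weighting scheme is an assumption made for
    illustration, not the paper's confirmed method.
    """
    shifted = scores - scores.max()              # subtract the max for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()
    return weights @ frames                      # (num_frames,) @ (num_frames, dim) -> (dim,)

# Toy stand-in for one layer's output on one clip: 50 frames, 8 dimensions.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))

pooled_mean = time_average(frames)

# Equal scores reduce the weighted version to the plain mean.
pooled_uniform = time_weighted_average(frames, np.zeros(50))

# Scores that strongly favour frame 10 pull the pooled vector towards frame 10.
peaked_scores = np.full(50, -50.0)
peaked_scores[10] = 0.0
pooled_peaked = time_weighted_average(frames, peaked_scores)
```

In a trained setup the scores would come from learned parameters, so the probe can emphasise the frames that contain the vocalization rather than background.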

Figure 1: Workflow of the transfer learning method.

The study drew its datasets from the publicly available BEANS benchmark, covering tasks such as animal species classification, individual identification, and call-type detection. Performance was measured with accuracy and mean average precision (mAP), with the audio matched to the sample rate required by the pre-trained speech models.
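For readers unfamiliar with mAP, the sketch below uses the common rank-based definition: per-class average precision, averaged over classes. The benchmark's own implementation may differ in details such as tie handling, and the function names and toy labels are illustrative assumptions.

```python
import numpy as np

def average_precision(labels: np.ndarray, scores: np.ndarray) -> float:
    """AP for one class: mean of precision@k taken at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")   # highest score first
    hits = labels[order].astype(float)
    if hits.sum() == 0:
        return float("nan")                      # AP is undefined without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(label_matrix: np.ndarray, score_matrix: np.ndarray) -> float:
    """Average per-class AP over columns, ignoring classes with no positives."""
    aps = [
        average_precision(label_matrix[:, c], score_matrix[:, c])
        for c in range(label_matrix.shape[1])
    ]
    return float(np.nanmean(aps))

# Four clips, two classes (multi-label style: one column per class).
label_matrix = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
score_matrix = np.array([[0.9, 0.2], [0.8, 0.6], [0.7, 0.3], [0.1, 0.1]])

ap_class0 = average_precision(label_matrix[:, 0], score_matrix[:, 0])  # (1 + 2/3) / 2
map_score = mean_average_precision(label_matrix, score_matrix)         # (5/6 + 1) / 2
```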

Results

The results demonstrated that HuBERT, WavLM, and XEUS encode enough bioacoustic information to perform well across various animal taxa. The best performance was observed in layers 3-11 for HuBERT, 4-15 for WavLM, and 2-6 for XEUS, in line with previous research showing that the deepest layers do not yield the best representations. Time-weighted averaging brought notable improvements, particularly on datasets with longer sound samples.
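A layer-wise comparison of this kind can be sketched as fitting one linear probe per layer and comparing held-out accuracy. The closed-form ridge probe below is a stand-in chosen to keep the example short, not the probe training used in the paper. The synthetic features, the layer indices (1, 6, 12), and all function names are assumptions.

```python
import numpy as np

def fit_linear_probe(X, y, num_classes, l2=1e-3):
    """Closed-form ridge regression onto one-hot targets; returns weights incl. bias row."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a bias column
    Y = np.eye(num_classes)[y]                   # one-hot targets
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def probe_accuracy(W, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(((Xb @ W).argmax(axis=1) == y).mean())

def best_layer(train_feats, y_train, val_feats, y_val, num_classes):
    """Fit one probe per layer; return (best layer, {layer: validation accuracy})."""
    scores = {}
    for layer, X_train in train_feats.items():
        W = fit_linear_probe(X_train, y_train, num_classes)
        scores[layer] = probe_accuracy(W, val_feats[layer], y_val)
    return max(scores, key=scores.get), scores

# Synthetic stand-in for pooled features: only layer 6 carries class information.
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=200)
y_val = rng.integers(0, 2, size=100)
class_means = np.array([[-3.0] * 4, [3.0] * 4])

def fake_layer(y, informative):
    noise = rng.normal(size=(len(y), 4))
    return class_means[y] + noise if informative else noise

layers = (1, 6, 12)
train_feats = {layer: fake_layer(y_train, layer == 6) for layer in layers}
val_feats = {layer: fake_layer(y_val, layer == 6) for layer in layers}

layer, scores = best_layer(train_feats, y_train, val_feats, y_val, num_classes=2)
```

Selecting the layer on a validation split, rather than on the test split, keeps the reported test score free of selection bias.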

Figure 2: Performance for the Egyptian fruit bats dataset on the 10th layer with pitch shifting (T-A).

Figure 3: Performance for the Egyptian fruit bats dataset on the 10th layer with noise addition (T-A).

Discussion

The study explores four main factors that affect performance: model robustness to noise, overlapping species vocalizations, time-related representation, and differences in frequency range relative to speech.

Model Robustness and Cross-species Transfer

Noise-robust pre-training setups (WavLM and XEUS) improved robustness to the low signal-to-noise ratios that are common in bioacoustic data. The multilingual pre-training of XEUS further contributed to superior performance, although a comparison against simpler baseline models revealed substantial intrinsic capabilities even without extensive pre-training.

Temporal Analysis

Temporal information proved crucial: preserving variability over time was advantageous on datasets with lengthy recordings. Linear probes extracted signal information efficiently, outperforming more complex recurrent setups.

Frequency and Noise Impact

Shifting frequencies and managing noise levels provided deeper insights into model limitations, with observations indicating robustness across varying frequency ranges. However, extreme manipulations failed to enhance bioacoustic signal clarity beyond certain thresholds.
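The noise-addition experiment can be illustrated by mixing a signal with noise at a chosen signal-to-noise ratio. The noise type and SNR levels used in the paper are not stated in this summary. The white Gaussian noise, the 2 kHz tone standing in for a call, the 16 kHz sample rate, and the `mix_at_snr` helper are therefore all assumptions.

```python
import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float):
    """Scale `noise` so that signal power / noise power matches `snr_db`, then add it.

    Returns the noisy mixture and the scaled noise.
    """
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return signal + scaled_noise, scaled_noise

sample_rate = 16000                                # assumed input rate for the speech models
t = np.arange(sample_rate) / sample_rate           # one second of audio
call = np.sin(2 * np.pi * 2000 * t)                # toy 2 kHz tone standing in for a call

rng = np.random.default_rng(0)
noise = rng.normal(size=sample_rate)               # white Gaussian noise (assumption)

noisy, scaled_noise = mix_at_snr(call, noise, snr_db=5.0)

# Sanity check: recompute the SNR from the two components.
measured_snr_db = 10 * np.log10(np.mean(call ** 2) / np.mean(scaled_noise ** 2))
```

Sweeping `snr_db` over a range and re-running the probe at each level would produce a performance-versus-noise curve of the kind the noise-addition figure describes.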

Conclusion

The studied speech models show strong potential for bioacoustic applications and offer a competitive alternative to domain-specific models. Speech-based SSL frameworks are a viable path for bioacoustic research, with room for further gains from enhanced pre-training strategies and robust data management. Foundation models shared between human speech and animal sounds could lead to significant breakthroughs in computational bioacoustics.

Overall, this paper underscores the efficiency and adaptability of speech models in bioacoustic scenarios, advocating for continued exploration into cross-domain transfer learning methodologies.
