Perch 2.0: Advanced Multi-taxa Bioacoustics
- The paper introduces a two-stage supervised training process with self-distillation and prototype-based classification to enhance fine-grained species differentiation.
- Perch 2.0 is defined by its innovative multi-label generalized mixup augmentation and auxiliary source-prediction head, enabling robust transfer across diverse animal taxa.
- It demonstrates outstanding performance on benchmarks like BirdSet and BEANS, achieving ROC-AUC >0.90 and superior transfer results even with minimal marine data.
Perch 2.0 is a state-of-the-art bioacoustic model engineered for large-scale, fine-grained species classification and robust transfer learning across multiple animal taxa. It was trained using a supervised regime with innovations in self-distillation and prototype-based classification. Notably, Perch 2.0 achieves superior performance on established bioacoustic benchmarks and demonstrates remarkable generalization properties, even excelling in marine transfer learning tasks despite extremely limited marine training data. This model reflects an evolutionary shift from avian-centric to broad multi-taxa bioacoustic representation learning, embodying new methodological principles in pretext task selection, data augmentation, and embedding quality.
1. Training Paradigm and Architecture
Perch 2.0 follows a two-stage supervised training process tailored for bioacoustic audio. In the initial phase, the model is trained as a standard species classifier using labeled audio from thousands of vocalizing species. In the subsequent phase, self-distillation is introduced: the predictions from a prototype-learning classifier head (inspired by ProtoPNet concepts) serve as soft targets guiding the main classifier via a stop-gradient separation. This hybrid objective encourages the main classifier to incorporate both precise and refined decision boundaries, leveraging fine-grained differences between often confusable species.
The prototype-learning classifier consists of four trainable prototypes per class acting on the spatial backbone embeddings. These prototypes facilitate both discriminative learning (an orthogonality loss penalizes overlap among prototype vectors, pushing them apart) and interpretable activation patterns for each class. A critical stop-gradient ensures that the prototype path influences the main classifier solely through its predictions, not direct parameter updates, thus shaping the representation space without destabilizing the core backbone.
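The prototype scoring and orthogonality penalty can be sketched as follows, using NumPy with toy shapes; the real model learns these parameters end-to-end, and all dimensions here are illustrative:

```python
import numpy as np

def prototype_logits(emb, prototypes):
    """Class logits as the max similarity over each class's prototypes.

    emb:        (d,) pooled backbone embedding
    prototypes: (n_classes, n_protos, d); Perch 2.0 uses 4 per class.
    """
    sims = prototypes @ emb          # (n_classes, n_protos) similarities
    return sims.max(axis=1)          # best-matching prototype per class

def orthogonality_loss(prototypes):
    """Penalize pairwise overlap among each class's prototype vectors."""
    loss = 0.0
    for P in prototypes:                                   # (n_protos, d)
        Pn = P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-normalize
        G = Pn @ Pn.T                                      # cosine Gram matrix
        loss += np.sum((G - np.eye(len(P))) ** 2)          # off-diagonal mass
    return loss / len(prototypes)
```

Perfectly orthogonal prototypes drive the penalty to zero; any overlap contributes quadratically.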
Perch 2.0 also utilizes a novel auxiliary source-prediction head. In this auxiliary task, given a fixed five-second audio window, the model predicts which of roughly 1.5 million source recordings the window was drawn from, using a linear head with a low-rank (rank 512) weight matrix to manage computational demands. This task enforces embedding sensitivity to subtle, recording-level signatures, effectively enriching the feature space.
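The low-rank trick can be sketched in NumPy; the dimensions below are toy-sized stand-ins (the actual head uses rank 512 over ~1.5 million sources, and the embedding width depends on the backbone):

```python
import numpy as np

def low_rank_logits(emb, A, B):
    """Source-ID logits through a rank-r factorized linear head.

    Replaces a dense (d, n_sources) weight matrix W with W = A @ B,
    A: (d, r), B: (r, n_sources), storing d*r + r*n_sources
    parameters instead of d*n_sources.
    """
    return (emb @ A) @ B             # never materializes the full W

rng = np.random.default_rng(0)
d, r, n_sources = 64, 8, 1000        # toy sizes; the paper uses r = 512
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, n_sources))
emb = rng.normal(size=(d,))
logits = low_rank_logits(emb, A, B)  # one logit per candidate recording
```

Computing `(emb @ A) @ B` rather than `emb @ (A @ B)` keeps both memory and compute proportional to the rank rather than the full output width.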
2. Data Set Expansion and Input Construction
Where earlier Perch releases were trained only on avian vocalizations, Perch 2.0 is constructed on a large multi-taxa corpus, combining audio from Xeno-Canto, iNaturalist, Tierstimmenarchiv, and FSD50K. The resulting data pool spans over 1.5 million recordings and more than 14,000 species or general sound classes, including birds, mammals, amphibians, insects, and environmental/non-biological sounds.
Data augmentation is implemented using a generalized mixup method: instead of standard two-component mixup, the augmentation samples the number of source windows $N$ from a Beta-binomial distribution, draws mixture weights $w_1, \dots, w_N$ from a symmetric Dirichlet, and produces the composite signal $x = \sum_{i=1}^{N} w_i x_i$ from source windows $x_1, \dots, x_N$,
with the target class label a multi-hot vector. This augmentation matches real-world acoustic scenarios where multiple species/vocalizations often overlap, and it regularizes the model for robust multi-event discrimination.
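A minimal sketch of this augmentation, with distribution parameters chosen for illustration rather than taken from the paper:

```python
import numpy as np

def generalized_mixup(windows, labels, n_classes, rng):
    """Mix N audio windows; return the composite and its soft multi-hot target.

    windows: (M, T) pool of candidate windows; labels: one class id per window.
    N is Beta-binomial (here p ~ Beta(2, 2), N = 1 + Binomial(4, p)) and the
    mixture weights come from a symmetric Dirichlet; parameters illustrative.
    """
    p = rng.beta(2.0, 2.0)
    n = 1 + rng.binomial(4, p)                     # number of sources, 1..5
    idx = rng.choice(len(windows), size=n, replace=False)
    w = rng.dirichlet(np.ones(n))                  # symmetric Dirichlet weights
    mixed = np.tensordot(w, windows[idx], axes=1)  # sum_i w_i * x_i
    target = np.zeros(n_classes)
    target[labels[idx]] = 1.0 / n                  # 1/k per present class
    return mixed, target
```

Because each present class receives mass $1/n$, the target vector still sums to one when the mixed classes are distinct.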
3. Training Objectives and Loss Design
The composite loss for Perch 2.0 integrates four main components:
- Cross-entropy loss on the species classifier, with multi-hot targets when multiple classes are present; each of the $k$ present classes receives a target entry of $1/k$.
- Self-distillation loss: cross-entropy between the softmax outputs of the linear classifier and the prototype-learning classifier, backpropagated only via the former.
- Prototype orthogonality loss: a penalty on the pairwise dot products among each class’s prototype vectors, encouraging the prototypes to remain diverse and well separated.
- Source-prediction loss: cross-entropy for identifying the original recording, via a linear projection with low-rank factorization.
These objectives are optimized jointly during training. Critically, no purely self-supervised pretext tasks are used for pre-training; the authors explicitly note that self-supervised paradigms underperform strong supervised pre-training in this application.
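The combination of objectives can be written schematically as below; the loss weights are illustrative, the prototype orthogonality term is omitted for brevity, and in a real framework the prototype softmax would sit behind a stop-gradient so that only the linear and source heads receive gradients:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def composite_loss(lin_logits, proto_logits, src_logits,
                   class_target, src_index,
                   w_distill=1.0, w_src=0.1):
    """Weighted sum of the main Perch 2.0 objectives (weights illustrative)."""
    p_lin = softmax(lin_logits)
    species_ce = -np.sum(class_target * np.log(p_lin + 1e-9))
    p_proto = softmax(proto_logits)               # soft targets, held fixed
    distill_ce = -np.sum(p_proto * np.log(p_lin + 1e-9))
    p_src = softmax(src_logits)
    source_ce = -np.log(p_src[src_index] + 1e-9)  # source-recording ID
    return species_ce + w_distill * distill_ce + w_src * source_ce
```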
4. Evaluation: Benchmarks and Transfer Performance
Perch 2.0 is evaluated on BirdSet and BEANS benchmarks:
BirdSet
BirdSet aggregates fully annotated datasets, spanning multiple regions and taxa. Perch 2.0 achieves ROC-AUC values exceeding 0.9 on most sub-datasets and class-mean AP scores that surpass all prior versions, with strong consistency across test regions.
BEANS
BEANS comprises transfer and detection tasks, including linear-probe evaluation (classification performance on new tasks using frozen embeddings) and prototype-based probing. Perch 2.0 achieves record mean accuracy and macro-averaged AP, indicating readily linearly separable and clusterable audio embeddings.
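Linear probing itself is simple: fit a lightweight classifier on frozen embeddings and measure held-out accuracy, with the backbone untouched. A sketch using a ridge-regression probe as a stand-in for the usual logistic probe (the regularizer `lam` is illustrative):

```python
import numpy as np

def linear_probe_accuracy(train_emb, train_y, test_emb, test_y,
                          n_classes, lam=1e-2):
    """Fit a ridge probe on frozen embeddings; return held-out accuracy."""
    Y = np.eye(n_classes)[train_y]                 # one-hot targets
    d = train_emb.shape[1]
    W = np.linalg.solve(train_emb.T @ train_emb + lam * np.eye(d),
                        train_emb.T @ Y)           # closed-form ridge solution
    pred = (test_emb @ W).argmax(axis=1)
    return (pred == test_y).mean()
```

If the embeddings are linearly separable, as the BEANS results indicate for Perch 2.0, even this minimal probe classifies accurately.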
Transfer Learning: Marine Audio
On marine audio datasets (NOAA PIPAN, ReefSet, DCLDE 2026), Perch 2.0’s embeddings are superior even to models trained specifically on marine mammals or fish. For example, with as few as 16 labeled marine examples per class, linear probes yield ROC-AUCs up to 0.977 for certain DCLDE tasks, outperforming marine-specialized architectures. This outcome is especially noteworthy considering almost no marine data was used during training.
Table: Performance Summary Across Benchmarks
| Benchmark | Evaluation Task | Perch 2.0 Metric (Best) | Prior Best |
|---|---|---|---|
| BirdSet | ROC-AUC (multiple) | >0.90 | <0.90 |
| BirdSet | cmAP | State-of-the-art | Lower |
| BEANS | Lin. probe accuracy | Highest | Lower |
| Marine Transfer | ROC-AUC, AP | Up to 0.977 | Lower |
Numerical details and per-dataset breakdowns are available in the full results tables in the original manuscript.
5. Model Compactness and Practicality
Perch 2.0 is constructed using an EfficientNet-B3 backbone, resulting in a model size of 12M parameters, making it deployable on consumer-grade hardware. The architecture enables fast inference and flexible embedding utilities (clustering, nearest-neighbor search, or linear probing) without fine-tuning. The embedding’s linear separability is consistently verified empirically via transfer and few-shot tasks.
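These embedding utilities need no fine-tuning; for instance, a brute-force cosine nearest-neighbor lookup over a bank of stored embeddings can be written as:

```python
import numpy as np

def nearest_neighbors(query, bank, k=5):
    """Return indices of the k most cosine-similar rows of the bank.

    query: (d,) embedding of a new recording window
    bank:  (n, d) embeddings of previously indexed windows
    """
    qn = query / np.linalg.norm(query)
    bn = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = bn @ qn                    # cosine similarity to every row
    return np.argsort(-sims)[:k]      # top-k, most similar first
```

At larger scales an approximate index would replace the brute-force scan, but the embedding-side workflow is unchanged.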
6. Broader Methodological Implications
The design choices in Perch 2.0 articulate several broader lessons for bioacoustic representation learning:
- Fine-grained supervised classification is empirically a highly robust pre-training target for audio models in dynamic and heterogeneous environments.
- Prototyped soft-label self-distillation effectively refines class boundaries in fine-grained label spaces.
- Auxiliary source-prediction with large output spaces serves as a useful proxy for generating robust, generalizable embeddings.
- Multi-target generalized mixup equips models to natively solve multi-label and polyphonic tasks without bespoke handling at inference or annotation time.
This suggests that, for bioacoustic tasks and potentially for other ecological audio domains, well-calibrated supervised learning approaches with biologically relevant targets can deliver higher quality, more versatile embeddings than prevailing self-supervised/pseudo-task methods.
7. Applications and Impact in Bioacoustics
Perch 2.0’s primary use cases include large-scale species monitoring, biodiversity surveys, conservation work, and bioacoustic research requiring robust differentiation among thousands of taxa. The model’s generalization ensures applicability in both terrestrial and marine domains, supporting transfer or agile modeling in field deployments where available training data are scarce.
A plausible implication is that biodiversity monitoring organizations and research groups could utilize Perch 2.0 as a universal feature extractor for both classification and exploratory tasks, consolidating previous domain-specific tools into a more unified framework for cross-context audio analysis.
Perch 2.0 exemplifies a modern approach to bioacoustic modeling, harnessing large multi-taxa supervision, self-distillation via prototypes, auxiliary discriminative pretext tasks, and sophisticated data augmentation to achieve outstanding generalization and embedding quality. Its benchmark performance and surprising transfer capabilities redefine architectural and methodological best practices for automated acoustic analysis in ecology and conservation science (Merriënboer et al., 6 Aug 2025).