Perch 2.0: Advanced Multi-taxa Bioacoustics
- The paper introduces a two-stage supervised training process with self-distillation and prototype-based classification to enhance fine-grained species differentiation.
- Perch 2.0 is defined by its innovative multi-label generalized mixup augmentation and auxiliary source-prediction head, enabling robust transfer across diverse animal taxa.
- It demonstrates outstanding performance on benchmarks like BirdSet and BEANS, achieving ROC-AUC >0.90 and superior transfer results even with minimal marine data.
Perch 2.0 is a state-of-the-art bioacoustic model engineered for large-scale, fine-grained species classification and robust transfer learning across multiple animal taxa. It was trained using a supervised regime with innovations in self-distillation and prototype-based classification. Notably, Perch 2.0 achieves superior performance on established bioacoustic benchmarks and demonstrates remarkable generalization properties, even excelling in marine transfer learning tasks despite extremely limited marine training data. This model reflects an evolutionary shift from avian-centric to broad multi-taxa bioacoustic representation learning, embodying new methodological principles in pretext task selection, data augmentation, and embedding quality.
1. Training Paradigm and Architecture
Perch 2.0 follows a two-stage supervised training process tailored for bioacoustic audio. In the initial phase, the model is trained as a standard species classifier using labeled audio from thousands of vocalizing species. In the subsequent phase, self-distillation is introduced: the predictions from a prototype-learning classifier head (inspired by ProtoPNet concepts) serve as soft targets guiding the main classifier via a stop-gradient separation. This hybrid objective encourages the main classifier to incorporate both precise and refined decision boundaries, leveraging fine-grained differences between often confusable species.
The prototype-learning classifier consists of four trainable prototypes per class acting on the spatial backbone embeddings. These prototypes facilitate both discriminative learning (an orthogonality loss penalizes overlap among prototype vectors, pushing them apart) and interpretable activation patterns for each class. A critical stop-gradient ensures that the prototype path influences the main classifier solely through its predictions, not direct parameter updates, thus shaping the representation space without destabilizing the core backbone.
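The prototype scoring and orthogonality penalty can be sketched as follows, using NumPy with toy shapes; the real model learns these parameters end-to-end, and all dimensions here are illustrative:

```python
import numpy as np

def prototype_logits(emb, prototypes):
    """Class logits as the max similarity over each class's prototypes.

    emb:        (d,) pooled backbone embedding
    prototypes: (n_classes, n_protos, d); Perch 2.0 uses 4 per class.
    """
    sims = prototypes @ emb          # (n_classes, n_protos) similarities
    return sims.max(axis=1)          # best-matching prototype per class

def orthogonality_loss(prototypes):
    """Penalize pairwise overlap among each class's prototype vectors."""
    loss = 0.0
    for P in prototypes:                                   # (n_protos, d)
        Pn = P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-normalize
        G = Pn @ Pn.T                                      # cosine Gram matrix
        loss += np.sum((G - np.eye(len(P))) ** 2)          # off-diagonal mass
    return loss / len(prototypes)
```

Perfectly orthogonal prototypes drive the penalty to zero; any overlap contributes quadratically.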
Perch 2.0 also utilizes a novel auxiliary source-prediction head. In this auxiliary task, given a fixed five-second audio window, the model predicts which of roughly 1.5 million source recordings the window was drawn from, using a linear head with a low-rank (rank 512) weight matrix to manage computational demands. This task enforces embedding sensitivity to subtle, recording-level signatures, effectively enriching the feature space.
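The low-rank trick can be sketched in NumPy; the dimensions below are toy-sized stand-ins (the actual head uses rank 512 over ~1.5 million sources, and the embedding width depends on the backbone):

```python
import numpy as np

def low_rank_logits(emb, A, B):
    """Source-ID logits through a rank-r factorized linear head.

    Replaces a dense (d, n_sources) weight matrix W with W = A @ B,
    A: (d, r), B: (r, n_sources), storing d*r + r*n_sources
    parameters instead of d*n_sources.
    """
    return (emb @ A) @ B             # never materializes the full W

rng = np.random.default_rng(0)
d, r, n_sources = 64, 8, 1000        # toy sizes; the paper uses r = 512
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, n_sources))
emb = rng.normal(size=(d,))
logits = low_rank_logits(emb, A, B)  # one logit per candidate recording
```

Computing `(emb @ A) @ B` rather than `emb @ (A @ B)` keeps both memory and compute proportional to the rank rather than the full output width.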
2. Data Set Expansion and Input Construction
Where earlier Perch releases were trained only on avian vocalizations, Perch 2.0 is constructed on a large multi-taxa corpus, combining audio from Xeno-Canto, iNaturalist, Tierstimmenarchiv, and FSD50K. The resulting data pool spans over 1.5 million recordings and more than 14,000 species or general sound classes, including birds, mammals, amphibians, insects, and environmental/non-biological sounds.
Data augmentation is implemented using a generalized mixup method: instead of standard two-component mixup, the augmentation samples the number of source windows $N$ from a Beta-binomial distribution, draws mixture weights $w_1, \dots, w_N$ from a symmetric Dirichlet, and produces the composite signal $x = \sum_{i=1}^{N} w_i x_i$ from source windows $x_1, \dots, x_N$,
with the target class label a multi-hot vector. This augmentation matches real-world acoustic scenarios where multiple species/vocalizations often overlap, and it regularizes the model for robust multi-event discrimination.
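A minimal sketch of this augmentation, with distribution parameters chosen for illustration rather than taken from the paper:

```python
import numpy as np

def generalized_mixup(windows, labels, n_classes, rng):
    """Mix N audio windows; return the composite and its soft multi-hot target.

    windows: (M, T) pool of candidate windows; labels: one class id per window.
    N is Beta-binomial (here p ~ Beta(2, 2), N = 1 + Binomial(4, p)) and the
    mixture weights come from a symmetric Dirichlet; parameters illustrative.
    """
    p = rng.beta(2.0, 2.0)
    n = 1 + rng.binomial(4, p)                     # number of sources, 1..5
    idx = rng.choice(len(windows), size=n, replace=False)
    w = rng.dirichlet(np.ones(n))                  # symmetric Dirichlet weights
    mixed = np.tensordot(w, windows[idx], axes=1)  # sum_i w_i * x_i
    target = np.zeros(n_classes)
    target[labels[idx]] = 1.0 / n                  # 1/k per present class
    return mixed, target
```

Because each present class receives mass $1/n$, the target vector still sums to one when the mixed classes are distinct.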
3. Training Objectives and Loss Design
The composite loss for Perch 2.0 integrates four main components:
- Cross-entropy loss on the species classifier, with multi-hot targets when multiple classes are present; each of the $k$ present classes receives a target entry of $1/k$.
- Self-distillation loss: cross-entropy between the softmax outputs of the linear classifier and the prototype-learning classifier, backpropagated only via the former.
- Prototype orthogonality loss: a penalty on the pairwise dot products among each class’s prototype vectors, encouraging the prototypes to remain diverse and well separated.
- Source-prediction loss: cross-entropy for identifying the original recording, via a linear projection with low-rank factorization.
These objectives are optimized jointly during training. Critically, no purely self-supervised pretext tasks are used for pre-training; the authors explicitly note that self-supervised paradigms underperform strong supervised pre-training in this application.
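The combination of objectives can be written schematically as below; the loss weights are illustrative, the prototype orthogonality term is omitted for brevity, and in a real framework the prototype softmax would sit behind a stop-gradient so that only the linear and source heads receive gradients:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def composite_loss(lin_logits, proto_logits, src_logits,
                   class_target, src_index,
                   w_distill=1.0, w_src=0.1):
    """Weighted sum of the main Perch 2.0 objectives (weights illustrative)."""
    p_lin = softmax(lin_logits)
    species_ce = -np.sum(class_target * np.log(p_lin + 1e-9))
    p_proto = softmax(proto_logits)               # soft targets, held fixed
    distill_ce = -np.sum(p_proto * np.log(p_lin + 1e-9))
    p_src = softmax(src_logits)
    source_ce = -np.log(p_src[src_index] + 1e-9)  # source-recording ID
    return species_ce + w_distill * distill_ce + w_src * source_ce
```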
4. Evaluation: Benchmarks and Transfer Performance
Perch 2.0 is evaluated on BirdSet and BEANS benchmarks:
BirdSet
BirdSet aggregates fully annotated datasets, spanning multiple regions and taxa. Perch 2.0 achieves ROC-AUC values exceeding 0.9 on most sub-datasets and class-mean AP scores that surpass all prior versions, with strong consistency across test regions.
BEANS
BEANS comprises transfer and detection tasks, including linear-probe evaluation (classification performance on new tasks using frozen embeddings) and prototype-based probing. Perch 2.0 achieves record mean accuracy and macro-averaged AP, indicating readily linearly separable and clusterable audio embeddings.
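Linear probing itself is simple: fit a lightweight classifier on frozen embeddings and measure held-out accuracy, with the backbone untouched. A sketch using a ridge-regression probe as a stand-in for the usual logistic probe (the regularizer `lam` is illustrative):

```python
import numpy as np

def linear_probe_accuracy(train_emb, train_y, test_emb, test_y,
                          n_classes, lam=1e-2):
    """Fit a ridge probe on frozen embeddings; return held-out accuracy."""
    Y = np.eye(n_classes)[train_y]                 # one-hot targets
    d = train_emb.shape[1]
    W = np.linalg.solve(train_emb.T @ train_emb + lam * np.eye(d),
                        train_emb.T @ Y)           # closed-form ridge solution
    pred = (test_emb @ W).argmax(axis=1)
    return (pred == test_y).mean()
```

If the embeddings are linearly separable, as the BEANS results indicate for Perch 2.0, even this minimal probe classifies accurately.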
Transfer Learning: Marine Audio
On marine audio datasets (NOAA PIPAN, ReefSet, DCLDE 2026), Perch 2.0’s embeddings are superior even to models trained specifically on marine mammals or fish. For example, with as few as 16 labeled marine examples per class, linear probes yield ROC-AUCs up to 0.977 for certain DCLDE tasks, outperforming marine-specialized architectures. This outcome is especially noteworthy considering almost no marine data was used during training.
Table: Performance Summary Across Benchmarks
| Benchmark | Evaluation Task | Perch 2.0 Metric (Best) | Prior Best |
|---|---|---|---|
| BirdSet | ROC-AUC (multiple) | >0.90 | <0.90 |
| BirdSet | cmAP | State-of-the-art | Lower |
| BEANS | Lin. probe accuracy | Highest | Lower |
| Marine Transfer | ROC-AUC, AP | Up to 0.977 | Lower |
Numerical details and per-dataset breakdowns are available in the full results tables in the original manuscript.
5. Model Compactness and Practicality
Perch 2.0 is constructed using an EfficientNet-B3 backbone, resulting in a model size of 12M parameters, making it deployable on consumer-grade hardware. The architecture enables fast inference and flexible embedding utilities (clustering, nearest-neighbor search, or linear probing) without fine-tuning. The embedding’s linear separability is consistently verified empirically via transfer and few-shot tasks.
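These embedding utilities need no fine-tuning; for instance, a brute-force cosine nearest-neighbor lookup over a bank of stored embeddings can be written as:

```python
import numpy as np

def nearest_neighbors(query, bank, k=5):
    """Return indices of the k most cosine-similar rows of the bank.

    query: (d,) embedding of a new recording window
    bank:  (n, d) embeddings of previously indexed windows
    """
    qn = query / np.linalg.norm(query)
    bn = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = bn @ qn                    # cosine similarity to every row
    return np.argsort(-sims)[:k]      # top-k, most similar first
```

At larger scales an approximate index would replace the brute-force scan, but the embedding-side workflow is unchanged.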
6. Broader Methodological Implications
The design choices in Perch 2.0 articulate several broader lessons for bioacoustic representation learning:
- Fine-grained supervised classification is empirically a highly robust pre-training target for audio models in dynamic and heterogeneous environments.
- Prototyped soft-label self-distillation effectively refines class boundaries in fine-grained label spaces.
- Auxiliary source-prediction with large output spaces serves as a useful proxy for generating robust, generalizable embeddings.
- Multi-target generalized mixup equips models to natively solve multi-label and polyphonic tasks without bespoke handling at inference or annotation time.
This suggests that, for bioacoustic tasks and potentially for other ecological audio domains, well-calibrated supervised learning approaches with biologically relevant targets can deliver higher quality, more versatile embeddings than prevailing self-supervised/pseudo-task methods.
7. Applications and Impact in Bioacoustics
Perch 2.0’s primary use cases include large-scale species monitoring, biodiversity surveys, conservation work, and bioacoustic research requiring robust differentiation among thousands of taxa. The model’s generalization ensures applicability in both terrestrial and marine domains, supporting transfer or agile modeling in field deployments where available training data are scarce.
A plausible implication is that biodiversity monitoring organizations and research groups could utilize Perch 2.0 as a universal feature extractor for both classification and exploratory tasks, consolidating previous domain-specific tools into a more unified framework for cross-context audio analysis.
Perch 2.0 exemplifies a modern approach to bioacoustic modeling, harnessing large multi-taxa supervision, self-distillation via prototypes, auxiliary discriminative pretext tasks, and sophisticated data augmentation to achieve outstanding generalization and embedding quality. Its benchmark performance and surprising transfer capabilities redefine architectural and methodological best practices for automated acoustic analysis in ecology and conservation science (Merriënboer et al., 6 Aug 2025).