BirdSet Benchmark: Avian Audio & Vision
- BirdSet Benchmark is a comprehensive framework featuring large-scale, multi-modal datasets for avian bioacoustics and UAV-based tracking.
- It supports rigorous evaluation through multi-label audio classification across nearly 10,000 species and advanced vision tracking in varied ecological settings.
- The framework drives advances in conservation and machine learning by employing specialized metrics such as cmAP, AUROC, and SO-HOTA for robust performance assessment.
The BirdSet Benchmark defines a suite of large-scale, standardized datasets and evaluation protocols for bird-focused audio classification and, more recently, vision-based tracking tasks. Its primary objective is to enable rigorous, repeatable benchmarking of machine learning models under realistic acoustic and visual conditions reflecting the diversity, data imbalance, and domain shift inherent in avian monitoring. The BirdSet framework encompasses both multi-label audio classification (across nearly 10,000 species and hundreds of hours of annotated soundscapes) and UAV-based multi-object tracking of wild birds, supporting both supervised and self-supervised learning paradigms, few-shot adaptation, and interpretable modeling approaches. By providing diverse and richly labeled datasets—alongside specialized metrics such as cmAP, AUROC, and SO-HOTA—BirdSet catalyzes advances in biodiversity monitoring, conservation, and applied machine learning for environmental audio and video analysis.
1. Dataset Structure and Composition
BirdSet’s core audio benchmark unifies nearly 10,000 avian species, over 528,000 labeled focal recordings, and 448 hours of strongly labeled soundscape evaluation audio within a single protocol (Rauch et al., 15 Mar 2024). Its main training partition (“XCL”) is drawn from the Xeno-Canto collection, comprising 528,434 weakly annotated audio clips covering 9,734 species and around 7,200 hours. Eight passive soundscape test splits span diverse biomes (the Amazon Basin, the Andes, North America, Eurasia, and Pacific islands), each offering species ground truth at 5 s granularity with multi-label annotation.
The eight soundscape test splits:
| Split | #Species | #5s Clips | Hours |
|---|---|---|---|
| PER | 132 | 15,120 | 21.0 |
| NES | 89 | 24,480 | 34.0 |
| UHH | 25 | 36,637 | 50.9 |
| HSN | 21 | 12,000 | 16.7 |
| NBP | 51 | 563 | 0.8 |
| POW | 48 | 4,560 | 6.3 |
| SSW | 81 | 205,200 | 285.0 |
| SNE | 56 | 23,756 | 33.0 |
The benchmark also incorporates a background noise set (VOX) and three “focal” training splits (small, medium, full). Annotation for soundscape evaluation uses second-resolution event bounding boxes (onsets/offsets), which are converted for task consistency into fixed 5 s multi-label clips (Rauch et al., 15 Mar 2024), as sketched below.
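To make this conversion concrete, here is a minimal sketch of turning onset/offset event annotations into 5 s multi-label clip targets. The function name and the overlap rule (any temporal overlap marks the clip positive) are illustrative assumptions, not the benchmark's reference implementation.

```python
import numpy as np

def events_to_clip_labels(events, total_seconds, num_classes, clip_len=5.0):
    """Convert (onset_s, offset_s, class_idx) events into fixed-length
    multi-label clip targets; any overlap marks a clip positive (assumed rule)."""
    n_clips = int(np.ceil(total_seconds / clip_len))
    labels = np.zeros((n_clips, num_classes), dtype=np.int8)
    for onset, offset, cls in events:
        first = int(onset // clip_len)                            # first overlapped clip
        last = int((min(offset, total_seconds) - 1e-9) // clip_len)  # last overlapped clip
        labels[first:last + 1, cls] = 1
    return labels

# e.g., a song from 3.2 s to 7.8 s marks clips [0,5) and [5,10) positive
labels = events_to_clip_labels([(3.2, 7.8, 0)], total_seconds=60, num_classes=2)
```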
The vision extension to BirdSet, SMOT4SB, is designed specifically for UAV-based small multi-object tracking. It consists of 211 sequences and 108,192 annotated frames at native 1920×1080 px and 3840×2160 px resolutions, with 371,690 bounding boxes and 2,240 unique tracks. Environmental and behavioral diversity is explicitly encoded: birds are tracked across a variety of habitats (forests, agricultural, urban) and conditions (lighting, flocking, UAV and subject motion) (Kondo et al., 17 Jul 2025).
2. Task Definitions and Supported Evaluation Protocols
BirdSet’s canonical audio task is multi-label classification: given a 5 s clip $x$, predict the binary presence or absence $y_c \in \{0, 1\}$ of each species $c \in \{1, \dots, C\}$. Scenarios include end-to-end supervised learning, few-shot adaptation with as few as 1–10 labeled examples per class, and self-supervised pretraining (e.g., masked prediction or contrastive objectives) (Rauch et al., 17 Apr 2025).
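As a concrete instance of this setup, the sketch below shows a per-species sigmoid/BCE head in PyTorch. The class name, the 768-dimensional embedding, and the use of `BCEWithLogitsLoss` are illustrative choices, not the benchmark's prescribed architecture.

```python
import torch
import torch.nn as nn

NUM_SPECIES = 9734  # XCL class count

class MultiLabelHead(nn.Module):
    """One logit per species; a sigmoid at inference yields independent
    presence probabilities, matching the multi-label task definition."""
    def __init__(self, embed_dim: int, num_classes: int = NUM_SPECIES):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.fc(clip_embedding)  # raw logits

head = MultiLabelHead(embed_dim=768)        # 768 is an assumed backbone width
criterion = nn.BCEWithLogitsLoss()          # binary cross-entropy per species
embeddings = torch.randn(8, 768)            # stand-in for backbone features of 8 clips
targets = torch.zeros(8, NUM_SPECIES)       # multi-hot presence/absence labels
loss = criterion(head(embeddings), targets)
probs = torch.sigmoid(head(embeddings))     # per-species presence probabilities
```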
In vision, the task is Small Multi-Object Tracking (SMOT) for freely moving birds against moving UAV backgrounds, with unique persistent IDs and axis-aligned bounding box annotations. The benchmark mandates strict video-level data splits, use of standard COCO formatting, and explicit train/public-test/private-test partitions to support unseen generalization (Kondo et al., 17 Jul 2025).
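For orientation, a minimal COCO-style record for one UAV frame might look as follows. The core `images`/`annotations`/`categories` layout is standard COCO; the per-annotation `track_id` is the usual MOT-style extension, and its exact field name in SMOT4SB is an assumption here, as is the file path.

```python
coco_frame = {
    "images": [{"id": 1, "file_name": "seq_0001/000001.jpg",
                "width": 3840, "height": 2160}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 1,
                     "bbox": [1024.0, 512.0, 18.0, 12.0],  # axis-aligned [x, y, w, h]
                     "area": 216.0, "iscrowd": 0,
                     "track_id": 7}],  # persistent identity across frames (assumed field name)
    "categories": [{"id": 1, "name": "bird"}],
}
```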
BirdSet protocols emphasize domain shift: models must generalize from focal, high-SNR directional training data to omnidirectional, noisy, multi-source, multi-label test domains with often severely skewed species distributions (Schwinger et al., 11 Nov 2025).
3. Evaluation Metrics and Quantitative Benchmarks
Metrics are chosen for multi-label, long-tailed, and covariate-shifted settings:
- Classwise Mean Average Precision (cmAP): $\text{cmAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$, where $\text{AP}_c$ is the area under the precision–recall curve for class $c$ (Rauch et al., 15 Mar 2024); see the sketch after this list.
- Macro-Averaged AUROC: $\text{AUROC}_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} \text{AUROC}_c$, the per-class area under the ROC curve averaged over species.
- Macro-averaged F1 and threshold-free ranking metrics for robust performance assessment under extreme class imbalance.
- SO-HOTA (Small Object HOTA) for SMOT4SB replaces conventional HOTA’s IoU matching with the Dot Distance $\text{DotD} = \exp(-d / S)$, where $d$ is the distance between predicted and ground-truth box centers and $S$ is the root-mean-area of the dataset’s bounding boxes.
- SO-HOTA aggregates DotD-based detection accuracy (DetA) and association accuracy (AssA) over matching thresholds $\alpha \in \{0.05, 0.10, \dots, 0.95\}$: $\text{SO-HOTA} = \frac{1}{19} \sum_{\alpha} \sqrt{\text{DetA}_\alpha \cdot \text{AssA}_\alpha}$.
- Calibration metrics (Expected Calibration Error, Miscalibration Score) capture reliability under prediction uncertainty, crucial in rare-species and low-SNR conditions (Schwinger et al., 11 Nov 2025).
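A minimal sketch of these metrics follows, using scikit-learn for the two ranking scores; the DotD helper implements the exp(−d/S) form given above, whose exact normalization in SO-HOTA should be checked against the challenge code.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def cmap(y_true, y_score):
    """Classwise mean AP: per-class area under the PR curve, averaged.
    Classes with no positive labels must be filtered out beforehand."""
    return average_precision_score(y_true, y_score, average="macro")

def macro_auroc(y_true, y_score):
    """Per-class AUROC, averaged over species."""
    return roc_auc_score(y_true, y_score, average="macro")

def dotd(center_pred, center_gt, root_mean_area):
    """Dot Distance similarity exp(-d/S) for tiny-object matching."""
    d = np.linalg.norm(np.asarray(center_pred) - np.asarray(center_gt))
    return np.exp(-d / root_mean_area)

# e.g., two 5 s clips, three species
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])
print(cmap(y_true, y_score), macro_auroc(y_true, y_score))
```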
4. Baseline Models and State-of-the-Art Results
For audio, baseline and top-performing methods include:
- Deep CNN and Transformer architectures (EfficientNet-B3, ConvNeXt-BS, AST, ViT) (Rauch et al., 15 Mar 2024).
- Self-supervised models: Bird-MAE (domain-adapted Masked Autoencoder) and AudioProtoPNet (interpretable prototype learning model) (Rauch et al., 17 Apr 2025, Heinrich et al., 16 Apr 2024).
- Linear and attentive probing, as well as parameter-efficient prototypical probing, are supported for foundation-model adaptation (Schwinger et al., 2 Aug 2025); a minimal linear-probe sketch follows.
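As a point of reference, linear probing amounts to fitting a lightweight classifier on frozen embeddings. The multi-label version below uses scikit-learn's one-vs-rest logistic regression as an illustrative stand-in, not the protocol of the cited papers; embeddings and labels are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# frozen foundation-model embeddings (stand-ins) and multi-hot labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, (200, 5))

probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, y_train)            # only the linear head is trained
scores = probe.predict_proba(X_train)  # per-species presence scores
```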
Representative results:
- Bird-MAE ViT-L/16: establishes a new state of the art, e.g., mAP 55.3% on the POW split, outperforming Perch by absolute gains of up to +16.2% on some splits (Rauch et al., 17 Apr 2025).
- AudioProtoPNet: achieves AUROC 0.842 and cmAP 0.675 averaged over seven test splits, exceeding Perch by +3.3% (AUROC) and +6.5% (cmAP) (Heinrich et al., 16 Apr 2024).
- Calibration: Perch v2 and ConvNeXt-BS exhibit global underconfidence (ECE ≈ 1%), while AudioProtoPNet and Bird-MAE tend towards overconfidence (ECE up to 7.6%) (Schwinger et al., 11 Nov 2025).
- SMOT4SB baseline: YOLOX+OC-SORT reaches SO-HOTA 9.90; challenge winner achieves 50.59 (a 5.1× improvement) via YOLOv8-SOD+YOLOv8-SMOT tracker with patch-based “SliceTrain” augmentation (Kondo et al., 17 Jul 2025).
5. Domain Shift, Data Augmentation, and Best Practices
BirdSet explicitly models and quantifies domain shift: the generalization challenge from directional, single-class, high-quality training audio to complex, overlapping, multi-source, and low-SNR soundscape evaluation (Rauch et al., 17 Apr 2025, Schwinger et al., 11 Nov 2025). In object tracking, the shift is from typical planar, single-object, or urban tracking (e.g., VisDrone, UAVDT) towards naturalistic, 3D “motion entanglement” (free UAV plus bird movement, flocking) (Kondo et al., 17 Jul 2025).
Best practices established in the literature include:
- SliceTrain: patch-based sampling that emphasizes small/faint objects during vision training (Kondo et al., 17 Jul 2025); see the sketch after this list.
- Copy/Paste augmentation: synthesis of rare poses and distraction backgrounds using real or synthetic bird cutouts.
- Temporal and motion-aware trackers: integration of motion-direction EMA, affine compensation, and multi-detector ensembling (adaptive weighted fusion) for robust tracking (Kondo et al., 17 Jul 2025).
- Audio data augmentations: SpecAugment, MixUp, and augmentation in spectrogram or waveform space.
- Video-level splits and submission limits: to prevent overfitting and HARKing, maintaining valid generalization assessment (Kondo et al., 17 Jul 2025).
- Reporting protocols: evaluations should report cmAP and AUROC (audio) or SO-HOTA (tracking), along with calibration plots, for comprehensive assessment.
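The following sketch illustrates the patch-based sampling idea behind SliceTrain: a high-resolution frame is cut into overlapping tiles so that tiny birds occupy a larger fraction of each training crop. Tile size, overlap, the center-in-tile keep rule, and the absence of edge padding are illustrative simplifications, not the winning entry's exact recipe.

```python
import numpy as np

def slice_frame(image, boxes, tile=640, overlap=0.2):
    """Cut a frame into overlapping tiles and remap the boxes whose centers
    fall inside each tile (boxes given as (x, y, w, h) in frame coords)."""
    stride = int(tile * (1 - overlap))
    h, w = image.shape[:2]
    tiles = []
    for y0 in range(0, max(h - tile, 0) + 1, stride):
        for x0 in range(0, max(w - tile, 0) + 1, stride):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            kept = [(x - x0, y - y0, bw, bh)
                    for x, y, bw, bh in boxes
                    if x0 <= x + bw / 2 < x0 + tile
                    and y0 <= y + bh / 2 < y0 + tile]
            tiles.append((crop, kept))
    return tiles

# e.g., a 2160x3840 frame with one 18x12 px bird yields a 3x7 grid of tiles,
# a few of which contain the (shifted) box
frame = np.zeros((2160, 3840, 3), dtype=np.uint8)
tiles = slice_frame(frame, [(1024.0, 512.0, 18.0, 12.0)])
```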
6. Interpretability and Model Adaptation
BirdSet has catalyzed the development of interpretable machine learning methods tailored for avian monitoring. Notably:
- AudioProtoPNet provides prototype traceability, allowing event-level and class-level human inspection via back-projection of prototype activations onto spectrograms. Local and global explanations facilitate scientific diagnosis and knowledge transfer (Heinrich et al., 16 Apr 2024).
- Bird-MAE prototypical probing: adapts masked autoencoder representations to few-shot and low-resource tasks, allowing parameter-efficient downstream adaptation within 3–5% mAP of full fine-tuning, and significantly exceeding linear/MLP probe performance (up to +37% mAP on select tasks) (Rauch et al., 17 Apr 2025).
- Calibration analysis: class-level and split-level reliability assessments identify under- and overconfidence patterns, enabling domain-specific recalibration via, for example, Platt or temperature scaling with minimal labeled sets (Schwinger et al., 11 Nov 2025); a minimal temperature-scaling sketch follows.
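As a concrete example of such recalibration, here is a minimal temperature-scaling sketch for a multi-label (sigmoid) classifier. The grid search stands in for the gradient-based NLL fit typically used, and the held-out calibration set of logits and targets is assumed.

```python
import numpy as np

def fit_temperature(logits, targets, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T minimizing binary NLL on a held-out set;
    T > 1 softens overconfident models, T < 1 sharpens underconfident ones."""
    def nll(t):
        p = 1.0 / (1.0 + np.exp(-logits / t))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return min(grid, key=nll)

# calibrated probabilities are then sigmoid(logits / T)
```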
7. Position Within the Bioacoustic Benchmarking Landscape
BirdSet extends coverage far beyond AudioSet by providing a nearly 18.5× class increase (9,734 bird species vs. 527 classes in AudioSet) and a rich diversity of real-world, strongly-labeled evaluation deployments (448 h in eight splits vs. 60 h in AudioSet-strong sets) (Rauch et al., 15 Mar 2024). The benchmark’s unified labeling schema, focus on both strong- and weak-label settings, and its extensibility with vision-based tracking (SMOT4SB) distinguish it from earlier single-region or single-modality datasets such as LaTOT, VISO, and VisDrone (Kondo et al., 17 Jul 2025).
By supporting robust generalization evaluation, self-supervised learning, data efficiency studies, prototype-based interpretability, and precise calibration assessment, BirdSet sets the standard for algorithmic development and comparative research in avian monitoring and related bioacoustic fields (Schwinger et al., 2 Aug 2025, Schwinger et al., 11 Nov 2025, Heinrich et al., 16 Apr 2024, Kondo et al., 17 Jul 2025).
References:
- "BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics" (Rauch et al., 15 Mar 2024)
- "MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results" (Kondo et al., 17 Jul 2025)
- "Can Masked Autoencoders Also Listen to Birds?" (Rauch et al., 17 Apr 2025)
- "AudioProtoPNet: An interpretable deep learning model for bird sound classification" (Heinrich et al., 16 Apr 2024)
- "Foundation Models for Bioacoustics -- a Comparative Review" (Schwinger et al., 2 Aug 2025)
- "Uncertainty Calibration of Multi-Label Bird Sound Classifiers" (Schwinger et al., 11 Nov 2025)