BEANS Benchmark for Bioacoustics
- BEANS Benchmark is a unified, multi-dataset evaluation suite that standardizes ML research in bioacoustics through reproducible tasks and metrics.
- It supports varied tasks including classification, detection, open-ended captioning, and zero-shot evaluations across numerous species.
- The benchmark fosters progress by providing open-source baselines, fixed splits, and fair comparisons between classical methods and modern foundation models.
BEANS (Benchmark of Animal Sounds) is a standardized, multi-dataset, multi-task benchmark suite for the systematic evaluation of machine learning models in computational bioacoustics. Designed to address the fragmentation of ML applications in biology, BEANS and its extension BEANS-Zero provide curated, public datasets with unified annotation and evaluation standards, supporting quantitative assessment of algorithms for animal sound classification and detection as well as auxiliary tasks such as counting and captioning. The BEANS series has catalyzed progress in generalization across taxa and tasks, and has served as the foundation for comparing both classical and foundation models in bioacoustic research (Hagiwara et al., 2022; Robinson et al., 2024; Tang et al., 3 Dec 2025; Schwinger et al., 2 Aug 2025).
1. Origin, Purpose, and Evolution
BEANS was first introduced to establish a common ground for quantifying ML progress in bioacoustics by standardizing datasets, tasks, and evaluation (Hagiwara et al., 2022). Its motivation stems from the fragmentation and lack of comparability in bioacoustic ML studies, which historically relied on diverse, small, and inconsistent datasets. The objectives of BEANS include:
- Providing a unified collection of public datasets covering diverse animal taxa (birds, mammals, anurans, insects).
- Defining reproducible supervised tasks with fixed splits and standardized evaluation.
- Supplying open-source baselines and metric computation tools.
- Facilitating out-of-the-box development and evaluation of generic, cross-species models.
The BEANS benchmark was subsequently extended with BEANS-Zero (Robinson et al., 2024), which advances the field by broadening task diversity (multi-task, zero-shot, captioning, and counting) and introducing challenging generalization protocols (zero-shot splits spanning hundreds of species).
2. Dataset Structure and Task Taxonomy
BEANS aggregates datasets encompassing a wide phylogenetic range and diverse recording conditions. The original benchmark comprised 12 datasets: 5 bioacoustic classification datasets, 5 event-detection datasets, and 2 auxiliary non-bioacoustic datasets, together covering birds, land and marine mammals, insects, and other taxa (Hagiwara et al., 2022; Schwinger et al., 2 Aug 2025).
Task Types:
- Classification (CLS): Single-label, multi-class assignment to audio clips.
- Detection (DET): Multi-label, multi-class event detection over sliding windows of long recordings (see the windowing sketch after this list).
- Novel tasks in BEANS-Zero: Bird lifestage, call-type (call vs. song), open-ended captioning, and zebra finch individual counting ("zf-indv"), each with unique label types and biological relevance (Robinson et al., 2024).
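The detection framing above reduces event detection to multi-label classification over fixed-length windows. Below is a minimal sketch of that reduction, assuming onset/offset/label annotation tuples and illustrative window and hop lengths (`win_s`, `hop_s`); the benchmark's own preprocessing defines the exact settings per dataset.

```python
import numpy as np

def make_detection_windows(events, duration_s, classes, win_s=1.0, hop_s=0.5):
    """Convert (onset_s, offset_s, label) annotations for one long recording
    into per-window multi-hot targets.

    The annotation tuple format and window/hop lengths are illustrative
    assumptions, not the benchmark's exact preprocessing.
    """
    cls_index = {c: i for i, c in enumerate(classes)}
    starts = np.arange(0.0, max(duration_s - win_s, 0.0) + 1e-9, hop_s)
    targets = np.zeros((len(starts), len(classes)), dtype=np.float32)
    for onset, offset, label in events:
        col = cls_index[label]
        for i, t0 in enumerate(starts):
            # Mark the class active in every window that overlaps the event.
            if onset < t0 + win_s and offset > t0:
                targets[i, col] = 1.0
    return starts, targets

# Example: two annotated calls in a 3-second recording.
starts, y = make_detection_windows(
    [(0.2, 0.6, "species_a"), (1.1, 2.4, "species_b")],
    duration_s=3.0,
    classes=["species_a", "species_b"],
)
print(starts)  # window onsets in seconds
print(y)       # multi-hot targets, one row per window
```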
BEANS-Zero expands task coverage and introduces species-level zero-shot splits:
| Dataset/Task | Samples | Label Kind | Zero-Shot? |
|---|---|---|---|
| esc50 (CLS) | 400 | 50 environmental sounds | No |
| watkins (CLS) | 339 | 31 marine mammal species | No |
| cbi (CLS) | 3,620 | 264 bird species | No |
| humbugdb (CLS) | 1,859 | 14 mosquito species | No |
| dcase (DET) | 13,688 | 20 bird/mammal species | No |
| enabirds (DET) | 4,543 | 34 bird labels | No |
| unseen-cmn (CLS) | 931 | 300 bird/other species | Yes |
| unseen-sci (CLS) | 931 | 300 bird/other species | Yes |
| lifestage (CLS) | 493 | 3 (juvenile, adult, other) | No |
| call-type (CLS) | 15,439 | 2 (call, song) | No |
| captioning (CAP) | 29,002 | Open-ended text | No |
| zf-indv (CLS) | 2,346 | 4 (number of zebra finches) | No |
Annotation sources include expert labels, automated metadata extraction, and LLM-generated captions (Robinson et al., 2024).
3. Evaluation Protocols and Metrics
BEANS imposes strictly defined training/validation/test splits, preventing information leakage and ensuring fair comparison (Hagiwara et al., 2022).
Classification:
Given $N$ test samples with ground-truth labels $y_i$ and predictions $\hat{y}_i$, classification is scored by top-1 accuracy:
$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i].$$
Detection:
Multi-label detection is evaluated using the macro-averaged F1-score or mean average precision (mAP). Macro-F1 averages the per-class F1 over the $C$ classes,
$$\text{macro-F1} = \frac{1}{C}\sum_{c=1}^{C}\frac{2\,P_c R_c}{P_c + R_c},$$
where $P_c$ and $R_c$ are the precision and recall of class $c$; mAP analogously averages per-class average precision.
Captioning:
Open-ended captions are evaluated using SPIDEr, the mean of the CIDEr and SPICE metrics (Robinson et al., 2024):
$$\text{SPIDEr} = \tfrac{1}{2}\left(\text{CIDEr} + \text{SPICE}\right).$$
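As an illustration, the classification and detection metrics above map directly onto scikit-learn; the toy label arrays below are made up for demonstration, and SPIDEr is only noted in a comment because it requires caption-evaluation tooling (e.g., the COCO caption metrics) rather than scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, average_precision_score

# --- Classification: top-1 accuracy over single-label predictions ---
y_true = np.array([0, 2, 1, 1])
y_pred = np.array([0, 2, 0, 1])
acc = accuracy_score(y_true, y_pred)

# --- Detection: macro-F1 and mAP over multi-hot window labels ---
# Rows are windows, columns are classes (toy values).
Y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])          # thresholded predictions for F1
Y_score = np.array([[0.9, 0.2, 0.4],
                    [0.1, 0.8, 0.3],
                    [0.7, 0.4, 0.2]])   # per-class scores for mAP
macro_f1 = f1_score(Y_true, Y_pred, average="macro", zero_division=0)
mAP = average_precision_score(Y_true, Y_score, average="macro")

# SPIDEr = (CIDEr + SPICE) / 2 is typically computed with the COCO
# caption-evaluation toolkit rather than scikit-learn.
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}  mAP={mAP:.3f}")
```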
For novel zero-shot splits (e.g., unseen-cmn/unseen-sci), all labels from hundreds of species are withheld from model training to test "out-of-the-box" generalization. Post-processing matches text outputs to labels using Levenshtein distance (Robinson et al., 2024).
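The label-matching step can be sketched as follows: a free-text model output is mapped to the allowed label whose string is closest in Levenshtein (edit) distance. The edit-distance helper and the toy label set are illustrative assumptions rather than the exact post-processing code of Robinson et al. (2024).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_to_label(model_output: str, labels: list[str]) -> str:
    """Map a free-text prediction onto the closest allowed label."""
    text = model_output.strip().lower()
    return min(labels, key=lambda lab: levenshtein(text, lab.lower()))

# Toy example with hypothetical species names.
labels = ["common chaffinch", "common chiffchaff", "eurasian wren"]
print(match_to_label("Common Chifchaff", labels))  # -> "common chiffchaff"
```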
4. Baseline Methods and State-of-the-Art Comparisons
Baseline evaluations in BEANS include non-neural models (e.g., SVM, XGBoost on MFCC features), standard CNNs (ResNet18/50/152), VGGish pretrained on AudioSet, and Transformer-based audio encoders (Hagiwara et al., 2022; Tang et al., 3 Dec 2025; Schwinger et al., 2 Aug 2025).
Advanced evaluations leverage foundation models such as BEATs, BirdMAE, and large audio-LLMs (e.g., NatureLM-audio (Robinson et al., 2024)).
Key Protocols:
- Linear Probing (LP): A frozen encoder's clip-level embedding is classified via a linear layer trained on BEANS (Schwinger et al., 2 Aug 2025).
- Attentive Probing (AP): Retains patch-wise features and applies a trainable self-attention head before classification, unlocking additional representational power in Transformer-based models.
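A minimal PyTorch sketch of the two probing heads follows: LP classifies a mean-pooled clip embedding from a frozen encoder, while AP learns a query that cross-attends over the frozen patch-wise features. Embedding dimensions, pooling choices, and the head architecture are illustrative assumptions; the probing code used in the cited review may differ.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """LP: mean-pool frozen patch embeddings, then a single linear layer."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (batch, num_patches, dim) from a frozen encoder
        return self.fc(patch_emb.mean(dim=1))

class AttentiveProbe(nn.Module):
    """AP: a learned query attends over frozen patch embeddings before the
    classifier, so patch-wise structure is not collapsed by mean pooling."""
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(patch_emb.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_emb, patch_emb)   # cross-attention pooling
        return self.fc(self.norm(pooled.squeeze(1)))

# Usage: only the probe's parameters are trained; the encoder stays frozen.
emb = torch.randn(4, 196, 768)            # e.g., patch embeddings from a frozen encoder
logits = AttentiveProbe(768, num_classes=50)(emb)
print(logits.shape)                        # torch.Size([4, 50])
```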
Performance Summary (top-1 accuracy in %; LP = linear probing, AP = attentive probing; WTK = Watkins, BAT = bats, CBI = Cornell Bird Identification, DOG = dogs, HUM = HumBugDB) (Schwinger et al., 2 Aug 2025):
| Model | WTK | BAT | CBI | DOG | HUM | Avg |
|---|---|---|---|---|---|---|
| ConvNeXt (LP) | 98.9 | 93.7 | 98.9 | 99.3 | 96.2 | 97.4 |
| Perch (LP) | 98.4 | 89.0 | 99.0 | 99.5 | 95.6 | 96.3 |
| BEATs (AP) | 99.5 | 96.9 | 98.9 | 99.8 | 97.8 | 98.6 |
| BirdMAE (AP) | 99.5 | 96.8 | 98.0 | 99.3 | 97.3 | 98.2 |
NatureLM-audio establishes new state-of-the-art results on BEANS-Zero, including zero-shot classification of unseen species (11.6% accuracy on unseen-cmn vs. 3.4% for CLAP), standard classification (0.755 accuracy on cbi), and open-ended captioning (SPIDEr = 0.494 vs. 0.009 for SALMONN) (Robinson et al., 2024).
5. Model Design Insights and Adaptation Practices
BEANS benchmarks have informed several model architecture and adaptation strategies:
- Domain-general foundation models (e.g., audio Transformers with large-scale self-supervised pretraining) can match or outperform taxon-specific models in cross-taxon settings when attentive head adaptation is employed (Schwinger et al., 2 Aug 2025).
- For pure bird identification (the CBI subset), bird-specialized models (ConvNeXt, Perch, BirdMAE) provide superior linear-probing performance.
- Lightweight adaptation (training only classifier heads, LP or AP) suffices to reach strong performance, minimizing the need for full fine-tuning.
- State-space sequence models (BioMamba) offer accuracy competitive with Transformers at lower VRAM scaling, which is beneficial for long bioacoustic recordings (Tang et al., 3 Dec 2025).
6. Limitations, Challenges, and Future Directions
Limitations identified through BEANS and BEANS-Zero evaluations include:
- Poor performance persists for underrepresented taxa and recording conditions (e.g., mosquitoes, rarely recorded underwater sounds) (Robinson et al., 2024).
- Existing metrics may be suboptimal for heavily imbalanced or open-set recognition scenarios.
- Detection tasks with overlapping or simultaneous calls remain challenging and may require integrated source separation (Hagiwara et al., 2022).
Proposed future directions:
- Expanding taxonomic and ecological coverage, especially toward amphibians, invertebrates, and nonsong vocalizations (Robinson et al., 2024).
- Adding context and behavior labels, and tackling complex soundscapes with multiple, overlapping vocalizations.
- Introducing multi-lingual and cross-modal (audio/text) evaluation.
- Incorporating generative tasks such as "denoise and reconstruct" for probing model synthesis capacities.
- Developing more fine-grained time-localization and unsupervised discovery protocols (Hagiwara et al., 2022; Tang et al., 3 Dec 2025).
7. Benchmark Ecosystem, Codebase, and Community
The BEANS codebase provides dataset management scripts, preprocessing pipelines, training/evaluation recipes, and a public leaderboard. Datasets are standardized (mono/WAV, fixed sample rates, consistent splits) and annotation scripts unify label formats. The benchmark is open-source and available at https://github.com/earthspecies/beans (Hagiwara et al., 2022).
Typical workflow involves data preparation, optional feature extraction, model training/evaluation, and leaderboard submission. The suite is designed for reproducibility, extensibility, and fair comparison, serving as an anchor for annual challenges and collaborative extension (Hagiwara et al., 2022).
References:
- (Hagiwara et al., 2022) BEANS: The Benchmark of Animal Sounds
- (Robinson et al., 2024) NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
- (Tang et al., 3 Dec 2025) State Space Models for Bioacoustics: A Comparative Evaluation with Transformers
- (Schwinger et al., 2 Aug 2025) Foundation Models for Bioacoustics: A Comparative Review