Spectrogram Token Skip-Gram for Bioacoustics
- Spectrogram Token Skip-Gram (STSG) is a lightweight framework that models bioacoustic sequences by discretizing spectrograms and training static skip-gram embeddings.
- It employs unsupervised clustering and PCA-driven tokenization to convert audio into efficient, discrete tokens while preserving key temporal and contextual features.
- Validated on the BirdCLEF+ 2025 challenge, STSG achieves rapid inference (~6 minutes for 700 minutes of audio), making it well suited to resource-constrained environments.
The Spectrogram Token Skip-Gram (STSG) is a lightweight computational framework for bioacoustic sequence modeling and classification. Developed for the BirdCLEF+ 2025 challenge, it is designed to address stringent computational constraints, notably a 90-minute CPU-only inference deadline for large-scale soundscape classification. By discretizing spectrogram features and leveraging static contextual embeddings trained through a skip-gram model, STSG enables efficient inference with minimal accuracy trade-offs, making it well-suited for resource-limited settings (2507.08236).
1. Methodological Overview
STSG reconceptualizes bioacoustic classification as a sequence modeling problem, replacing deep convolutional and transformer-based architectures with a multi-stage pipeline that distills audio into discrete spectrogram tokens, derives static contextual embeddings, and applies a linear classification head. The processing workflow comprises three main stages: (1) spectrogram tokenization via unsupervised clustering; (2) training of contextual token embeddings using the skip-gram algorithm; (3) aggregation and classification of embeddings within short temporal windows. This construction dramatically reduces computational overhead while retaining temporal and contextual information in the audio representation.
2. Spectrogram Tokenization
The STSG tokenization stage operates as follows:
- Mel-Spectrogram Extraction: Raw audio is converted to Mel-spectrograms using specified parameters—32,000 Hz sample rate, 8,000-sample window, 50% overlap, and 768 Mel bands. Each resulting "frame" corresponds to 0.125 seconds of audio.
- Normalization and Dimensionality Reduction: Spectrogram frames are normalized for consistent decibel scaling. Dimensionality is reduced via Principal Component Analysis (PCA), with 128 components typically retaining approximately 87% of the variance, thereby reducing noise and computational cost.
- Clustering: Using Faiss’s approximate K-means clustering (employing HNSW for efficient search), all PCA-reduced frames from the training corpus are quantized into clusters. Each cluster represents a unique integer token, forming a vocabulary (e.g., 16,000 clusters) that covers the input feature space.
This process yields a codebook mapping each 0.125-second spectrogram frame to a discrete token, forming a compact temporal sequence suitable for downstream modeling.
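The following sketch illustrates the tokenization stage under stated assumptions: librosa for Mel-spectrogram extraction, scikit-learn for PCA, and Faiss's standard K-means (the HNSW-accelerated search is omitted for simplicity). `train_paths` is a hypothetical list of training audio files; the released code may differ in these details.

```python
# Tokenization sketch: Mel-spectrogram -> PCA -> Faiss K-means token ids.
import numpy as np
import librosa
import faiss
from sklearn.decomposition import PCA

SR, N_FFT, HOP, N_MELS = 32_000, 8_000, 4_000, 768  # 50% overlap -> 0.125 s per frame
N_PCA, N_CLUSTERS = 128, 16_000

def audio_to_frames(path):
    """Return a (num_frames, 768) matrix of dB-scaled Mel frames."""
    y, _ = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max).T.astype(np.float32)

# Fit the PCA projection and the K-means codebook on pooled training frames.
train_frames = np.concatenate([audio_to_frames(p) for p in train_paths])
pca = PCA(n_components=N_PCA).fit(train_frames)            # ~87% variance retained
kmeans = faiss.Kmeans(N_PCA, N_CLUSTERS, niter=20)
kmeans.train(pca.transform(train_frames).astype(np.float32))

def tokenize(path):
    """Map each 0.125 s frame of a recording to an integer token id."""
    x = pca.transform(audio_to_frames(path)).astype(np.float32)
    _, ids = kmeans.index.search(x, 1)                      # nearest-centroid lookup
    return ids.ravel()
```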
3. Skip-Gram Embedding Training
Once tokenized, audio data is modeled as a sequence of discrete symbols. STSG utilizes the Word2Vec skip-gram objective with negative sampling to train static contextual embeddings for these tokens in an unsupervised fashion. Given a target token $t$ and a context token $c$, the training objective is expressed as:

$$\mathcal{L} = -\log \sigma\!\left(\mathbf{v}_c^{\top}\mathbf{v}_t\right) \;-\; \sum_{i=1}^{K} \mathbb{E}_{c_i \sim P_n}\!\left[\log \sigma\!\left(-\mathbf{v}_{c_i}^{\top}\mathbf{v}_t\right)\right],$$

where $K$ is the number of negative samples, $\sigma$ denotes the sigmoid function, $\mathbf{v}_t$ and $\mathbf{v}_c$ are the target and context embedding vectors, and each negative context token $c_i$ is sampled from a modified unigram distribution:

$$P_n(w) = \frac{f(w)^{\alpha}}{\sum_{w'} f(w')^{\alpha}},$$

with $f(w)$ as the token frequency and $\alpha$ as the smoothing exponent hyperparameter. The rationale is that tokens co-occurring in similar temporal contexts are embedded close together, while negative samples are pushed apart. Hyperparameters such as embedding dimensionality, context window size, and negative sample count are tuned to balance performance and efficiency.
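A minimal sketch of this stage, assuming gensim's Word2Vec implementation (the paper's training code may differ); `tokenize` and `train_paths` refer to the hypothetical objects from the tokenization sketch, and the hyperparameter values are illustrative rather than the tuned ones.

```python
# Skip-gram embedding training over spectrogram token sequences (gensim assumed).
from gensim.models import Word2Vec

# Each recording becomes one "sentence" of string-valued token ids.
sentences = [[str(t) for t in tokenize(path)] for path in train_paths]

w2v = Word2Vec(
    sentences,
    vector_size=256,   # embedding dimensionality (illustrative)
    window=5,          # temporal context window in tokens
    sg=1,              # skip-gram objective
    negative=10,       # K negative samples per positive pair
    ns_exponent=0.75,  # smoothing exponent alpha of the unigram distribution
    min_count=1,
    workers=4,
)

# Static embedding table: one fixed vector per spectrogram token.
embedding = {tok: w2v.wv[tok] for tok in w2v.wv.index_to_key}
```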
4. Classification Pipeline and Techniques
Following skip-gram training, each spectrogram token is mapped to a fixed embedding vector. For audio classification:
- Temporal Segmentation: Audio is partitioned into 5-second intervals, each containing 40 tokens (8 per second).
- Feature Aggregation: Embeddings within each window are averaged, producing a fixed-dimensional vector.
- Classification Head: The aggregate feature is passed to a simple linear model (possibly augmented with a hidden layer, ReLU, and batch normalization) to generate class probabilities.
- Student-Teacher Pretraining Variant: An optional scheme is proposed in which a 1D-CNN student model, taking STSG embeddings as input, is pretrained to mimic the soft target logits of a larger teacher network by minimizing KL divergence (a minimal sketch follows this list).
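A minimal sketch of this optional variant, assuming PyTorch; the student architecture, temperature, tensor shapes, and the precomputed `teacher_logits` are illustrative assumptions rather than the paper's exact setup.

```python
# Student-teacher pretraining sketch: a small 1D-CNN over STSG embedding
# sequences is trained to match soft teacher logits via KL divergence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNStudent(nn.Module):
    def __init__(self, emb_dim=256, num_classes=206):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                     # x: (batch, emb_dim, 40 tokens)
        return self.head(self.conv(x).squeeze(-1))

def distill_step(student, optimizer, x, teacher_logits, T=2.0):
    """One KL-divergence step against soft targets from the teacher network."""
    student_logp = F.log_softmax(student(x) / T, dim=-1)
    teacher_p = F.softmax(teacher_logits / T, dim=-1)
    loss = F.kl_div(student_logp, teacher_p, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```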
This approach achieves high throughput as all transformations, from token lookup to aggregation and linear prediction, are computationally trivial.
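A minimal sketch of the window-level pipeline, assuming scikit-learn's LogisticRegression as the linear head (the hidden-layer variant is omitted); `tokenize` and `embedding` refer to the earlier sketches, and `X_train`, `y_train` are hypothetical pooled features and per-window labels.

```python
# Window-level feature construction and a linear classification head.
import numpy as np
from sklearn.linear_model import LogisticRegression

TOKENS_PER_WINDOW = 40   # 5 s windows at 8 tokens per second

def window_features(token_ids, embedding):
    """Mean-pool token embeddings over consecutive 5-second windows."""
    feats = []
    for start in range(0, len(token_ids) - TOKENS_PER_WINDOW + 1, TOKENS_PER_WINDOW):
        window = token_ids[start:start + TOKENS_PER_WINDOW]
        feats.append(np.mean([embedding[str(t)] for t in window], axis=0))
    return np.stack(feats)

# X_train: (num_windows, emb_dim) pooled features; y_train: per-window labels.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```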
5. Computational and Inference Efficiency
A defining aspect of STSG is its extreme inference efficiency. Tokenization through Faiss-based lookup and static embedding retrieval executes with minimal computational burden. Empirical results demonstrate an average inference time of 0.5 seconds per file, totaling approximately 6 minutes for a 700-minute test set. This is substantially faster than mainstream models such as Perch (16 minutes on the same set) and considerably faster than many deep learning counterparts, satisfying the BirdCLEF+ 2025 competition's strict constraints.
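For illustration, the per-file inference path reduces to a handful of cheap operations; the sketch below reuses the hypothetical objects from the earlier sketches (`tokenize`, `embedding`, `window_features`, `clf`), and the soundscape file name is a placeholder.

```python
# End-to-end inference for one soundscape file: Faiss nearest-centroid lookup,
# static embedding retrieval, mean pooling, and a linear prediction.
import time

def predict_file(path):
    ids = tokenize(path)                        # PCA projection + centroid lookup
    feats = window_features(ids, embedding)     # one row per 5-second window
    return clf.predict_proba(feats)             # per-window class probabilities

start = time.perf_counter()
probs = predict_file("example_soundscape.ogg")  # hypothetical file name
print(f"inference took {time.perf_counter() - start:.2f} s")
```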
6. Empirical Performance and Comparative Metrics
The effectiveness of STSG was established through ROC-AUC and F1-score evaluations on the BirdCLEF+ 2025 dataset (206 species: birds, mammals, insects, amphibians):
| Model | Public ROC-AUC | Private ROC-AUC | Inference Time (700-min test set) |
|---|---|---|---|
| STSG | 0.559 | 0.520 | ~6 min |
| Perch (optimized) | 0.729 | 0.711 | ~16 min |
| BirdSetEfficientNetB1 | 0.810 | 0.778 | (longer) |
Although absolute accuracy lags behind larger models, the trade-off is justified by the substantial speed and resource savings, which are especially critical in constrained deployment scenarios.
7. Implementation, Code, and Impact
Implementation code for STSG, including all preprocessing, tokenization, skip-gram training, and classification steps, is publicly released at https://github.com/dsgt-arc/birdclef-2025. This release enables reproducibility and serves as a basis for extensions and benchmarking of lightweight audio sequence models in bioacoustics.
A plausible implication is that the STSG pipeline structure, while optimized for the BirdCLEF+ scenario, may be extensible to other domains where fast, resource-efficient audio classification is necessary and moderate accuracy trade-offs are acceptable. The modularity and transparency of the approach facilitate adaptation and integration with alternative embedding or tokenization strategies under similar constraints.