Online incremental learning for audio classification using a pretrained audio model (2508.20732v1)

Published 28 Aug 2025 in eess.AS

Abstract: Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same model is used to learn upcoming incremental tasks. The model is trained for several iterations to adapt to each new task, using some specific approaches to reduce the forgetting of old tasks. In this work, we propose a method for using generalizable audio embeddings produced by a pre-trained model to develop an online incremental learner that solves sequential audio classification tasks over time. Specifically, we inject a layer with a nonlinear activation function between the pre-trained model's audio embeddings and the classifier; this layer expands the dimensionality of the embeddings and effectively captures the distinct characteristics of sound classes. Our method adapts the model in a single forward pass (online) through the training samples of any task, with minimal forgetting of old tasks. We demonstrate the performance of the proposed method in two incremental learning setups: one class-incremental learning using ESC-50 and one domain-incremental learning of different cities from the TAU Urban Acoustic Scenes 2019 dataset; for both cases, the proposed approach outperforms other methods.

Summary

The paper introduces an online incremental learning framework for audio classification that leverages fixed pretrained embeddings with a nonlinear expansion layer to improve class separability.
It applies prototype-based classification by incrementally updating Gram and class prototype matrices without iterative fine-tuning or accessing previous task data.
Experimental evaluations on ESC-50 and TAU datasets demonstrate higher average accuracy and reduced forgetting compared to baseline methods.

Online Incremental Learning for Audio Classification Using a Pretrained Audio Model

Introduction

This paper addresses the challenge of online incremental learning for audio classification, focusing on both class-incremental learning (CIL) and domain-incremental learning (DIL) scenarios. The proposed method leverages generalizable audio embeddings from a fixed, pretrained model (PANNs CNN14) and introduces a nonlinear expansion layer to enhance the discriminability of these embeddings. Unlike prior approaches that require iterative fine-tuning or access to previous task data, this method adapts to new tasks in a single forward pass, minimizing catastrophic forgetting and computational overhead.

Figure 1: Overview of the proposed method. (a) Features $\mathbf{f}$ extracted by a frozen pre-trained model $\mathcal{P}$ for input samples $\mathbf{x}$ are projected into a higher-dimensional space using frozen random weights $\mathbf{W}$, followed by a nonlinear activation function. (b) The Gram matrix ($\mathbf{G}$) and class-prototype matrix ($\mathbf{K}$) are iteratively updated and used to compute $\mathbf{W}_\text{o}$ by matrix inversion at each incremental task, for prediction over the classes seen so far.

Methodology

Incremental Learning Setup

The incremental learning protocol is defined over a sequence of $T$ supervised audio classification tasks $\mathcal{D}_1, \ldots, \mathcal{D}_T$. In CIL, each task introduces a disjoint set of class labels, while in DIL the same set of classes is present across tasks but the domain shifts (e.g., different cities). When learning a task, the model cannot access data from previous tasks, which enforces a realistic continual learning constraint.

Feature Expansion and Nonlinear Projection

For each input sample, an embedding $\mathbf{f}_{t,m} \in \mathbb{R}^H$ is extracted from the frozen pretrained model. The embedding is projected into a higher-dimensional space of dimension $Q > H$ using a frozen random matrix $\mathbf{W} \in \mathbb{R}^{H \times Q}$, followed by a nonlinear activation $\psi$ (ReLU):

$\mathbf{v}_{t,m} = \psi(\mathbf{f}_{t,m}^\top \mathbf{W})$

This expansion increases the representational capacity and improves class separability, as confirmed by ablation studies.
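The expansion step can be illustrated with a minimal NumPy sketch. The function names `make_projection` and `expand` are hypothetical, and drawing $\mathbf{W}$ from a standard normal distribution is an assumption, since the summary only states that the weights are random and frozen. Toy dimensions are used instead of the paper's $H=2048$ and $Q=8192$.

```python
import numpy as np

def make_projection(h, q, seed=0):
    """Draw a frozen random projection matrix W of shape (H, Q).

    Assumption: entries are standard normal; the source only says "random".
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((h, q))

def expand(f, w):
    """Compute v = ReLU(f^T W) for a batch of embeddings f of shape (M, H)."""
    return np.maximum(f @ w, 0.0)

# Toy sizes for illustration only.
W = make_projection(h=16, q=64)
f = np.random.default_rng(1).standard_normal((4, 16))  # stand-in embeddings
v = expand(f, W)
print(v.shape)  # (4, 64)
```

Because `W` is generated once from a fixed seed and never trained, the same matrix can be regenerated at every task rather than stored.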

Prototype-Based Classification with Decorrelated Weights

The method maintains a Gram matrix $\mathbf{G}$ and a class prototypes matrix $\mathbf{K}$ , updated incrementally:

$\mathbf{G} = \sum_{t=1}^T \sum_{m=1}^{M_t} \mathbf{v}_{t,m} \otimes \mathbf{v}_{t,m}$

$\mathbf{K} = \sum_{t=1}^T \sum_{m=1}^{M_t} \mathbf{v}_{t,m} \otimes \mathbf{y}_{t,m}$

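Here $\otimes$ denotes the outer product and $\mathbf{y}_{t,m}$ is the label vector of sample $m$ in task $t$, so both matrices can be updated by adding the new task's terms without revisiting earlier data. To mitigate correlation between prototypes, the classification weight matrix is computed via regularized inversion: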

$\mathbf{W}_\text{o} = (\mathbf{G} + \lambda \mathbf{I})^{-1} \mathbf{K}$

where $\lambda$ is selected at each task to minimize validation error. Prediction is performed by multiplying the expanded test embedding $\mathbf{v}$ by $\mathbf{W}_\text{o}$ and taking the highest-scoring class among those seen so far.
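A minimal sketch of the accumulate-then-solve procedure follows. The class `OnlineRidgeClassifier` and the helper `select_lambda` are hypothetical names. Several details are assumptions rather than facts from the source: labels are one-hot encoded, the total number of classes is preallocated, the candidate grid for $\lambda$ is arbitrary, and the toy demo reuses training samples as the validation set.

```python
import numpy as np

class OnlineRidgeClassifier:
    """Accumulates G and K task by task, then solves for W_o."""

    def __init__(self, q, num_classes):
        # Assumption: the total class count is known up front; a true CIL
        # system would append columns to K as new classes appear.
        self.G = np.zeros((q, q))
        self.K = np.zeros((q, num_classes))
        self.W_o = np.zeros((q, num_classes))

    def update(self, v, labels):
        """Add one task's samples: v is (M, Q), labels is (M,) of ints."""
        y = np.eye(self.K.shape[1])[labels]  # one-hot targets (assumed)
        self.G += v.T @ v  # sum of outer products of v with itself
        self.K += v.T @ y  # sum of outer products of v with y

    def solve(self, lam):
        """W_o = (G + lam * I)^{-1} K, using a linear solve."""
        q = self.G.shape[0]
        self.W_o = np.linalg.solve(self.G + lam * np.eye(q), self.K)

    def predict(self, v):
        return np.argmax(v @ self.W_o, axis=1)

def select_lambda(clf, v_val, y_val, candidates):
    """Keep the candidate with the highest validation accuracy."""
    best_lam, best_acc = candidates[0], -1.0
    for lam in candidates:
        clf.solve(lam)
        acc = float(np.mean(clf.predict(v_val) == y_val))
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    clf.solve(best_lam)
    return best_lam

# Toy data: 4 classes in an 8-dimensional expanded space, 20 samples each.
rng = np.random.default_rng(0)
q, num_classes = 8, 4
labels = np.repeat(np.arange(num_classes), 20)
v_all = np.eye(q)[labels] + 0.05 * np.abs(rng.standard_normal((80, q)))

clf = OnlineRidgeClassifier(q, num_classes)
for task_classes in ([0, 1], [2, 3]):  # two class-incremental tasks
    mask = np.isin(labels, task_classes)
    clf.update(v_all[mask], labels[mask])  # single pass over the task
    lam = select_lambda(clf, v_all[mask], labels[mask], [1e-2, 1e-1, 1.0])

accuracy = float(np.mean(clf.predict(v_all) == labels))
```

Only `G` and `K` persist between tasks, so no earlier samples need to be stored. Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical design choice of this sketch, not something the source specifies.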

Experimental Setup

Datasets

CIL: ESC-50 (50 classes, 5 tasks, each with 10 classes)
DIL: TAU Urban Acoustic Scenes 2019 (10 classes, 9 domains/cities, each as a task)
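The CIL split described above (50 classes into 5 tasks of 10) can be sketched as follows. The helper `split_classes` is hypothetical, and assigning classes to tasks in consecutive index order is an assumption; the summary does not state the class ordering used.

```python
def split_classes(num_classes, num_tasks):
    """Partition class indices into equally sized, disjoint tasks.

    Assumption: classes are assigned in consecutive index order.
    """
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [
        list(range(t * per_task, (t + 1) * per_task))
        for t in range(num_tasks)
    ]

tasks = split_classes(num_classes=50, num_tasks=5)
print(len(tasks), len(tasks[0]))  # 5 10
```

In the DIL setup no such split is needed: each of the 9 cities forms one task and all 10 scene classes appear in every task.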

Baselines

Linear Probe (LP): Trained per task, no access to previous data.
Joint Linear Probe (JLP): Trained on all data seen so far (not strictly incremental).
Nearest Class Mean (NCM): Prototype-based, using averaged embeddings.
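For comparison, a nearest-class-mean baseline can be sketched in a few lines. This is an illustrative version, not the authors' implementation: the class name `NearestClassMean` is hypothetical, and the use of Euclidean distance is an assumption, since the summary only says the baseline uses averaged embeddings.

```python
import numpy as np

class NearestClassMean:
    """Keeps running per-class sums and counts of embeddings."""

    def __init__(self, dim, num_classes):
        self.sums = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes)

    def update(self, f, labels):
        """Accumulate embeddings f of shape (M, dim) with int labels (M,)."""
        np.add.at(self.sums, labels, f)
        np.add.at(self.counts, labels, 1)

    def predict(self, f):
        """Return the seen class whose mean is closest (Euclidean, assumed)."""
        seen = self.counts > 0
        class_ids = np.flatnonzero(seen)
        means = self.sums[seen] / self.counts[seen, None]
        dists = ((f[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        return class_ids[np.argmin(dists, axis=1)]

ncm = NearestClassMean(dim=2, num_classes=3)
ncm.update(np.array([[0.0, 0.0], [0.0, 2.0]]), np.array([0, 0]))      # task 1
ncm.update(np.array([[10.0, 10.0], [10.0, 12.0]]), np.array([1, 1]))  # task 2
pred = ncm.predict(np.array([[1.0, 1.0], [9.0, 9.0]]))
print(pred)  # [0 1]
```

Unlike the regularized inversion used by the proposed method, this baseline ignores correlations between feature dimensions, which is the limitation the decorrelated weights are meant to address.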

Implementation Details

Embeddings: 2048-dim from PANNs CNN14
Projection dimension $Q$: 8192 (selected via ablation)
Nonlinearity: ReLU
Training: Single forward pass (online); baselines also evaluated in offline (multi-epoch) mode

Results

Performance Comparison

The proposed method achieves superior average accuracy and minimal forgetting in both CIL and DIL setups, outperforming all baselines, including the joint linear probe.

Figure 2: Average accuracy and forgetting of the methods after learning task $\mathcal{D}_t$. Average accuracy (a) and forgetting (b) over the current task $\mathcal{D}_t$ and previously seen tasks in the CIL setup; average accuracy (c) and forgetting (d) over the current task $\mathcal{D}_t$ and previously seen tasks in the DIL setup.

CIL (ESC-50): Final average accuracy $AA_T$ of 93.4% (proposed) vs. 91.5% (JLP) and 88.5% (NCM); forgetting $FR_T$ of 2.5% (proposed).
DIL (TAU): Final average accuracy $AA_T$ of 61.4% (proposed) vs. 60.3% (JLP) and 47.7% (NCM); forgetting $FR_T$ of 2.0% (proposed).
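The summary does not define $AA_T$ and $FR_T$. The sketch below uses definitions that are common in the continual learning literature, and it is an assumption that the paper's definitions match: average accuracy is the mean accuracy over all tasks seen so far, and forgetting is the mean drop from each earlier task's best past accuracy to its current accuracy. The accuracy matrix is made up for illustration and is not taken from the paper.

```python
import numpy as np

def average_accuracy(acc, t):
    """Mean accuracy over tasks 0..t, measured after learning task t."""
    return float(np.mean(acc[t, : t + 1]))

def forgetting(acc, t):
    """Mean drop from best past accuracy to current accuracy on old tasks."""
    if t == 0:
        return 0.0  # nothing to forget after the first task
    drops = [np.max(acc[j:t, j]) - acc[t, j] for j in range(t)]
    return float(np.mean(drops))

# Hypothetical values: acc[t, j] is the accuracy on task j after learning
# task t. Entries with j > t are unused.
acc = np.array([
    [0.90, 0.00, 0.00],
    [0.80, 0.70, 0.00],
    [0.75, 0.65, 0.85],
])
aa_final = average_accuracy(acc, t=2)
fr_final = forgetting(acc, t=2)
```

With this toy matrix, `aa_final` is the mean of the last row, and `fr_final` averages the drops on the first two tasks.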

The method demonstrates a strong stability-plasticity trade-off, maintaining high accuracy on previously learned tasks while adapting to new ones.

Ablation Studies

Ablations confirm the necessity of both projection to higher dimensions and nonlinear activation. Removing either degrades performance significantly.

Figure 3: Impact of the proposed method compared to alternatives without the projection to $Q$ dimensions (the nonlinear projection equation in the Methodology section) and without ReLU. (a) Average accuracy of CIL setup; (b) average accuracy of DIL setup.

Effect of Projection Dimension

Increasing $Q$ improves accuracy but increases computational cost. $Q=8192$ is selected as a practical trade-off.

Figure 4: Performance of the proposed method for different values of the projection dimension $Q$. (a) Average accuracy of CIL setup; (b) average accuracy of DIL setup.

Implementation Considerations

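Computational Requirements: The method requires updating a $Q \times Q$ matrix ($\mathbf{G}$) and a $Q \times C$ matrix ($\mathbf{K}$) per task. For $Q=8192$ and $C=50$, these hold roughly 67.1M and 0.41M entries respectively, which remains tractable and is on the same order as the 80.8M parameters of the CNN14 backbone.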
Deployment: The approach is suitable for resource-constrained or privacy-sensitive scenarios, as it does not require storage of previous data or retraining of the backbone.
Scalability: The method is fully online, requiring only a single pass through new task data, making it amenable to streaming or real-time applications.
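The storage cost of the two matrices can be checked with simple arithmetic. The helper `matrix_footprint` is hypothetical, and the 4-byte (float32) entry size is an assumption, since the summary does not state the numeric precision used.

```python
def matrix_footprint(q, c, bytes_per_entry=4):
    """Entry counts and memory of G (Q x Q) and K (Q x C).

    Assumption: float32 storage (4 bytes per entry) by default.
    """
    g_entries = q * q
    k_entries = q * c
    return {
        "G_entries": g_entries,
        "K_entries": k_entries,
        "G_MiB": g_entries * bytes_per_entry / 2**20,
        "K_MiB": k_entries * bytes_per_entry / 2**20,
    }

stats = matrix_footprint(q=8192, c=50)
print(stats["G_entries"], stats["K_entries"])  # 67108864 409600
print(stats["G_MiB"], stats["K_MiB"])          # 256.0 1.5625
```

The Gram matrix dominates the footprint and grows quadratically in $Q$, which is one reason a larger projection dimension raises cost even though the number of classes has little effect.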

Implications and Future Directions

The results demonstrate that fixed pretrained audio models, when augmented with nonlinear expansion and decorrelated prototype-based classification, can serve as effective continual learners in both class- and domain-incremental settings. This approach obviates the need for iterative fine-tuning or replay buffers, simplifying deployment in practical scenarios.

Future work may explore:

Application to other pretrained audio models (e.g., AST, PaSST, SSAST)
Extension to multi-label or event detection tasks
Optimization of projection dimension and regularization for large-scale deployments
Integration with parameter-efficient transfer learning techniques

Conclusion

This paper presents a unified, online incremental learning framework for audio classification that leverages fixed pretrained embeddings and nonlinear expansion to adapt to new tasks. The method outperforms the compared baselines in both CIL and DIL setups, with low forgetting and a single pass over each task's data, which makes it a practical option for continual learning applications in audio.
