Online incremental learning for audio classification using a pretrained audio model (2508.20732v1)

Published 28 Aug 2025 in eess.AS

Abstract: Incremental learning aims to learn new tasks sequentially without forgetting the previously learned ones. Most of the existing incremental learning methods for audio focus on training the model from scratch on the initial task, and the same model is used to learn upcoming incremental tasks. The model is trained for several iterations to adapt to each new task, using some specific approaches to reduce the forgetting of old tasks. In this work, we propose a method for using generalizable audio embeddings produced by a pre-trained model to develop an online incremental learner that solves sequential audio classification tasks over time. Specifically, we inject a layer with a nonlinear activation function between the pre-trained model's audio embeddings and the classifier; this layer expands the dimensionality of the embeddings and effectively captures the distinct characteristics of sound classes. Our method adapts the model in a single forward pass (online) through the training samples of any task, with minimal forgetting of old tasks. We demonstrate the performance of the proposed method in two incremental learning setups: one class-incremental learning using ESC-50 and one domain-incremental learning of different cities from the TAU Urban Acoustic Scenes 2019 dataset; for both cases, the proposed approach outperforms other methods.

Summary

  • The paper introduces an online incremental learning framework for audio classification that leverages fixed pretrained embeddings with a nonlinear expansion layer to improve class separability.
  • It applies prototype-based classification by incrementally updating Gram and class prototype matrices without iterative fine-tuning or accessing previous task data.
  • Experimental evaluations on ESC-50 and TAU datasets demonstrate higher average accuracy and reduced forgetting compared to baseline methods.

Online Incremental Learning for Audio Classification Using a Pretrained Audio Model

Introduction

This paper addresses the challenge of online incremental learning for audio classification, focusing on both class-incremental learning (CIL) and domain-incremental learning (DIL) scenarios. The proposed method leverages generalizable audio embeddings from a fixed, pretrained model (PANNs CNN14) and introduces a nonlinear expansion layer to enhance the discriminability of these embeddings. Unlike prior approaches that require iterative fine-tuning or access to previous task data, this method adapts to new tasks in a single forward pass, minimizing catastrophic forgetting and computational overhead.

Figure 1: Overview of the proposed method. (a) Extracted features $\mathbf{f}$ from a frozen pre-trained model $\mathcal{P}$ for given input samples $\mathbf{x}$ are projected into a higher-dimensional space using frozen random weights $\mathbf{W}$, followed by a nonlinear activation function. (b) The Gram matrix ($\mathbf{G}$) and the class prototypes matrix ($\mathbf{K}$) are iteratively updated and used to compute $\mathbf{W}_\text{o}$ by matrix inversion at each incremental task for prediction of classes seen so far.

Methodology

Incremental Learning Setup

The incremental learning protocol is defined over a sequence of $T$ supervised audio classification tasks $\mathcal{D}_1, \ldots, \mathcal{D}_T$. In CIL, each task introduces disjoint class labels, while in DIL, the same set of classes is present across tasks but with domain shifts (e.g., different cities). The model is restricted from accessing previous task data, enforcing a realistic continual learning constraint.

Feature Expansion and Nonlinear Projection

For each input sample, embeddings $\mathbf{f}_{t,m} \in \mathbb{R}^H$ are extracted from the frozen pretrained model. These embeddings are projected into a higher-dimensional space of dimension $Q$ using a random matrix $\mathbf{W} \in \mathbb{R}^{H \times Q}$, followed by a nonlinear activation $\psi$ (ReLU):

$$\mathbf{v}_{t,m} = \psi(\mathbf{f}_{t,m}^\top \mathbf{W})$$

This expansion increases the representational capacity and improves class separability, as confirmed by ablation studies.
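The expansion step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the source only says that $\mathbf{W}$ is random and frozen, so drawing it from a standard normal distribution is an assumption, and the function names `make_projection` and `expand` are hypothetical. Toy dimensions are used so the example runs quickly.

```python
import numpy as np

def make_projection(h_dim, q_dim, seed=0):
    """Draw a frozen random projection W of shape (H, Q).

    Assumption: standard normal entries. The source does not state the
    distribution used for W.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((h_dim, q_dim))

def expand(f, W):
    """Apply v = ReLU(f W) to a batch of embeddings f of shape (M, H)."""
    return np.maximum(f @ W, 0.0)

# Toy sizes for illustration; the paper uses H = 2048 and Q = 8192.
W = make_projection(h_dim=16, q_dim=64)
f = np.random.default_rng(1).standard_normal((4, 16))
v = expand(f, W)
print(v.shape)  # (4, 64)
```

Because `W` is drawn once and never trained, the same matrix must be reused for every task and at test time.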

Prototype-Based Classification with Decorrelated Weights

The method maintains a Gram matrix $\mathbf{G}$ and a class prototypes matrix $\mathbf{K}$, updated incrementally:

$$\mathbf{G} = \sum_{t=1}^T \sum_{m=1}^{M_t} \mathbf{v}_{t,m} \otimes \mathbf{v}_{t,m}$$

$$\mathbf{K} = \sum_{t=1}^T \sum_{m=1}^{M_t} \mathbf{v}_{t,m} \otimes \mathbf{y}_{t,m}$$

To mitigate prototype correlation, the classification weight matrix is computed via regularized inversion:

$$\mathbf{W}_\text{o} = (\mathbf{G} + \lambda \mathbf{I})^{-1} \mathbf{K}$$

where $\lambda$ is selected per task to minimize validation error. Prediction is performed by multiplying the expanded test embedding by $\mathbf{W}_\text{o}$.
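The accumulation and solve steps above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: labels are taken to be integer class indices encoded as one-hot vectors $\mathbf{y}$, the total number of classes is preallocated, the predicted class is the argmax of the scores, and the candidate grid for $\lambda$ is made up. The class name `OnlinePrototypeClassifier` and the helper `select_lambda` are hypothetical.

```python
import numpy as np

class OnlinePrototypeClassifier:
    """Accumulates G and K task by task and solves for W_o on demand."""

    def __init__(self, q_dim, num_classes):
        self.G = np.zeros((q_dim, q_dim))        # Gram matrix, Q x Q
        self.K = np.zeros((q_dim, num_classes))  # class prototypes, Q x C
        self.W_o = None

    def update(self, V, labels):
        """One pass over a task: V is (M, Q), labels is (M,) of ints."""
        Y = np.eye(self.K.shape[1])[labels]      # one-hot targets, M x C
        self.G += V.T @ V                        # adds sum_m v_m (outer) v_m
        self.K += V.T @ Y                        # adds sum_m v_m (outer) y_m

    def solve(self, lam):
        """W_o = (G + lam I)^{-1} K, using a linear solve rather than
        forming the inverse explicitly."""
        A = self.G + lam * np.eye(self.G.shape[0])
        self.W_o = np.linalg.solve(A, self.K)
        return self.W_o

    def predict(self, V):
        # Assumption: the class with the highest score is predicted.
        return np.argmax(V @ self.W_o, axis=1)

def select_lambda(clf, V_val, y_val, candidates=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Keep the candidate with the lowest validation error.

    The candidate grid is an assumption, not taken from the paper.
    """
    errors = []
    for lam in candidates:
        clf.solve(lam)
        errors.append(np.mean(clf.predict(V_val) != y_val))
    best = candidates[int(np.argmin(errors))]
    clf.solve(best)
    return best

# Two toy tasks with disjoint classes (synthetic data, for illustration only).
rng = np.random.default_rng(0)
V1, y1 = np.abs(rng.standard_normal((20, 8))), rng.integers(0, 2, 20)
V2, y2 = np.abs(rng.standard_normal((20, 8))), rng.integers(2, 4, 20)

clf = OnlinePrototypeClassifier(q_dim=8, num_classes=4)
for V, y in [(V1, y1), (V2, y2)]:
    clf.update(V, y)                      # no access to earlier task data
    best = select_lambda(clf, V, y)       # stand-in for a real validation split
```

Since `G` and `K` are plain sums over samples, updating them task by task gives the same matrices as computing them once over all data, which is why no earlier samples need to be stored.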

Experimental Setup

Datasets

  • CIL: ESC-50 (50 classes, 5 tasks, each with 10 classes)
  • DIL: TAU Urban Acoustic Scenes 2019 (10 classes, 9 domains/cities, each as a task)
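The two task layouts can be illustrated with a short sketch that splits sample indices into tasks. The helper names are hypothetical, the toy label arrays are synthetic, and assigning classes to tasks in sorted label order is an assumption: the source does not say how classes were grouped.

```python
import numpy as np

def class_incremental_tasks(labels, num_tasks):
    """Split sample indices into tasks with disjoint class sets.

    Assumption: classes are assigned to tasks in sorted label order.
    """
    classes = np.unique(labels)
    groups = np.array_split(classes, num_tasks)
    return [np.flatnonzero(np.isin(labels, g)) for g in groups]

def domain_incremental_tasks(domains):
    """Split sample indices into one task per domain (e.g. per city)."""
    return [np.flatnonzero(domains == d) for d in np.unique(domains)]

# CIL toy layout: 50 classes, 4 samples each, split into 5 tasks.
labels = np.repeat(np.arange(50), 4)
cil_tasks = class_incremental_tasks(labels, num_tasks=5)
print(len(cil_tasks), len(np.unique(labels[cil_tasks[0]])))  # 5 10

# DIL toy layout: 9 domains, 10 samples each, one task per domain.
domains = np.tile(np.arange(9), 10)
dil_tasks = domain_incremental_tasks(domains)
print(len(dil_tasks))  # 9
```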

Baselines

  • Linear Probe (LP): Trained per task, no access to previous data.
  • Joint Linear Probe (JLP): Trained on all data seen so far (not strictly incremental).
  • Nearest Class Mean (NCM): Prototype-based, using averaged embeddings.
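For comparison, the NCM baseline can be sketched as a running per-class mean of the embeddings. This is a generic version of the technique, not the paper's exact baseline: the use of squared Euclidean distance is an assumption (a cosine-based variant is also common), and the class name `NearestClassMean` is hypothetical.

```python
import numpy as np

class NearestClassMean:
    """Keeps running per-class sums and counts; predicts the nearest mean."""

    def __init__(self, dim, num_classes):
        self.sums = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes)

    def update(self, F, labels):
        """Add a batch of embeddings F (M, dim) with integer labels (M,)."""
        np.add.at(self.sums, labels, F)     # unbuffered add handles repeated labels
        np.add.at(self.counts, labels, 1)

    def predict(self, F):
        seen = self.counts > 0              # ignore classes with no samples yet
        means = self.sums[seen] / self.counts[seen, None]
        # Assumption: squared Euclidean distance to each class mean.
        dist = ((F[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return np.flatnonzero(seen)[np.argmin(dist, axis=1)]

ncm = NearestClassMean(dim=2, num_classes=3)
ncm.update(np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]),
           np.array([0, 0, 1, 1]))
pred = ncm.predict(np.array([[0.1, 0.1], [4.9, 5.1]]))
print(pred)  # [0 1]
```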

Implementation Details

  • Embeddings: 2048-dim from PANNs CNN14
  • Projection dimension $Q$: 8192 (selected via ablation)
  • Nonlinearity: ReLU
  • Training: Single forward pass (online); baselines also evaluated in offline (multi-epoch) mode

Results

Performance Comparison

The proposed method achieves superior average accuracy and minimal forgetting in both CIL and DIL setups, outperforming all baselines, including the joint linear probe.

Figure 2: Average accuracy and forgetting of the methods after learning the task $\mathcal{D}_t$. Average accuracy (a) and forgetting (b) of the current $\mathcal{D}_t$ and previously seen tasks in a CIL setup; average accuracy (c) and forgetting (d) of the current $\mathcal{D}_t$ and previously seen tasks in a DIL setup.

  • CIL (ESC-50): Final average accuracy $AA_T$ of 93.4% (proposed) vs. 91.5% (JLP) and 88.5% (NCM); forgetting $FR_T$ of 2.5% (proposed).
  • DIL (TAU): Final average accuracy $AA_T$ of 61.4% (proposed) vs. 60.3% (JLP) and 47.7% (NCM); forgetting $FR_T$ of 2.0% (proposed).

The method demonstrates a strong stability-plasticity trade-off, maintaining high accuracy on previously learned tasks while adapting to new ones.
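The summary does not spell out how average accuracy and forgetting are computed, so the sketch below uses definitions that are common in the continual learning literature; the paper's exact formulas may differ. Here `acc[i, j]` is the accuracy on task `j` measured after learning task `i`, and the numbers in the matrix are invented for illustration.

```python
import numpy as np

def average_accuracy(acc, t):
    """Mean accuracy over tasks 0..t, measured after learning task t."""
    return float(np.mean(acc[t, : t + 1]))

def forgetting(acc, t):
    """Mean drop from the best earlier accuracy on each previous task.

    Assumption: a commonly used definition, not necessarily the paper's.
    """
    if t == 0:
        return 0.0
    drops = [acc[j:t, j].max() - acc[t, j] for j in range(t)]
    return float(np.mean(drops))

# Hypothetical accuracy matrix for 3 tasks; entries above the diagonal
# are unused because those tasks have not been learned yet.
acc = np.array([
    [0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0],
    [0.7, 0.8, 0.9],
])
aa = average_accuracy(acc, t=2)   # mean of 0.7, 0.8, 0.9
fr = forgetting(acc, t=2)         # mean of (0.9 - 0.7) and (0.9 - 0.8)
print(round(aa, 3), round(fr, 3))  # 0.8 0.15
```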

Ablation Studies

Ablations confirm the necessity of both projection to higher dimensions and nonlinear activation. Removing either degrades performance significantly.

Figure 3: Impact of the proposed method compared to alternatives without projection to $Q$ dimension using Eq. (1) and without ReLU. (a) Average accuracy of CIL setup; (b) average accuracy of DIL setup.

Effect of Projection Dimension

Increasing $Q$ improves accuracy but increases computational cost. $Q=8192$ is selected as a practical trade-off.

Figure 4: Performance of the proposed method using different $Q$ dimensional values. (a) Average accuracy of CIL setup; (b) average accuracy of DIL setup.

Implementation Considerations

  • Computational Requirements: The method requires updating $Q \times Q$ and $Q \times C$ matrices per task. For $Q=8192$ and $C=50$, these hold about 67.1M and 0.41M entries respectively, which is tractable compared to the 80.8M parameters of the CNN14 backbone.
  • Deployment: The approach is suitable for resource-constrained or privacy-sensitive scenarios, as it does not require storage of previous data or retraining of the backbone.
  • Scalability: The method is fully online, requiring only a single pass through new task data, making it amenable to streaming or real-time applications.
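The storage implied by these matrix sizes can be worked out directly. The sketch below counts entries for the frozen projection $\mathbf{W}$ and the two accumulated matrices; storing values as 4-byte floats is an assumption, and the helper name `state_size` is hypothetical.

```python
def state_size(h_dim, q_dim, num_classes, bytes_per_value=4):
    """Entry counts and sizes in MiB for W (H x Q), G (Q x Q), K (Q x C).

    Assumption: 4 bytes per value (float32); the source does not state
    the precision used.
    """
    entries = {
        "W": h_dim * q_dim,
        "G": q_dim * q_dim,
        "K": q_dim * num_classes,
    }
    mib = {name: n * bytes_per_value / 2**20 for name, n in entries.items()}
    return entries, mib

entries, mib = state_size(h_dim=2048, q_dim=8192, num_classes=50)
print(entries)  # {'W': 16777216, 'G': 67108864, 'K': 409600}
print(mib)      # {'W': 64.0, 'G': 256.0, 'K': 1.5625}
```

The Gram matrix dominates, and its size grows quadratically with $Q$, which is consistent with the trade-off noted for the projection dimension.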

Implications and Future Directions

The results demonstrate that fixed pretrained audio models, when augmented with nonlinear expansion and decorrelated prototype-based classification, can serve as effective continual learners in both class- and domain-incremental settings. This approach obviates the need for iterative fine-tuning or replay buffers, simplifying deployment in practical scenarios.

Future work may explore:

  • Application to other pretrained audio models (e.g., AST, PaSST, SSAST)
  • Extension to multi-label or event detection tasks
  • Optimization of projection dimension and regularization for large-scale deployments
  • Integration with parameter-efficient transfer learning techniques

Conclusion

This paper presents a unified, online incremental learning framework for audio classification that leverages fixed pretrained embeddings and nonlinear expansion to adapt to new tasks. The method outperforms the evaluated baselines in both CIL and DIL setups, with low forgetting (2.5% and 2.0%, respectively) and a single pass over each task's data, which makes it well suited to continual learning applications in audio.
