Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming (2211.01317v3)

Published 2 Nov 2022 in cs.SD, cs.AI, cs.LG, cs.NE, and eess.AS

Abstract: Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a pre-trained model from a source domain to a target domain by modifying the input of a frozen pre-trained model. In addition to the known, input-independent, reprogramming method, we propose an advanced reprogramming paradigm: Input-dependent NMR, to increase adaptability to complex input data such as musical audio. Experimental results suggest that a neural model pre-trained on large-scale datasets can successfully perform music genre classification by using this reprogramming method. The two proposed Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a small genre classification dataset.

Summary

  • The paper introduces Neural Model Reprogramming to repurpose frozen pre-trained models for low-resource music genre classification.
  • It compares three methods—II-NMR, ID-NMR, and IDS-NMR—showing that IDS-NMR with AST delivers competitive results with far fewer parameters.
  • The approach leverages cross-modal transfer to reduce computational costs and improve performance in scenarios with limited labeled music data.

This paper (2211.01317) addresses the challenge of music genre classification in low-resource settings by proposing novel approaches based on Neural Model Reprogramming (NMR). Traditional methods often struggle due to the scarcity of large, labeled music datasets and the high computational cost of fine-tuning large pre-trained models from other domains. NMR offers an alternative by leveraging a pre-trained model without modifying its weights, instead learning a small, task-specific transformation applied to the input data or intermediate features.

The core idea of NMR, as adapted in this work, is to re-purpose a frozen pre-trained model (trained on a source task like speech recognition or general audio classification) for a target task (music genre classification). This is achieved by training a small, parameter-efficient module that transforms the input data before feeding it into the pre-trained model. The output of the pre-trained model is then mapped to the target labels via a fixed label mapping layer. Only the parameters of the input transformation module are trained.
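
To make this setup concrete, here is a minimal PyTorch sketch of the parameter split (an illustration, not the paper's code): `backbone` stands in for the frozen pre-trained model and `transform` for the small trainable module.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pre-trained classifier (e.g. AST with 527 AudioSet classes).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(128 * 100, 527))
for p in backbone.parameters():
    p.requires_grad = False          # source-domain weights are never updated
backbone.eval()

# Small trainable module; only its parameters go to the optimizer.
transform = nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(transform.parameters(), lr=1e-3)

x = torch.randn(8, 1, 128, 100)      # a batch of Mel-spectrogram features
logits = backbone(transform(x))      # gradients flow only into `transform`
print(sum(p.numel() for p in transform.parameters()))  # a handful of weights
```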

The paper explores three NMR methods:

  1. Input-independent NMR (II-NMR): This is a baseline based on previous work. It involves adding a universal, trainable parameter $\theta$ directly to the input waveform $x$: $x' = x + \theta$. This parameter $\theta$ is learned during training and applied identically to all input samples. The paper finds this method performs poorly for complex music signals.
  2. Input-dependent NMR (ID-NMR): Recognizing the complexity of music signals compared to simpler audio like speech commands, the authors propose making the transformation dependent on the input sample and applying it to the input features (like Mel-spectrograms) rather than the raw waveform. This is implemented using a trainable, non-linear transformation function $T$, composed of convolutional layers, applied to the input feature map $X$: $X' = T(X)$. This function learns to add sample-specific "noise" or modifications to the features.
  3. Input-dependent NMR with Skip Connection (IDS-NMR): Building on ID-NMR, this method applies the trainable transformation not to the input features but to the output $v$ of a middle layer of the pre-trained model; the transformed feature $v' = T(v)$ is then added directly at the classifier layer to influence the final prediction. The motivation is to leverage higher-level features from the pre-trained model's intermediate layers and to speed up training, since backpropagation to the transformation module bypasses the pre-trained layers below the skip connection. (A sketch of all three variants follows this list.)
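
The three variants can be sketched as small PyTorch modules. The layer sizes, channel counts, and exact placement of the skip connection below are illustrative assumptions, not the paper's reported architecture:

```python
import torch
import torch.nn as nn

class IINMR(nn.Module):
    """II-NMR: one universal perturbation theta added to the waveform, x' = x + theta."""
    def __init__(self, waveform_len: int):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(waveform_len))

    def forward(self, x):                 # x: (batch, waveform_len)
        return x + self.theta             # same offset for every sample

class IDNMR(nn.Module):
    """ID-NMR: input-dependent, non-linear transform of the feature map, X' = T(X)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.T = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, X):                 # X: (batch, 1, mel_bins, frames)
        return self.T(X)                  # sample-specific modification

class IDSNMR(nn.Module):
    """IDS-NMR: transform an intermediate feature v; v' is added at the classifier."""
    def __init__(self, dim: int):
        super().__init__()
        self.T = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, v):                 # v: (batch, dim), output of a middle layer
        return v + self.T(v)              # skip connection into the classifier input
```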

Implementation Details:

The general implementation pipeline involves:

  • An input audio waveform.
  • A pre-processing step (e.g., Mel-spectrogram extraction) which is fixed.
  • A trainable transformation module ($\mathcal{H}$) implementing one of the NMR methods (II-NMR, ID-NMR, or IDS-NMR).
  • A frozen pre-trained deep neural network (DNN).
  • A frozen classifier layer from the pre-trained model.
  • A fixed many-to-one label mapping layer, mapping source classes from the pre-trained model to target classes for the genre classification task. For a target genre, the prediction is the average of probabilities from assigned source classes.
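
The averaging step of the label mapping takes only a few lines. In this sketch the mapping indices are random placeholders, since the actual genre-to-source-class assignment is task-specific:

```python
import torch

# Source-class probabilities from the frozen model (527 AudioSet classes assumed).
source_probs = torch.softmax(torch.randn(4, 527), dim=-1)   # (batch, source classes)

# Fixed table assigning n = 2 source classes to each of the 10 GTZAN genres;
# the index values here are placeholders, not the paper's actual assignment.
mapping = torch.randint(0, 527, (10, 2))                    # (genres, n)

genre_probs = source_probs[:, mapping].mean(dim=-1)         # average per genre
prediction = genre_probs.argmax(dim=-1)                     # (batch,) genre ids
```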

For models pre-trained on fixed-length audio inputs (like SpeechATT on 1-second commands or AST on 10-second audio), longer music snippets (30 seconds in GTZAN) are chunked into non-overlapping segments. Each segment is processed independently by the reprogrammed model, and the final class probabilities for the entire snippet are obtained by averaging the probabilities over all segments.
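
A minimal sketch of this chunk-and-average inference, assuming 22,050 Hz audio and a model that takes raw fixed-length segments (the real pipeline feeds Mel-spectrograms; `model` and the sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def classify_snippet(model, waveform, sr=22050, seg_seconds=10):
    seg_len = sr * seg_seconds
    n_segs = (waveform.numel() + seg_len - 1) // seg_len
    # Zero-pad the tail so the snippet splits into whole segments.
    padded = F.pad(waveform, (0, n_segs * seg_len - waveform.numel()))
    segments = padded.view(n_segs, seg_len)         # non-overlapping chunks
    probs = torch.softmax(model(segments), dim=-1)  # score each segment alone
    return probs.mean(dim=0)                        # average over segments

# Example with a dummy linear model and a 30-second GTZAN-style snippet.
dummy = torch.nn.Linear(220500, 10)
snippet = torch.randn(30 * 22050)
print(classify_snippet(dummy, snippet).shape)       # torch.Size([10])
```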

The trainable transformation function $T$ in ID-NMR and IDS-NMR is implemented using convolutional layers, similar to a simple CNN architecture, ensuring it introduces non-linearity and is input-dependent. The number of channels and kernel sizes are roughly tuned to match the complexity of a baseline CNN.

The paper uses two pre-trained models:

  • SpeechATT: Trained on Google Speech Command dataset (speech-focused, relatively small).
  • AST (Audio Spectrogram Transformer): Pre-trained on ImageNet and fine-tuned on AudioSet (general audio/vision, large scale). This represents cross-modal transfer.

Practical Applications and Results:

The primary application demonstrated is low-resource music genre classification on the GTZAN dataset. The results show that:

  • Input-dependent NMR methods (ID-NMR and IDS-NMR) significantly outperform the simple input-independent baseline (II-NMR).
  • Using a more powerful pre-trained model like AST yields much better results than using SpeechATT, highlighting the importance of the source model's representation power, even if trained on different domains.
  • The proposed IDS-NMR method with AST achieves performance comparable to or better than several existing transfer learning methods on GTZAN, including those using models pre-trained specifically on large music datasets (like MusiCNN or JukeBox embeddings) and standard fine-tuning of AST (BL-FT-AST).
  • Crucially, the NMR methods, particularly IDS-NMR, require dramatically fewer trainable parameters and train much faster than fine-tuning the entire pre-trained model. For the AST model, fine-tuning (BL-FT-AST) could not even fit on a single consumer-grade GPU (RTX 2080) for training, while IDS-NMR required only 235k trainable parameters and was significantly faster per epoch than ID-NMR, fitting easily on the GPU.

Implementation Considerations:

  • Computational Efficiency: NMR is highly advantageous in compute-constrained environments, allowing very large pre-trained models to be leveraged without massive GPU memory or processing power for training. Deployment may also be simpler, since only the small transformation module needs to be updated or managed alongside the frozen base model.
  • Data Efficiency: The method is effective in low-resource data scenarios, demonstrated by its strong performance on the small GTZAN dataset.
  • Pre-trained Model Selection: The choice of the pre-trained model is critical. AST, pre-trained on AudioSet, performs much better than SpeechATT, suggesting that a powerful general audio representation is beneficial, even more so than a simpler speech-specific one. Cross-modal transfer (vision+audio pre-training of AST) proves effective.
  • Transformation Architecture: The architecture of the trainable transformation function $T$ needs careful consideration. Simple linear transformations or universal noise are insufficient for complex data like music. Convolutional layers prove effective here.
  • Label Mapping: The many-to-one label mapping requires defining which source classes correspond to which target classes. This might involve some manual grouping or a search process (the paper mentions searching for $n=2$ or $n=5$ source classes per target class; one possible procedure is sketched after this list).
  • Data Preparation: Chunking and padding audio to match the pre-trained model's input requirements is necessary.
  • Trade-offs: While competitive, NMR might not always surpass highly optimized, task-specific fine-tuning on massive datasets (if such datasets existed for music). In the realistic low-resource music scenario, however, it offers a highly practical and effective alternative.
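
The paper does not spell out its label-mapping search procedure, so the following is only one plausible heuristic: assign to each genre the $n$ source classes with the highest mean probability on that genre's training clips. The function name and shapes here are hypothetical:

```python
import torch

def search_mapping(source_probs, labels, n_genres=10, n=2):
    """Pick n source classes per genre by mean probability on that genre's clips."""
    rows = []
    for g in range(n_genres):
        mean_per_class = source_probs[labels == g].mean(dim=0)
        rows.append(mean_per_class.topk(n).indices)
    return torch.stack(rows)   # (n_genres, n) many-to-one assignment table

# Example: 100 clips, 527 source classes, random genre labels.
probs = torch.softmax(torch.randn(100, 527), dim=-1)
labels = torch.randint(0, 10, (100,))
print(search_mapping(probs, labels).shape)   # torch.Size([10, 2])
```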

In summary, this research provides a practical framework for adapting powerful, pre-trained neural networks from potentially different domains to music genre classification tasks with limited data and computational resources, with the proposed input-dependent NMR methods, especially IDS-NMR, showing strong performance and efficiency gains over traditional fine-tuning and representation extraction approaches.