
MATPAC++: MCL for Audio Representation

Updated 19 August 2025
  • MATPAC++ is a self-supervised audio representation learning framework that leverages Multiple Choice Learning to predict ambiguous masked patches.
  • It employs an ambiguity-aware Winner-Takes-All loss and multi-hypothesis predictor architecture to enhance reconstruction and classification performance.
  • Achieving state-of-the-art results on both music-specific and general audio tasks, MATPAC++ offers efficiency, robustness, and competitive model complexity.

MATPAC++ is a self-supervised learning framework designed to advance the state-of-the-art in audio representation learning, notably by improving masked latent prediction via integration of Multiple Choice Learning (MCL). It builds upon the foundational MATPAC system, focusing on explicit modeling of prediction ambiguity in audio sequences—particularly in polyphonic and multi-instrumental contexts—through a multi-hypothesis predictor architecture. MATPAC++ achieves superior performance across linear probing and fine-tuning protocols on both general audio (e.g., AudioSet) and music-specialized tasks, demonstrating efficiency and robustness with competitive model complexity compared to existing large-scale alternatives (Quelennec et al., 18 Aug 2025).

1. Objectives and Core Innovations

MATPAC++ addresses limitations in prior masked latent prediction approaches by introducing explicit ambiguity modeling within the predictor module. The principal innovations include:

  • Multiple Choice Learning (MCL) in Masked Prediction: Standard self-supervised audio models (including MATPAC) utilize a single-head predictor for reconstructing masked latent representations. MATPAC++ replaces this with $r$ distinct linear projection heads, allowing the prediction of multiple plausible hypotheses for the same masked patch.
  • Ambiguity-aware Winner-Takes-All Loss: Instead of enforcing a single prediction, an annealed soft-assignment (Winner-Takes-All with temperature scheduling) is used to select and refine the best-matching hypothesis for each masked patch during training.
  • Integration with Unsupervised Classification: The chosen hypothesis for each masked patch is then used in the unsupervised classification pretext task, mapping latent representations to probability distributions and aligning them via cross-entropy to the teacher's outputs.

This design is motivated by the intrinsically ambiguous nature of audio signals, wherein multiple overlapping sources may yield more than one plausible reconstruction for a given input region.

2. Multiple Choice Learning (MCL): Predictor Architecture and Training

MCL is implemented in MATPAC++ by equipping the predictor module with $r$ output heads, each generating a candidate latent vector for the masked patch $i$:

  • Formulation and Loss Computation:

For each masked patch $i$ and prediction head $j$:

$$d^{(j)}(i) = \left\| \hat{z}_m^{\prime(j)}(i) - z_m^{\prime}(i) \right\|^2$$

where $\hat{z}_m^{\prime(j)}$ and $z_m^{\prime}$ denote the $\ell_2$-normalized predicted and ground-truth latent vectors.

Soft assignment weights are computed using a temperature-annealed softmax:

$$b(i) = \operatorname{Softmax}\left( - \frac{ \left( d^{(1)}(i), \ldots, d^{(r)}(i) \right) }{ \tau_{\mathrm{MCL}} } \right)$$

Final MCL prediction loss:

$$\mathcal{L}_{\text{pred}}\left(Z_m, \hat{Z}_m^{(1)}, \ldots, \hat{Z}_m^{(r)}\right) = \sum_{i=1}^{N} \sum_{j=1}^{r} b^{(j)}(i)\, d^{(j)}(i)$$

where $b^{(j)}(i)$ is the $j$-th component of the soft-assignment vector $b(i)$.

The temperature parameter $\tau_{\mathrm{MCL}}$ is annealed during training to encourage diversity among hypotheses initially and to select sharper assignments later.
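The annealed Winner-Takes-All loss above can be sketched in NumPy as follows (shapes, the stand-alone function, and the `winner` return value are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def mcl_wta_loss(z_hat, z, tau):
    """Annealed Winner-Takes-All MCL loss (illustrative sketch).

    z_hat : (r, N, D) array, r hypothesis predictions per masked patch
    z     : (N, D) array of target latent vectors
    tau   : softmax temperature, annealed from high to low during training
    """
    # l2-normalize predictions and targets, matching the distance definition
    z_hat = z_hat / np.linalg.norm(z_hat, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)

    # squared distances d^(j)(i), shape (r, N)
    d = np.sum((z_hat - z[None]) ** 2, axis=-1)

    # soft assignment b^(j)(i) = softmax_j(-d^(j)(i) / tau)
    logits = -d / tau
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    b = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

    # loss: assignment-weighted sum over heads and patches
    loss = np.sum(b * d)

    # best hypothesis per patch, reused by the classification pretext task
    winner = np.argmin(d, axis=0)
    return loss, winner
```

As `tau` shrinks, the soft assignment approaches a hard Winner-Takes-All selection, so the loss reduces to the sum of each patch's best-hypothesis distance.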

  • Best Hypothesis Selection for Classification:

For the unsupervised classification task, the hypothesis minimizing $d^{(j)}(i)$ for each patch is mapped into probability distributions via a linear classification head.

  • Technical Details:

The predictor input includes a learnable mask token $m$ and positional embedding $p'$, flows through transformer layers, and branches into the $r$ heads. Teacher encoder updates and output centering use exponential moving averages (EMA), as in the original MATPAC (Quelennec et al., 17 Feb 2025).
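The data flow through the multi-hypothesis predictor can be sketched as below; the `tanh` body stands in for the transformer layers, and all parameter shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, r, N = 16, 4, 6          # latent dim, number of heads, masked patches

# randomly initialized stand-ins for learnable parameters
mask_token = rng.normal(size=(D,))        # shared token m for masked positions
pos_emb = rng.normal(size=(N, D))         # positional embeddings p'
heads = rng.normal(size=(r, D, D)) * 0.1  # r linear projection heads

def predict_hypotheses(context):
    """Sketch of the multi-hypothesis predictor: the mask token plus
    positional embeddings flow through the (stand-in) predictor body,
    then branch into r heads, yielding one candidate latent vector
    per head and per masked patch."""
    x = mask_token[None, :] + pos_emb          # (N, D) predictor inputs
    h = np.tanh(x + context)                   # stand-in for transformer layers
    return np.einsum('rde,ne->rnd', heads, h)  # (r, N, D) hypotheses
```

The key design point is that the heads share the entire predictor body and diverge only at the final linear projection, which keeps the cost of multiple hypotheses low.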

3. Evaluation Methodology and Empirical Performance

MATPAC++ is evaluated using two primary methodological paradigms:

  • Linear Probing: A post-hoc linear classifier is trained on temporally averaged encoder embeddings from multiple datasets:
    • Music and Instrument Datasets: OpenMIC, NSynth, GTZAN, MTG, Magna-tag-a-tune.
    • General Audio: ESC-50, FSD50K, US8K.
    • Linear probing measures transferability and the discriminative capacity of learned representations.
  • Fine-Tuning Protocol: The full MATPAC++ encoder is fine-tuned (supervised) on AudioSet. Evaluations include:
    • Patchout Augmentation: Masking parts of the audio input as in PaSST protocol.
    • Full-Input Fine-Tuning.

The results indicate:

  • State-of-the-Art Downstream Performance: MATPAC++ achieves top scores across music and generic audio tasks. For example, fine-tuning with patchout yields 48.09 mAP on AudioSet—outperforming prior SSL methods and matching or exceeding specialized models (e.g., MERT, MULE) with lower parameter counts (~86M vs. 300M+).
  • Efficiency and Robustness: MATPAC++ maintains strong results for both linear probing and full fine-tuning, with reduced model complexity.

4. Domain Specialization and Comparative Results

MATPAC++ supports both generalist (AudioSet-trained) and music-specialist (Million Song Dataset + WebRadio pre-training) configurations:

  • Music-Specific Training: Yields enhanced accuracy in genre and instrument recognition (e.g., up to 79% on NSynth, improved mAP on OpenMIC, and superior auto-tagging), outperforming contemporaneous models, even those explicitly designed for music.
  • Generalization Across Domains: Maintains competitive performance for environmental sound datasets, with improvements attributed to the richer ambiguity-aware representations enabled by MCL.

Efficiency, both computational (parameters, training throughput) and in transfer, is documented as a prominent advantage for MATPAC++ over larger prior architectures.

5. Technical Formulation: Predictor, Loss, and Training Procedures

MATPAC++ training is governed by a joint objective:

$$\mathcal{L} = (1 - \alpha)\, \mathcal{L}_{\text{cls}} + \alpha\, \mathcal{L}_{\text{pred}}$$

where:

  • $\mathcal{L}_{\text{pred}}$ is the annealed Winner-Takes-All MCL loss described above.
  • $\mathcal{L}_{\text{cls}}$ is the cross-entropy between the probability distributions obtained from the best hypothesis and those output by the teacher (after temperature sharpening and centering via EMA updates).
  • $\alpha$ tunes the trade-off between the two pretext objectives.
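A minimal sketch of this joint objective, assuming softmax outputs and a teacher temperature `tau_t` (both the sharpening temperature value and the function signature are illustrative, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_loss(student_logits, teacher_logits, center, loss_pred,
               alpha=0.5, tau_t=0.05):
    """Sketch of the joint objective L = (1-alpha)*L_cls + alpha*L_pred.

    The teacher distribution is sharpened with temperature tau_t and
    shifted by an EMA-maintained center before the cross-entropy,
    discouraging collapse onto a single cluster.
    """
    p_teacher = softmax((teacher_logits - center) / tau_t)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    loss_cls = -np.mean(np.sum(p_teacher * log_p_student, axis=-1))
    return (1.0 - alpha) * loss_cls + alpha * loss_pred
```

Setting `alpha` closer to 1 emphasizes masked latent prediction; closer to 0, the unsupervised classification pretext task dominates.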

The use of soft assignment and temperature annealing in MCL is crucial to avoid premature convergence to suboptimal hypotheses, ensuring continued exploration and specialization.

The teacher-student setup, including EMA updates for both model weights and output centering, ensures stability of training and non-collapse of cluster identities in the unsupervised classification component.
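The two EMA mechanisms can be sketched as below; the dictionary-of-arrays parameter representation and the momentum values are illustrative assumptions:

```python
import numpy as np

def ema_update(teacher, student, m=0.999):
    """EMA teacher update (sketch): teacher parameters slowly track the
    student's, stabilizing the targets the student learns from."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

def ema_center(center, batch_logits, m_c=0.9):
    """EMA of the teacher's mean output logits, subtracted before the
    teacher softmax to keep cluster assignments from collapsing."""
    return m_c * center + (1.0 - m_c) * batch_logits.mean(axis=0)
```

High momentum (`m` near 1) makes the teacher a slowly moving average of past students, which is what keeps the self-distillation targets stable across training steps.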

6. Impact, Applications, and Implications

MATPAC++ provides a generic and scalable approach for self-supervised audio representation learning:

  • Applicability: Suitable for music information retrieval, instrument/genre classification, auto-tagging, audio event detection, and environmental sound classification.
  • Transfer and Specialization: Effective in both generalist and domain-specialized regimes, supporting multi-modal extension and efficient adaptation.
  • Architectural Efficiency: Empirical results demonstrate competitive or superior scores with lower parameter counts and enhanced model throughput—underscoring practical advantages in deployment and real-time inference.

A plausible implication is that future development could extend MCL-based masked prediction frameworks to additional modalities (e.g., vision, text), given their ability to address inherent ambiguities in structured prediction tasks.

7. Concluding Remarks

MATPAC++ constitutes an advancement in self-supervised audio representation learning by explicitly modeling one-to-many prediction ambiguities with Multiple Choice Learning. Its empirical superiority across benchmark datasets, combined with an efficient technical implementation, designates MATPAC++ as a current reference approach for both music-centric and general audio SSL settings (Quelennec et al., 18 Aug 2025). The architectural principles developed in MATPAC++ suggest broader applicability to domains where ambiguous or multi-modal predictions are integral to robust representation learning.