Codec-SUPERB: Neural Sound Codec Benchmark

Updated 23 October 2025
  • Codec-SUPERB is an open-source benchmarking framework that systematically evaluates neural sound codecs using both signal-level and application-level metrics.
  • It employs detailed measures such as PESQ, STOI, and F0CORR alongside ASR, ASV, and emotion recognition to capture performance across diverse audio scenarios.
  • The framework fosters community collaboration with a modular codebase, curated datasets, and a public leaderboard for transparent, reproducible comparisons.

Codec-SUPERB is a standardized, open-source benchmarking framework designed to comprehensively evaluate neural sound codec models along both signal-level and application-level dimensions. It serves as an ecosystem enabling rigorous, reproducible, and fair comparison of diverse codecs, covering both how faithfully they preserve the signal and how well they support practical downstream tasks. Codec-SUPERB incorporates perceptual, spectral, and task-based metrics, provides curated datasets spanning speech, music, and general audio, and facilitates collaborative progress through a public leaderboard and an extensible codebase.

1. Framework Architecture and Scope

Codec-SUPERB, the “Codec Sound processing Universal PERformance Benchmark,” encompasses a multi-dimensional evaluation suite for neural sound codecs (Wu et al., 20 Feb 2024). Its design recognizes the dual roles of codecs: minimizing data transmission latency and serving as tokenizers for speech-language modeling and generative AI workflows. The framework assesses codec models across 20 curated datasets spanning speech, general audio, and music, representing a broad array of sound types. Six codec families (yielding 19 unique models) are included, supporting fair comparative analysis across deployment scenarios.

The evaluation explicitly measures the codec’s ability to preserve content, speaker characteristics, paralinguistic features, and audio details at varying bitrates (e.g., 2–24 kbps). The system is built for extensibility, allowing new codecs, metrics, and datasets to be added modularly.

2. Evaluation Metrics and Methodology

Codec-SUPERB establishes five core signal-level metrics, two of which are sketched in code below:

  • STFTDistance: Assesses multi-scale frequency domain divergence via Short-Time Fourier Transform.
  • MelDistance: L1 loss between log Mel spectrograms, capturing spectral/timbral fidelity.
  • PESQ (Perceptual Evaluation of Speech Quality): Models human-perceived speech quality.
  • STOI (Short-Time Objective Intelligibility): Quantifies speech intelligibility in noise.
  • F0CORR: Pearson correlation coefficient for fundamental frequency (F0), measuring pitch preservation.
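As a concrete illustration, the sketch below computes two of these metrics, MelDistance and F0CORR, using librosa and SciPy. The mel configuration and the pyin pitch-search range are illustrative choices, not necessarily the exact Codec-SUPERB settings; PESQ and STOI are typically obtained from dedicated implementations such as the pesq and pystoi Python packages.

```python
import numpy as np
import librosa
from scipy.stats import pearsonr

def mel_distance(ref: np.ndarray, syn: np.ndarray, sr: int, n_mels: int = 80) -> float:
    """L1 distance between log-Mel spectrograms (lower is better)."""
    def log_mel(y):
        return librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    m_ref, m_syn = log_mel(ref), log_mel(syn)
    t = min(m_ref.shape[1], m_syn.shape[1])  # codec output may differ slightly in length
    return float(np.mean(np.abs(m_ref[:, :t] - m_syn[:, :t])))

def f0_corr(ref: np.ndarray, syn: np.ndarray, sr: int) -> float:
    """Pearson correlation of F0 contours over frames voiced in both signals."""
    f0_ref, _, _ = librosa.pyin(ref, fmin=65.0, fmax=400.0, sr=sr)
    f0_syn, _, _ = librosa.pyin(syn, fmin=65.0, fmax=400.0, sr=sr)
    t = min(len(f0_ref), len(f0_syn))
    voiced = ~np.isnan(f0_ref[:t]) & ~np.isnan(f0_syn[:t])  # pyin marks unvoiced frames as NaN
    r, _ = pearsonr(f0_ref[:t][voiced], f0_syn[:t][voiced])
    return float(r)
```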

To produce a comprehensive ranking, these metrics are min-max normalized (with sigmoid normalization for unbounded measures), inverted where lower raw values indicate better performance (as with the Mel and STFT distances), and aggregated via the harmonic mean:

$$\text{Overall Score} = \operatorname{HM}\!\left(\widehat{\text{PESQ}},\ \widehat{\text{STOI}},\ \widehat{\text{F0CORR}},\ 1 - \widehat{\text{STFTDistance}},\ 1 - \widehat{\text{MelDistance}}\right)$$

where $\widehat{\,\cdot\,}$ denotes the normalized value of a metric, HM is the harmonic mean, and the two distance metrics are inverted so that higher values indicate better performance.
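A minimal sketch of this aggregation, assuming each metric has already been computed for every codec under comparison; the plain min-max normalization and the 1 - x inversion below are illustrative stand-ins for the framework's exact normalization scheme:

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Min-max normalize a metric across all codecs to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def harmonic_mean(v: np.ndarray) -> float:
    return len(v) / float(np.sum(1.0 / (v + 1e-12)))

def overall_score(pesq, stoi, f0corr, stft_dist, mel_dist):
    """Aggregate per-codec metric vectors into one Overall Score per codec."""
    parts = [
        minmax(np.asarray(pesq)),              # higher is better
        minmax(np.asarray(stoi)),
        minmax(np.asarray(f0corr)),
        1.0 - minmax(np.asarray(stft_dist)),   # distances: invert so higher is better
        1.0 - minmax(np.asarray(mel_dist)),
    ]
    return np.array([harmonic_mean(np.array(codec_scores))
                     for codec_scores in zip(*parts)])
```

The harmonic mean penalizes codecs that score well on some metrics but poorly on others, so a high Overall Score requires balanced performance across all five measures.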

Application-level assessments employ pre-trained models to test codec effect on:

  • Automatic Speech Recognition (ASR): Word Error Rate (WER)
  • Automatic Speaker Verification (ASV): Equal Error Rate (EER), minDCF
  • Emotion Recognition (ER): Accuracy (ACC)
  • Audio Event Classification (AEC): Accuracy on AudioSet

Each codec's encoder-decoder is tested by resynthesizing inputs from the entire curated dataset suite; the output signals are evaluated against the originals with respect to both signal-level metrics and downstream task performance. This dual methodology captures subtleties missed by signal-only evaluation, such as how well content or speaker identity survives compression.
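Schematically, the loop might look like the following sketch, where codec, asr_transcribe, and word_error_rate are hypothetical placeholders rather than actual Codec-SUPERB APIs, and mel_distance is the helper sketched earlier:

```python
import numpy as np

def evaluate_codec(codec, dataset, sr, asr_transcribe, word_error_rate):
    """Resynthesize each utterance, then score signal fidelity and downstream ASR."""
    mel_dists, wers = [], []
    for wav, transcript in dataset:          # pairs of (waveform, reference text)
        codes = codec.encode(wav, sr)        # compress to the codec's representation
        resyn = codec.decode(codes)          # reconstruct the waveform
        mel_dists.append(mel_distance(wav, resyn, sr))         # signal level
        hyp = asr_transcribe(resyn, sr)                        # application level
        wers.append(word_error_rate(transcript, hyp))
    return {"MelDistance": float(np.mean(mel_dists)),
            "ASR-WER": float(np.mean(wers))}
```

In a full evaluation, each downstream task (ASV, ER, AEC) would contribute an analogous pre-trained probe alongside the ASR model.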

3. Insights from Codec Analysis

Empirical analysis within Codec-SUPERB reveals strong positive correlations between the Overall Score and PESQ, STOI, and F0CORR, while STFTDistance and MelDistance correlate negatively with it. This validates the harmonic-mean aggregation as a comprehensive indicator of signal preservation.

Application-level results establish quantitative trade-offs:

  • High-bitrate codecs (e.g., DAC at 6–24 kbps) better retain speech content and speaker/emotion/event fidelity, evidenced by low WER/EER and high ER/AEC accuracy.
  • Low-bitrate codecs (such as AcademiCodec at 2–3 kbps) excel in minimal bitrate environments but may sacrifice finer signal detail.

This integrated analysis helps elucidate which architectures are best suited under operational constraints, such as latency-critical conferencing or high-fidelity generative modeling. Notably, scenarios exist where signal-level metrics alone do not capture losses in semantic, linguistic, or paralinguistic content—demonstrating the need for Codec-SUPERB’s application-aware approach.

4. Benchmarking, Leaderboard, and Community Collaboration

Codec-SUPERB is rooted in community collaboration:

  • The codebase is open-source and modular, permitting seamless integration of new codecs and evaluation criteria.
  • The online leaderboard enables result submission and public comparison; researchers upload codec outputs and the platform computes standardized metrics for transparent benchmarking.
  • The database expands dynamically as new models and results are added, with built-in visualization/statistical tools facilitating meta-analysis.

Datasets, model interfaces (the “base_codec” abstraction), and scoring logic are designed for reproducibility and extension, accelerating codec innovation cycles. The separation of codec implementation from evaluation logic promotes generalizability.
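A minimal sketch of what such an interface could look like; the class and method names below are assumptions for illustration, not the framework's actual signatures:

```python
from abc import ABC, abstractmethod
import numpy as np

class BaseCodec(ABC):
    """Contract keeping codec implementations separate from evaluation logic."""

    @abstractmethod
    def encode(self, wav: np.ndarray, sr: int):
        """Map a waveform to the codec's token/latent representation."""

    @abstractmethod
    def decode(self, codes) -> np.ndarray:
        """Reconstruct a waveform from the encoded representation."""

class MyNewCodec(BaseCodec):
    """Example of plugging a new model into the benchmark."""

    def encode(self, wav, sr):
        raise NotImplementedError("wrap your model's encoder here")

    def decode(self, codes):
        raise NotImplementedError("wrap your model's decoder here")
```

Keeping encode/decode behind a fixed contract is what allows the same evaluation script to iterate over arbitrary codec implementations without modification.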

5. The Codec-SUPERB Challenge (SLT 2024) and Lightweight Benchmarking

Codec-SUPERB underpins the “Codec-SUPERB @ SLT 2024” challenge (Wu et al., 21 Sep 2024), which targets lightweight, training-free benchmarking. Representative codec models are submitted and assessed on both open and hidden test sets using strictly license-free corpora, minimizing computational cost and experimental variance.

Key application-level metrics include ASR-WER, ASV-EER, ER-ACC, and AEC accuracy. Signal-level metrics comprise PESQ, STOI, SDR, and MelLoss. Five models (FunCodec, SemantiCodec, APCodec, AFACodec, and SpeechTokenizer) are evaluated across bitrates from 0.34 to 8 kbps.

Findings demonstrate AFACodec's superior performance at mid bitrates (7–8 kbps), with near-original emotion recognition accuracy and low WER/EER, while SemantiCodec maintains strong information retention at ultra-low bitrates. Correlation analysis in Table 4 of (Wu et al., 21 Sep 2024) confirms STOI and PESQ as robust predictors of application-level outcomes.

6. Implications, Impact, and Future Directions

Codec-SUPERB advances the state-of-the-art in audio codec benchmarking by establishing rigorous, standardized, and transparent evaluation pipelines. The dual focus on signal-level fidelity and downstream task performance is set to become the reference methodology for codec research.

Future enhancements include:

  • Multilingual evaluation expansion: benchmarking across additional languages and accents to assess codec generalization.
  • Accelerated sample selection for evaluation efficiency.
  • Improved semantic tokenization and integration of advanced generative and diffusion models.
  • Deeper analysis of trade-offs between bitrate, model size, latency, and application-specific performance.

The Codec-SUPERB framework, through its in-depth analytic approach and collaborative infrastructure, is positioned to drive reproducible research, inform codec design decisions, and facilitate integration into diverse audio and speech processing pipelines.

7. Technical Summary Table

| Metric | Signal-Level Function | Application Metric(s) |
|---|---|---|
| STFTDistance | Frequency content via STFT | – |
| MelDistance | Log-Mel spectral timbre | – |
| PESQ | Perceptual speech quality | ASR-WER, ASV-EER, ER-ACC, AEC |
| STOI | Speech intelligibility | ASR-WER, ER-ACC |
| F0CORR | Pitch preservation | – |

This statistical rigor, holistic scope, and open community infrastructure establish Codec-SUPERB as a pivotal resource for neural codec research, enabling precise assessment across competing codecs and guiding future innovations in compression, generative modeling, and multilingual audio systems.
