ACEval: Audio Evaluation Frameworks
- ACEval refers to three rigorous evaluation frameworks that standardize metrics for room acoustics, audio CAPTCHA security, and audio codec performance.
- The frameworks employ real-world datasets and a range of quantitative metrics, such as bias, MSE, Pearson correlation, and PESQ, to ensure objective, transparent assessments.
- The frameworks bridge reproducibility gaps in audio research, enabling actionable comparisons across acoustics, security, and audio coding domains.
ACEval is a term designating three distinct, rigorous evaluation frameworks in modern audio and acoustics research. It refers to: (1) the evaluation suite for the Acoustic Characterization of Environments Challenge (ACE Challenge) (Eaton et al., 2015); (2) the automated evaluation pipeline in the recent AI-CAPTCHA framework for audio CAPTCHA security (Ding et al., 13 Jan 2026); and (3) OpenACE, an open-source benchmark for full-band audio codec comparison (Coldenhoff et al., 2024). These frameworks share a principled focus on objective, transparent, and reproducible evaluation across disparate subdomains: room acoustics, security, and audio coding. Each instantiation of ACEval targets a distinct set of performance metrics, datasets, and methodologies to address critical reproducibility and benchmarking gaps in its respective area.
1. ACEval in the Acoustic Characterization of Environments (ACE) Challenge
ACEval in the ACE Challenge context is a comprehensive evaluation suite for benchmarking blind estimation algorithms of room acoustic parameters—principally reverberation time (T₆₀) and direct-to-reverberant ratio (DRR)—on realistic, multi-channel, noisy recordings (Eaton et al., 2015).
Dataset and Recording Environments
The ACEval framework employs the ACE Challenge dataset consisting of real-room measurements across twelve acoustically diverse environments, including offices, meeting rooms, lecture theatres, and lobbies. Critical aspects:
- Reverberation times (T₆₀): 0.2 s to 1.5 s
- Noise types: ambient, babble, and fan
- SNRs: 18 dB, 12 dB, −1 dB
- Microphone configurations: single-channel, 2-channel (stereo), and 3-, 8-, and 32-channel arrays
No synthetic RIRs are used in the official evaluation set, ensuring ecological validity.
Definitions and Evaluation Metrics
Key acoustic parameters are defined mathematically:
- T₆₀ (Reverberation Time): the time required for the reverberant sound energy to decay by 60 dB, conventionally estimated from the Schroeder energy decay curve of the room impulse response $h$, $\mathrm{EDC}(t) = \int_{t}^{\infty} h^{2}(\tau)\,d\tau$.
- DRR (Direct-to-Reverberant Ratio): the energy ratio (in dB) between the direct path and the reverberant tail, $\mathrm{DRR} = 10\log_{10}\!\left(\sum_{n=n_d-n_0}^{n_d+n_0} h^{2}(n) \Big/ \sum_{n>n_d+n_0} h^{2}(n)\right)$, where $n_d$ indexes the direct-path arrival and $n_0$ defines a short window around it.
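For illustration, a minimal estimation sketch from a measured room impulse response `h` sampled at `fs` follows; the function names and the 2.5 ms direct-path window are illustrative choices, not the ACE reference implementation.

```python
import numpy as np

def t60_from_rir(h, fs):
    """Estimate T60 via Schroeder backward integration: fit the -5 to -35 dB
    range of the energy decay curve and extrapolate to a 60 dB decay."""
    h = np.asarray(h, dtype=float)
    edc = np.cumsum(h[::-1] ** 2)[::-1]            # Schroeder energy decay curve
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)   # normalized to 0 dB at t = 0
    t = np.arange(len(h)) / fs
    fit = (edc_db <= -5) & (edc_db >= -35)         # linear-fit region (T30)
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)  # decay rate in dB/s (negative)
    return -60.0 / slope                           # seconds for a 60 dB decay

def drr_from_rir(h, fs, direct_window_ms=2.5):
    """Estimate DRR as the dB energy ratio between a short window around the
    direct-path peak and the remaining reverberant tail."""
    h = np.asarray(h, dtype=float)
    n_d = int(np.argmax(np.abs(h)))                # direct-path arrival index
    n_0 = int(direct_window_ms * 1e-3 * fs)        # half-width of direct window
    direct = np.sum(h[max(0, n_d - n_0):n_d + n_0 + 1] ** 2)
    reverberant = np.sum(h[n_d + n_0 + 1:] ** 2)
    return 10 * np.log10(direct / reverberant)
```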
ACEval reports:
- Bias: mean signed estimation error, $\mathrm{Bias} = \frac{1}{N}\sum_{i=1}^{N}(\hat{x}_i - x_i)$
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{x}_i - x_i)^2$
- Pearson Correlation ($\rho$): $\rho = \frac{\sum_{i}(\hat{x}_i - \bar{\hat{x}})(x_i - \bar{x})}{\sqrt{\sum_{i}(\hat{x}_i - \bar{\hat{x}})^2}\sqrt{\sum_{i}(x_i - \bar{x})^2}}$ between estimates $\hat{x}_i$ and ground-truth values $x_i$
- Real-Time Factor (RTF): processing time divided by signal duration, $\mathrm{RTF} = T_{\mathrm{proc}} / T_{\mathrm{signal}}$
Frequency-dependent (subband) estimates are produced for both T₆₀ and DRR.
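The reported statistics can be computed from paired estimates and ground-truth values as in the following minimal sketch; the function name and output keys are illustrative.

```python
import numpy as np

def ace_metrics(estimates, ground_truth, proc_time_s=None, signal_dur_s=None):
    """Summary statistics reported by ACEval: bias, MSE, Pearson correlation,
    and (optionally) real-time factor."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(ground_truth, dtype=float)
    err = est - ref
    metrics = {
        "bias": err.mean(),                          # mean signed error
        "mse": np.mean(err ** 2),                    # mean squared error
        "pearson_rho": np.corrcoef(est, ref)[0, 1],  # linear correlation
    }
    if proc_time_s is not None and signal_dur_s is not None:
        metrics["rtf"] = proc_time_s / signal_dur_s  # processing time / audio length
    return metrics

# Example: T60 estimates (s) against ground truth for three recordings.
print(ace_metrics([0.45, 0.81, 1.10], [0.50, 0.90, 1.00]))
```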
Algorithmic Submissions and Comparative Analysis
Algorithms submitted to ACEval fall into three methodological categories:
- Analytical ± Bias Compensation (ABC)
- Single-Feature + Mapping (SFM)
- Machine-Learning + Multiple Features (MLMF)
Results demonstrate that subband exploitation and noise compensation modules improve robustness, while spatial array methods enhance DRR estimation. Top performers achieve Pearson correlation up to 0.78 for T₆₀ and up to 0.84 for DRR.
Impact and Limitations
ACEval standardizes reproducible, multi-channel, noise-robust room acoustic evaluation, directly addressing previous overreliance on synthetic or narrowband benchmarks. It exposes failure modes, including SFM method saturation at high T₆₀, and poor low-frequency DRR estimation under low SNR and fan noise. Recommendations emphasize dataset expansion and hybrid ML–analytical approaches for future work (Eaton et al., 2015).
2. ACEval as an Evaluation Suite for Audio CAPTCHA Security
Within the AI-CAPTCHA framework, ACEVAL denotes a unified methodology for evaluating the resilience of audio CAPTCHAs against advanced AI-based solvers (Ding et al., 13 Jan 2026). It enables systematic security assessment and quantifies the human–AI performance gap.
Dual-Solver Evaluation Architecture
ACEVAL comprises two attack pipelines:
- LALM-based solver: Directly processes audio via large audio-language models (LALMs), including Qwen-Audio-Chat, SeaLLMs-Audio-7B, and Qwen2-Audio-7B-Instruct. Prompts may use zero-shot or chain-of-thought formats.
- ASR-based solver: First transcribes audio challenges using advanced ASR (e.g., GPT-4o-Transcript), then performs downstream reasoning with a text LLM (e.g., GPT-4o).
Both solvers are agnostic to the specifics of the CAPTCHA scheme.
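For concreteness, a hedged sketch of the ASR-based attack path using the OpenAI Python SDK; the model names ("whisper-1", "gpt-4o") and the prompt are placeholders, and the exact solver configuration in (Ding et al., 13 Jan 2026) may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def asr_llm_solve(audio_path: str) -> str:
    """Two-stage ASR-based solver: transcribe the audio challenge, then let a
    text LLM reason over the transcript to produce a candidate answer."""
    # Stage 1: transcription (placeholder ASR model).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
        ).text

    # Stage 2: downstream reasoning with a text LLM (placeholder model).
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are solving an audio CAPTCHA. Return only the answer."},
            {"role": "user", "content": f"Transcript: {transcript}"},
        ],
    )
    return response.choices[0].message.content.strip()
```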
Formal Evaluation Metrics
ACEVAL defines:
- Bypass Rate ($\mathrm{BR}$): the fraction of challenges an AI solver answers correctly, $\mathrm{BR} = N_{\text{solved}} / N_{\text{total}}$
- Robustness Score: a scheme-level summary of resilience against the evaluated AI solvers
- Human Success Rate ($\mathrm{SR}_H$): the fraction of challenges human participants answer correctly
- Word Error Rate (WER): used to analyze ASR transcription performance, but not central to bypass evaluation
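A minimal sketch, assuming per-challenge boolean outcomes for an AI solver and for human participants, shows how these rates and the resulting security gap are tabulated; all names and values below are illustrative.

```python
def success_rate(outcomes):
    """Fraction of challenges answered correctly (iterable of booleans)."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)

# Hypothetical per-challenge results for one CAPTCHA scheme.
ai_outcomes = [True, True, False, True]      # AI solver (a success = a bypass)
human_outcomes = [True, False, True, True]   # human participants

bypass_rate = success_rate(ai_outcomes)      # BR
human_sr = success_rate(human_outcomes)      # SR_H
security_gap = human_sr - bypass_rate        # a secure, usable scheme keeps this large
```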
Experimental Design
ACEVAL evaluates seven production audio CAPTCHA schemes (Geetest, Google, MTCaptcha, Telephone-Audio, Math, Character, Arkose Labs) using 210 real challenge samples, various LALMs/ASR pipelines, and a balanced human study cohort including visually impaired participants.
Key Findings
- Existing CAPTCHAs are susceptible to LALM and ASR-LLM attacks, with bypass rates as high as 80–100%.
- Some schemes are hard for both humans and AI, but many fail to create a meaningful gap between the AI bypass rate and the human success rate.
- ILLUSIONAUDIO, by contrast, demonstrates a substantially lower bypass rate across all solvers together with a high human success rate, verifying high accessibility and security.
- Ablation studies in ACEVAL quantify the contributions of signal postprocessing and reference audio for human priming.
Significance
ACEVAL provides a reproducible, multi-model, human-anchored benchmark for evaluating audio CAPTCHA robustness. Its dual emphasis on bypass and usability rates addresses the core security-accessibility tradeoff (Ding et al., 13 Jan 2026).
3. OpenACE: A Comprehensive Benchmark for Audio Coding Evaluation
OpenACE (“open ACEval,” Editor’s term) is an open-source, end-to-end evaluation benchmark for audio codec performance, addressing limitations of proprietary, narrowband, or non-reproducible evaluations (Coldenhoff et al., 2024).
Data Corpus
OpenACE compiles 5.9 hours of material from seven public, full-band datasets, including:
- IEEE 269-2010 test vectors
- ETSI TS 103-281 (multilingual, noisy speech)
- ITU-T P.501 (10 languages, 8/16/48 kHz)
- VCTK and EARS speech corpora
- EBU SQAM and the ODAQ music/audio test sets
This coverage enables robust assessment across languages, content types (music, environmental sounds, emotional speech), and sample rates, in contrast to prior art limited to proprietary test vectors.
Workflow Architecture
OpenACE operationalizes codec evaluation in a standardized three-stage workflow:
- Data Preparation: Down-mixing to mono, PCM conversion, metric-specific resampling, temporal truncation (10 s for objective; 6 s for subjective tests).
- Codec Invocation: Codec configuration (bitrate selection, preprocessing, command-line templates), then systematic encode-decode cycles.
- Quality Metric Computation: Automated calculation of:
  - PESQ (ITU-T P.862): MOS-scaled objective speech quality for narrowband and wideband speech
  - POLQA (ITU-T P.863): MOS-LQO for wideband through super-wideband/full-band speech
  - VISQOL Audio v3: objective perceptual similarity score for general full-band audio
  - SNR: $\mathrm{SNR} = 10\log_{10}\!\left(\sum_{n} x^{2}(n) \Big/ \sum_{n}\bigl(x(n) - \hat{x}(n)\bigr)^{2}\right)$ dB between the reference $x$ and the decoded signal $\hat{x}$
Results are logged in CSV, with plotting and MUSHRA subjective test page generation.
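A minimal sketch of this three-stage workflow, assuming command-line codec binaries and standard Python audio tooling (soundfile, scipy); the paths, command template, and SNR helper are illustrative and not the actual OpenACE scripts.

```python
import csv
import subprocess
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

def prepare(in_path, out_path, target_sr=48000, max_seconds=10):
    """Stage 1: down-mix to mono, resample, truncate, and write 16-bit PCM."""
    audio, sr = sf.read(in_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                   # mono down-mix
    if sr != target_sr:
        audio = resample_poly(audio, target_sr, sr)  # metric-specific resampling
        sr = target_sr
    audio = audio[: max_seconds * sr]                # temporal truncation
    sf.write(out_path, audio, sr, subtype="PCM_16")

def run_codec(cmd_template, in_path, out_path, bitrate_kbps):
    """Stage 2: one encode-decode cycle driven by a command-line template."""
    cmd = cmd_template.format(inp=in_path, out=out_path, br=bitrate_kbps)
    subprocess.run(cmd, shell=True, check=True)

def snr_db(reference, decoded):
    """Stage 3 (one metric): SNR in dB between reference and decoded signals."""
    noise = reference - decoded
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_row(csv_path, row):
    """Append one (codec, bitrate, item, metric) result row to the CSV log."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(row)
```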
Use Cases and Results
Table-based comparisons (e.g., at 16/32/64 kbps) reveal bitrate-dependent quality trade-offs between legacy codecs (EVS, Opus) and new standards (LC3/LC3+). Emotional-speech subjective tests indicate all codecs degrade for high-arousal emotions at low bitrates, suggesting future codecs should model excitation variability.
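Such comparison tables can be assembled from the logged CSV results, for example with pandas; the file name and column names (codec, bitrate_kbps, visqol) are assumptions about the log schema rather than the actual OpenACE output format.

```python
import pandas as pd

# Assumed schema: one row per (codec, bitrate_kbps, test item) with metric columns.
results = pd.read_csv("results.csv")

# Mean quality per codec and bitrate, e.g. a VISQOL comparison table.
table = results.pivot_table(index="codec", columns="bitrate_kbps",
                            values="visqol", aggfunc="mean")
print(table.round(2))
```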
Extensibility and Democratization
OpenACE supports containerized reference codec implementations, streamlined dataset/codec integration, and full reproducibility. This positions the framework to become an open standard for academic and commercial benchmarking, and to prevent the overfitting or unfair advantage associated with closed datasets.
4. Comparative Overview of Framework Architectures and Metrics
| Framework | Target Task | Core Metrics |
|---|---|---|
| ACEval (ACE Challenge) | Room acoustics (T₆₀, DRR estimation) | Bias, MSE, Pearson ρ, RTF |
| ACEVAL (AI-CAPTCHA) | Audio CAPTCHA security | Bypass Rate, Human Success Rate, WER |
| OpenACE (Audio Coding) | Audio/speech codec quality benchmarking | PESQ, POLQA, VISQOL, SNR |
All three frameworks emphasize open access, multi-system comparability, and careful metric definition. Metrics are parameterized for the characteristics of their evaluation targets—acoustic parameters, adversarial robustness, or perceptual quality—enabling system-level benchmarking tailored to the demands of their respective domains.
5. Limitations, Common Challenges, and Future Directions
Each ACEval framework confronts domain-specific challenges:
- ACE Challenge: Underrepresentation of extreme (T₆₀ > 1.5 s) or outdoor environments; evaluation-set size constrained by the exclusive use of measured (non-simulated) RIRs; performance degrades at low SNR and for certain noise types (fan, babble).
- ACEVAL for CAPTCHAs: Model capabilities are rapidly evolving; real-world deployment may differ from laboratory challenge settings; human accessibility must be validated continually, especially for new perceptual mechanisms.
- OpenACE: Objective metrics (e.g., VISQOL) do not always perfectly predict human perceptual judgments; high-arousal emotional speech remains a systematic challenge for all codecs at constrained bitrates.
A plausible implication is that future ACEval evolutions in all three domains will require cross-modal hybrid evaluation protocols, expanded datasets representing real-world variance, and tighter coupling of objective and subjective quality metrics.
6. Significance for the Research Ecosystem
The ACEval family of frameworks represents a trend towards open, reproducible, and extensible benchmarks in audio technology and acoustics, filling reproducibility gaps and enabling rigorous, system-level comparison under realistic, sometimes adversarial, conditions. By formalizing metrics and workflows—such as Bypass Rate and Human Success Rate in security, or MSE and Pearson ρ in acoustics—ACEval creates a shared technical foundation for research and industry alike. This supports reproducible science, fair cross-system benchmarking, and progress toward more robust, perceptually aligned, and secure audio technologies (Eaton et al., 2015, Coldenhoff et al., 2024, Ding et al., 13 Jan 2026).