ACEval: Audio Evaluation Frameworks
- ACEval refers to three rigorous evaluation frameworks that standardize metrics for room acoustics, audio CAPTCHA security, and audio codec performance.
- The frameworks employ real-world datasets and a range of quantitative metrics, such as bias, MSE, Pearson correlation, and PESQ, to ensure objective, transparent assessments.
- The frameworks bridge reproducibility gaps in audio research, enabling actionable comparisons across acoustics, security, and audio coding domains.
ACEval is a term designating three distinct, rigorous evaluation frameworks in modern audio and acoustics research. It refers to: (1) the evaluation suite for the Acoustic Characterization of Environments Challenge (ACE Challenge) (Eaton et al., 2015); (2) the automated evaluation pipeline in the recent AI-CAPTCHA framework for audio CAPTCHA security (Ding et al., 13 Jan 2026); and (3) OpenACE, an open-source benchmark for full-band audio codec comparison (Coldenhoff et al., 2024). These frameworks share a principled focus on objective, transparent, and reproducible evaluation across disparate subdomains: room acoustics, security, and audio coding. Each instantiation of ACEval targets a distinct set of performance metrics, datasets, and methodologies to address critical reproducibility and benchmarking gaps in its respective area.
1. ACEval in the Acoustic Characterization of Environments (ACE) Challenge
ACEval in the ACE Challenge context is a comprehensive evaluation suite for benchmarking blind estimation algorithms of room acoustic parameters—principally reverberation time (T₆₀) and direct-to-reverberant ratio (DRR)—on realistic, multi-channel, noisy recordings (Eaton et al., 2015).
Dataset and Recording Environments
The ACEval framework employs the ACE Challenge dataset consisting of real-room measurements across twelve acoustically diverse environments, including offices, meeting rooms, lecture theatres, and lobbies. Critical aspects:
- Reverberation times (T₆₀): 0.2 s to 1.5 s
- Noise types: ambient, babble, and fan
- SNRs: 18 dB, 12 dB, −1 dB
- Microphone configurations: single-channel, 2-channel (stereo), and 3-, 8-, and 32-channel arrays
No synthetic RIRs are used in the official evaluation set, ensuring ecological validity.
Definitions and Evaluation Metrics
Key acoustic parameters are defined mathematically:
- T₆₀ (Reverberation Time): the time required for the reverberant sound energy to decay by 60 dB, conventionally estimated from the Schroeder energy decay curve of the room impulse response $h$, $\mathrm{EDC}(t) = \int_{t}^{\infty} h^{2}(\tau)\,d\tau$.
- DRR (Direct-to-Reverberant Ratio): the energy ratio (in dB) between the direct path and the reverberant tail, $\mathrm{DRR} = 10\log_{10}\!\left(\sum_{n=n_d-n_0}^{n_d+n_0} h^{2}(n) \Big/ \sum_{n>n_d+n_0} h^{2}(n)\right)$, where $n_d$ indexes the direct-path arrival and $n_0$ defines a short window around it.
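For illustration, a minimal estimation sketch from a measured room impulse response `h` sampled at `fs` follows; the function names and the 2.5 ms direct-path window are illustrative choices, not the ACE reference implementation.

```python
import numpy as np

def t60_from_rir(h, fs):
    """Estimate T60 via Schroeder backward integration: fit the -5 to -35 dB
    range of the energy decay curve and extrapolate to a 60 dB decay."""
    h = np.asarray(h, dtype=float)
    edc = np.cumsum(h[::-1] ** 2)[::-1]            # Schroeder energy decay curve
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)   # normalized to 0 dB at t = 0
    t = np.arange(len(h)) / fs
    fit = (edc_db <= -5) & (edc_db >= -35)         # linear-fit region (T30)
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)  # decay rate in dB/s (negative)
    return -60.0 / slope                           # seconds for a 60 dB decay

def drr_from_rir(h, fs, direct_window_ms=2.5):
    """Estimate DRR as the dB energy ratio between a short window around the
    direct-path peak and the remaining reverberant tail."""
    h = np.asarray(h, dtype=float)
    n_d = int(np.argmax(np.abs(h)))                # direct-path arrival index
    n_0 = int(direct_window_ms * 1e-3 * fs)        # half-width of direct window
    direct = np.sum(h[max(0, n_d - n_0):n_d + n_0 + 1] ** 2)
    reverberant = np.sum(h[n_d + n_0 + 1:] ** 2)
    return 10 * np.log10(direct / reverberant)
```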
ACEval reports:
- Bias: mean signed estimation error, $\mathrm{Bias} = \frac{1}{N}\sum_{i=1}^{N}(\hat{x}_i - x_i)$
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{x}_i - x_i)^2$
- Pearson Correlation ($\rho$): $\rho = \frac{\sum_{i}(\hat{x}_i - \bar{\hat{x}})(x_i - \bar{x})}{\sqrt{\sum_{i}(\hat{x}_i - \bar{\hat{x}})^2}\sqrt{\sum_{i}(x_i - \bar{x})^2}}$ between estimates $\hat{x}_i$ and ground-truth values $x_i$
- Real-Time Factor (RTF): processing time divided by signal duration, $\mathrm{RTF} = T_{\mathrm{proc}} / T_{\mathrm{signal}}$
Frequency-dependent (subband) estimates are produced for both T₆₀ and DRR.
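The reported statistics can be computed from paired estimates and ground-truth values as in the following minimal sketch; the function name and output keys are illustrative.

```python
import numpy as np

def ace_metrics(estimates, ground_truth, proc_time_s=None, signal_dur_s=None):
    """Summary statistics reported by ACEval: bias, MSE, Pearson correlation,
    and (optionally) real-time factor."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(ground_truth, dtype=float)
    err = est - ref
    metrics = {
        "bias": err.mean(),                          # mean signed error
        "mse": np.mean(err ** 2),                    # mean squared error
        "pearson_rho": np.corrcoef(est, ref)[0, 1],  # linear correlation
    }
    if proc_time_s is not None and signal_dur_s is not None:
        metrics["rtf"] = proc_time_s / signal_dur_s  # processing time / audio length
    return metrics

# Example: T60 estimates (s) against ground truth for three recordings.
print(ace_metrics([0.45, 0.81, 1.10], [0.50, 0.90, 1.00]))
```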
Algorithmic Submissions and Comparative Analysis
Algorithms submitted to ACEval fall into three methodological categories:
- Analytical ± Bias Compensation (ABC)
- Single-Feature + Mapping (SFM)
- Machine-Learning + Multiple Features (MLMF)
Results demonstrate that subband exploitation and noise compensation modules improve robustness, while spatial array methods enhance DRR estimation. Top performers achieve Pearson correlation up to 0.78 for T₆₀ and up to 0.84 for DRR.
Impact and Limitations
ACEval standardizes reproducible, multi-channel, noise-robust room acoustic evaluation, directly addressing previous overreliance on synthetic or narrowband benchmarks. It exposes failure modes, including SFM method saturation at high T₆₀, and poor low-frequency DRR estimation under low SNR and fan noise. Recommendations emphasize dataset expansion and hybrid ML–analytical approaches for future work (Eaton et al., 2015).
2. ACEval as an Evaluation Suite for Audio CAPTCHA Security
Within the AI-CAPTCHA framework, ACEVAL denotes a unified methodology for evaluating the resilience of audio CAPTCHAs against advanced AI-based solvers (Ding et al., 13 Jan 2026). It enables systematic security assessment and quantifies the human–AI performance gap.
Dual-Solver Evaluation Architecture
ACEVAL comprises two attack pipelines:
- LALM-based solver: Directly processes audio via large audio-language models (LALMs), including Qwen-Audio-Chat, SeaLLMs-Audio-7B, and Qwen2-Audio-7B-Instruct. Prompts may use zero-shot or chain-of-thought formats.
- ASR-based solver: First transcribes audio challenges using advanced ASR (e.g., GPT-4o-Transcript), then performs downstream reasoning with a text LLM (e.g., GPT-4o).
Both solvers are agnostic to the specifics of the CAPTCHA scheme.
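For concreteness, a hedged sketch of the ASR-based attack path using the OpenAI Python SDK; the model names ("whisper-1", "gpt-4o") and the prompt are placeholders, and the exact solver configuration in (Ding et al., 13 Jan 2026) may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def asr_llm_solve(audio_path: str) -> str:
    """Two-stage ASR-based solver: transcribe the audio challenge, then let a
    text LLM reason over the transcript to produce a candidate answer."""
    # Stage 1: transcription (placeholder ASR model).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
        ).text

    # Stage 2: downstream reasoning with a text LLM (placeholder model).
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are solving an audio CAPTCHA. Return only the answer."},
            {"role": "user", "content": f"Transcript: {transcript}"},
        ],
    )
    return response.choices[0].message.content.strip()
```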
Formal Evaluation Metrics
ACEVAL defines:
- Bypass Rate ($\mathrm{BR}$): the fraction of challenges an AI solver answers correctly, $\mathrm{BR} = N_{\text{solved}} / N_{\text{total}}$
- Robustness Score: a scheme-level summary of resilience against the evaluated AI solvers
- Human Success Rate ($\mathrm{SR}_H$): the fraction of challenges human participants answer correctly
- Word Error Rate (WER): used to analyze ASR transcription performance, but not central to bypass evaluation
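A minimal sketch, assuming per-challenge boolean outcomes for an AI solver and for human participants, shows how these rates and the resulting security gap are tabulated; all names and values below are illustrative.

```python
def success_rate(outcomes):
    """Fraction of challenges answered correctly (iterable of booleans)."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)

# Hypothetical per-challenge results for one CAPTCHA scheme.
ai_outcomes = [True, True, False, True]      # AI solver (a success = a bypass)
human_outcomes = [True, False, True, True]   # human participants

bypass_rate = success_rate(ai_outcomes)      # BR
human_sr = success_rate(human_outcomes)      # SR_H
security_gap = human_sr - bypass_rate        # a secure, usable scheme keeps this large
```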
Experimental Design
ACEVAL evaluates seven production audio CAPTCHA schemes (Geetest, Google, MTCaptcha, Telephone-Audio, Math, Character, Arkose Labs) using 210 real challenge samples, various LALMs/ASR pipelines, and a balanced human study cohort including visually impaired participants.
Key Findings
- Existing CAPTCHAs are susceptible to LALM and ASR-LLM attacks, with bypass rates as high as 80–100%.
- Some schemes are hard for both humans and AI, but many fail to create a meaningful gap between the AI bypass rate and the human success rate.
- ILLUSIONAUDIO, by contrast, demonstrates a substantially lower bypass rate across all solvers together with a high human success rate, verifying high accessibility and security.
- Ablation studies in ACEVAL quantify the contributions of signal postprocessing and reference audio for human priming.
Significance
ACEVAL provides a reproducible, multi-model, human-anchored benchmark for evaluating audio CAPTCHA robustness. Its dual emphasis on bypass and usability rates addresses the core security-accessibility tradeoff (Ding et al., 13 Jan 2026).
3. OpenACE: A Comprehensive Benchmark for Audio Coding Evaluation
OpenACE (“open ACEval,” Editor’s term) is an open-source, end-to-end evaluation benchmark for audio codec performance, addressing limitations of proprietary, narrowband, or non-reproducible evaluations (Coldenhoff et al., 2024).
Data Corpus
OpenACE compiles 5.9 hours of material from seven public, full-band datasets, including:
- IEEE 269-2010 test vectors
- ETSI TS 103-281 (multilingual, noisy speech)
- ITU-T P.501 (10 languages, 8/16/48 kHz)
- VCTK and EARS speech corpora
- EBU SQAM and the ODAQ music/audio test sets
This coverage enables robust assessment across languages, content types (music, environmental sounds, emotional speech), and sample rates, in contrast to prior art limited to proprietary test vectors.
Workflow Architecture
OpenACE operationalizes codec evaluation in a standardized three-stage workflow:
- Data Preparation: Down-mixing to mono, PCM conversion, metric-specific resampling, temporal truncation (10 s for objective; 6 s for subjective tests).
- Codec Invocation: Codec configuration (bitrate selection, preprocessing, command-line templates), then systematic encode-decode cycles.
- Quality Metric Computation: Automated calculation of:
  - PESQ (ITU-T P.862): MOS-scaled objective speech quality for narrowband and wideband speech
  - POLQA (ITU-T P.863): MOS-LQO for wideband through super-wideband/full-band speech
  - VISQOL Audio v3: objective perceptual similarity score for general full-band audio
  - SNR: $\mathrm{SNR} = 10\log_{10}\!\left(\sum_{n} x^{2}(n) \Big/ \sum_{n}\bigl(x(n) - \hat{x}(n)\bigr)^{2}\right)$ dB between the reference $x$ and the decoded signal $\hat{x}$
Results are logged in CSV, with plotting and MUSHRA subjective test page generation.
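A minimal sketch of this three-stage workflow, assuming command-line codec binaries and standard Python audio tooling (soundfile, scipy); the paths, command template, and SNR helper are illustrative and not the actual OpenACE scripts.

```python
import csv
import subprocess
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

def prepare(in_path, out_path, target_sr=48000, max_seconds=10):
    """Stage 1: down-mix to mono, resample, truncate, and write 16-bit PCM."""
    audio, sr = sf.read(in_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                   # mono down-mix
    if sr != target_sr:
        audio = resample_poly(audio, target_sr, sr)  # metric-specific resampling
        sr = target_sr
    audio = audio[: max_seconds * sr]                # temporal truncation
    sf.write(out_path, audio, sr, subtype="PCM_16")

def run_codec(cmd_template, in_path, out_path, bitrate_kbps):
    """Stage 2: one encode-decode cycle driven by a command-line template."""
    cmd = cmd_template.format(inp=in_path, out=out_path, br=bitrate_kbps)
    subprocess.run(cmd, shell=True, check=True)

def snr_db(reference, decoded):
    """Stage 3 (one metric): SNR in dB between reference and decoded signals."""
    noise = reference - decoded
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_row(csv_path, row):
    """Append one (codec, bitrate, item, metric) result row to the CSV log."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(row)
```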
Use Cases and Results
Table-based comparisons (e.g., at 16/32/64 kbps) reveal bitrate-dependent quality trade-offs between legacy codecs (EVS, Opus) and new standards (LC3/LC3+). Emotional-speech subjective tests indicate all codecs degrade for high-arousal emotions at low bitrates, suggesting future codecs should model excitation variability.
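Such comparison tables can be assembled from the logged CSV results, for example with pandas; the file name and column names (codec, bitrate_kbps, visqol) are assumptions about the log schema rather than the actual OpenACE output format.

```python
import pandas as pd

# Assumed schema: one row per (codec, bitrate_kbps, test item) with metric columns.
results = pd.read_csv("results.csv")

# Mean quality per codec and bitrate, e.g. a VISQOL comparison table.
table = results.pivot_table(index="codec", columns="bitrate_kbps",
                            values="visqol", aggfunc="mean")
print(table.round(2))
```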
Extensibility and Democratization
OpenACE supports containerized reference codec implementations, streamlined dataset/codec integration, and full reproducibility. This positions the framework to become an open standard for academic and commercial benchmarking, and to prevent the overfitting or unfair advantage associated with closed datasets.
4. Comparative Overview of Framework Architectures and Metrics
| Framework | Target Task | Core Metrics |
|---|---|---|
| ACEval (ACE Challenge) | Room acoustics (T₆₀, DRR estimation) | Bias, MSE, Pearson ρ, RTF |
| ACEVAL (AI-CAPTCHA) | Audio CAPTCHA security | Bypass Rate, Human Success Rate, WER |
| OpenACE (Audio Coding) | Audio/speech codec quality benchmarking | PESQ, POLQA, VISQOL, SNR |
All three frameworks emphasize open access, multi-system comparability, and careful metric definition. Metrics are parameterized for the characteristics of their evaluation targets—acoustic parameters, adversarial robustness, or perceptual quality—enabling system-level benchmarking tailored to the demands of their respective domains.
5. Limitations, Common Challenges, and Future Directions
Each ACEval framework confronts domain-specific challenges:
- ACE Challenge: Underrepresentation of extreme (T₆₀ > 1.5 s) or outdoor environments; evaluation-set size constrained by the exclusive use of measured (non-simulated) RIRs; performance degrades at low SNR and for certain noise types (fan, babble).
- ACEVAL for CAPTCHAs: Model capabilities are rapidly evolving; real-world deployment may differ from laboratory challenge settings; human accessibility must be validated continually, especially for new perceptual mechanisms.
- OpenACE: Objective metrics (e.g., VISQOL) do not always perfectly predict human perceptual judgments; high-arousal emotional speech remains a systematic challenge for all codecs at constrained bitrates.
A plausible implication is that future ACEval evolutions in all three domains will require cross-modal hybrid evaluation protocols, expanded datasets representing real-world variance, and tighter coupling of objective and subjective quality metrics.
6. Significance for the Research Ecosystem
The ACEval family of frameworks represents a trend towards open, reproducible, and extensible benchmarks in audio technology and acoustics, filling reproducibility gaps and enabling rigorous, system-level comparison under realistic, sometimes adversarial, conditions. By formalizing metrics and workflows—such as Bypass Rate and Human Success Rate in security, or MSE and Pearson ρ in acoustics—ACEval creates a shared technical foundation for research and industry alike. This supports reproducible science, fair cross-system benchmarking, and progress toward more robust, perceptually aligned, and secure audio technologies (Eaton et al., 2015, Coldenhoff et al., 2024, Ding et al., 13 Jan 2026).