ASMR-Bench: ASMR Media in ML Evaluation
- ASMR-Bench is a suite of benchmarks that use ASMR media and methods to evaluate three things: sabotage detection in ML research, whispered speech processing, and the perceptual realism of AI-generated content.
- It provides structured scenarios across three domains: detecting subtle code manipulations, analyzing whispered ASMR speech, and assessing AI-generated audio-visual content.
- The protocols enable both automated and human audits, yielding actionable insights for enhancing security and robustness in machine learning workflows.
ASMR-Bench refers to a family of benchmarks and protocols that use Autonomous Sensory Meridian Response (ASMR) media and methodologies for technical evaluation in three domains: (1) auditing for sabotage in machine learning research, (2) processing whispered speech found in ASMR content, and (3) perceptual realism testing of AI-generated audio-visual content. The benchmarks play distinct roles in machine learning research, evaluation, and security, but share one principle: they use the idiosyncratic properties of ASMR media or scenarios as a lens for rigorous investigation.
1. Auditing for Sabotage in ML Research: ASMR-Bench (Gan et al., 17 Apr 2026)
As AI systems take on automated ML research, subtle “sabotage” (bugs or manipulations that change experimental conclusions while evading detection) has become a prominent risk. ASMR-Bench (Auditing for Sabotage in ML Research) is a benchmark that quantifies how well LLMs and human experts, alone or in combination, detect and diagnose such sabotage in genuine ML research codebases.
Key features:
- Composition: Nine ML research codebases, each with an honest implementation and one or two subtle “sabotaged” variants (11 sabotaged variants in total). Only implementation details are altered (e.g., hyperparameters, evaluation code, prompt templates); the high-level methodology is preserved.
- Objective: For each paper–code pair, auditors (a) assign a credence, as a percentage, that the code is sabotaged, and (b) propose up to five fixes, ranked by confidence.
- Audit affordances: Three levels—(i) Paper-Only (tests priors), (ii) No-Experiments (read/rerun code/unit-tests), (iii) Full-Experiments (fully executable).
- Codebase variety: The source papers span topics including preferential learning, chain-of-thought monitoring, evaluation-aware steering, steganographic collusion, and multi-turn jailbreaking.
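The audit task described above (a credence, up to five ranked fixes, and one of three affordance levels) can be pictured as a small record type. The following is a minimal sketch only: the names `Affordance`, `AuditReport`, and `top_fix` are hypothetical and are not taken from the benchmark's own code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Affordance(Enum):
    """The three audit affordance levels listed above (identifiers are illustrative)."""
    PAPER_ONLY = "paper-only"
    NO_EXPERIMENTS = "no-experiments"
    FULL_EXPERIMENTS = "full-experiments"


MAX_FIXES = 5  # the task allows at most five proposed fixes


@dataclass
class AuditReport:
    """One auditor's answer for one paper-code pair (hypothetical schema)."""
    codebase: str
    affordance: Affordance
    credence: float  # percentage in [0, 100] that the code is sabotaged
    fixes: List[str] = field(default_factory=list)  # most confident fix first

    def __post_init__(self) -> None:
        if not 0.0 <= self.credence <= 100.0:
            raise ValueError("credence must be a percentage in [0, 100]")
        if len(self.fixes) > MAX_FIXES:
            raise ValueError(f"at most {MAX_FIXES} fixes may be proposed")

    @property
    def top_fix(self) -> Optional[str]:
        """The highest-ranked fix, or None if the auditor proposed none."""
        return self.fixes[0] if self.fixes else None


report = AuditReport(
    codebase="example-codebase",  # placeholder name, not a real benchmark entry
    affordance=Affordance.NO_EXPERIMENTS,
    credence=70.0,
    fixes=["restore the documented number of epochs", "check sample selection"],
)
print(report.top_fix)
```

Validating the record at construction time keeps malformed submissions (a credence above 100, or a sixth fix) from reaching any later scoring step.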
A summary of sabotage types is presented below.
| Paper | Result Change | Sabotage Example |
|---|---|---|
| Subliminal Learning | No preference transfer vs. strong transfer | Fewer samples/epochs; bias sample selection |
| Feature vs Lazy Training | $O(\ldots)$ regime rather than $O(\ldots)$ | Weight initialization scaled by $1/h$, not […] |
| Steganographic Collusion | Partial vs full mitigation | Replace Shannon entropy with linear approximation |
| Multi-Turn Jailbreaking | Multi-turn outperforms single-turn (unchanged) | Extra LLM calls or index clamping |
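
To see why a change such as swapping Shannon entropy for a linear approximation can alter a reported result while looking innocuous, consider the sketch below. It is an illustration under stated assumptions, not the benchmark's actual sabotage: the source does not say which approximation was used, so `linear_entropy` is a hypothetical stand-in that replaces $\ln p$ with its first-order expansion $p - 1$.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats: H(p) = -sum_i p_i * ln(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def linear_entropy(probs):
    """Hypothetical stand-in: uses ln(p) ~ p - 1, giving sum_i p_i * (1 - p_i)."""
    return sum(p * (1.0 - p) for p in probs)


# Two illustrative distributions (made-up values, not benchmark data).
examples = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "peaked": [0.97, 0.01, 0.01, 0.01],
}

for name, probs in examples.items():
    exact = shannon_entropy(probs)
    approx = linear_entropy(probs)
    print(f"{name}: shannon={exact:.3f} linear={approx:.3f}")
```

Because $\ln p \le p - 1$ for all $p > 0$, this particular approximation never exceeds the true entropy, so any threshold or score built on it shifts systematically even though both functions look like reasonable "entropy" helpers in a code review.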
Audit results demonstrate that LLMs and humans both struggled: even the best LLM (Gemini 3.1 Pro) achieved only AUROC = 0.77 and top-1 fix rate = 42%. Both groups performed substantially worse on “design” sabotages (hyperparameter or prompt manipulation) than on “logic” sabotages (indexing, formula errors). Benign distractors—including undocumented implementation choices in honest code—contribute substantially to false positives.
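The two headline metrics can be computed from audit outputs as sketched below. This is a minimal sketch under assumptions: AUROC is computed with the standard pairwise (rank-based) definition, and the top-k fix rate assumes each proposed fix has already been judged as matching or not matching the true sabotage. The function names and the toy numbers are hypothetical, and the benchmark's own scoring procedure may differ.

```python
def auroc(credences, labels):
    """Probability that a sabotaged codebase gets a higher credence than an honest one.

    credences: sabotage credences (any consistent scale, e.g. percent).
    labels: truthy for sabotaged codebases, falsy for honest ones.
    Ties count as half a win, as in the usual rank-based definition.
    """
    pos = [c for c, y in zip(credences, labels) if y]
    neg = [c for c, y in zip(credences, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one sabotaged and one honest codebase")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_k_fix_rate(ranked_hits, k=1):
    """Fraction of sabotaged codebases whose true fix appears in the top k proposals.

    ranked_hits: one list per sabotaged codebase, in rank order, of booleans
    saying whether each proposed fix matches the true sabotage.
    """
    if not ranked_hits:
        raise ValueError("need at least one sabotaged codebase")
    return sum(any(hits[:k]) for hits in ranked_hits) / len(ranked_hits)


# Toy data, invented for illustration only.
credences = [90, 60, 40, 70, 20, 55]
labels = [1, 1, 1, 0, 0, 0]
hits = [[True, False], [False, True], [False]]

print(auroc(credences, labels))
print(top_k_fix_rate(hits, k=1), top_k_fix_rate(hits, k=5))
```

An AUROC of 0.5 corresponds to credences that carry no information about which codebases are sabotaged, which is the reference point for reading the reported scores.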
Subsequent experiments evaluated LLMs both as creators (red-teamers) and auditors; LLM-generated sabotages were generally less subtle (fix rates up to 68%) but still sometimes evaded detection by same-capability LLMs (AUROC ≈ 0.5 in some cases). This underscores the difficulty of reliably detecting implementation-level sabotage in the context of increasingly autonomous research workflows (Gan et al., 17 Apr 2026).
2. Whispered Speech Processing Frameworks and ASMR-Bench (Zarazaga et al., 2023)
ASMR content is a rich source of whispered speech, which has historically presented technical challenges for robust detection and separation from non-speech “triggers.” An independent development, also sometimes