Audio-Physics Sensitivity Test (APST)

Updated 7 January 2026

APST is a systematic evaluation framework that quantifies sensitivity to controlled audio-physical variations in both models and human listeners.
It employs paired conditions and metrics such as CPRS, GMcos, and CPM to rigorously benchmark perceptual and computational responses.
The framework is applied in generative model testing, clinical audiology, and sensory studies to provide actionable, quantifiable insights.

The Audio-Physics Sensitivity Test (APST) is a set of paradigms and computational procedures that quantify how sensitive models, human listeners, and objective metrics are to physically interpretable variations in audio parameters. It serves as a core methodology for probing whether variable physical conditions (e.g., material changes, noise levels, synthesis parameters) are reflected, perceptually, computationally, or generatively, in the corresponding audio output or signal representation, in both models and humans. Its formalizations span psychophysics, deep feature-space geometry, and functional evaluation of large-scale generative models, establishing a general framework for audio-physics benchmarking.

1. Conceptual Foundations and Definitions

APST was originally articulated as a paradigm to systematically evaluate whether a given system (biological or computational) displays sensitivity to controlled physical manipulations underlying audio signals. In recent work on text-to-audio-video generation and deep audio feature analysis, APST is defined as:

An evaluation framework in which a single physical variable is controlled between paired (or grouped) conditions, such as the size of an object, the length of a cavity, or the reverberation time, thereby inducing a predictable, physically grounded direction of change in sound (Xie et al., 30 Dec 2025, Deng et al., 27 Jan 2025).
The task is then to measure whether a model, metric, or human listener can detect, align with, or reproduce this direction of change: for generative models, this involves the Contrastive Physical Response Score (CPRS), and for objective metrics, sensitivity is assessed by monotonicity and consistency with human perceptual rankings (Gupta et al., 2022, Pillonetto et al., 2024).
In human psychophysics, the term is generalized to encompass quantitative comparisons of detection thresholds or discrimination under physical stimulus manipulations, as in cross-modal or disturbance sensitivity studies (Hudin et al., 2020, Pillonetto et al., 2024).

2. APST Protocols in Generative and Evaluative Benchmarks

Construction and Scenarios

In generative model evaluation, APST is instantiated as the core protocol in PhyAVBench (Xie et al., 30 Dec 2025):

Test Scenario Design: For each of 50 test points drawn from six major audio-physics dimensions, curated text prompts are generated that differ in only one physically controlled variable (e.g., "knock on a wooden door" vs. "knock on a metal door"), with all other conditions held constant.
Data Curation: Each prompt is paired with at least 20 new, real-world video recordings, avoiding pre-training data leakage.
Scenarios: APST covers music, sound effects, speech, and mixtures, ensuring ecological validity.

Physical Dimensions

The test points are organized into a hierarchical taxonomy:

Dimension	Subdomains (Examples)	Representative Phenomena
Sound Source Mechanics	Material, geometry, contact, structure	Hardness, size, tension
Fluid and Aerodynamics	Volume, resonance, viscosity	Flow rate, Helmholtz resonance
Sound Propagation Environment	Reverberation, occlusion, diffraction	Room size, diffraction, underwater
Observer Physics	Distance, Doppler, binaural cues	Inverse-square law, localization
Time and Causality	Delay, rhythm, synchrony	Far-field delays, periodicity
Extreme and Coupled Physics	Phase transition, explosion, coupling	Boiling, shock waves, Tesla coil

3. Objective and Perceptual Metric Sensitivity

Deep Feature and Physics-Sensitive Distances

APST systematically characterizes the sensitivity of objective audio metrics to controlled parameter changes, which is essential for both evaluation and model diagnosis (Gupta et al., 2022):

Metrics: Includes Gram-matrix metrics (GM, GMcos), Accumulated Gram (AGM), and cochlear-parametric metrics (CPM), as well as FAD and L2.
Parameter Sweep: For each texture or audio class, one sound-structure or synthesis parameter is swept over a defined range (e.g., pitch, rate, event density); instances at each level are randomly resynthesized to capture intra-level variability.
Sensitivity Curves: Metric sensitivity is quantified by correlating distance curves (metric vs. parameter value) to ground-truth or human judgments, typically using Pearson’s $r$ as a monotonicity measure.
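The sweep-and-correlate procedure above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the helper names (`sensitivity_curve`, `pearson_r`, `l2_distance`) and the toy feature vectors are hypothetical, and plain L2 distance stands in for whichever metric is under test.

```python
import numpy as np

def sensitivity_curve(reference, instances_by_level, distance):
    """Mean distance from the reference to the instances at each parameter level."""
    return np.array([
        np.mean([distance(reference, x) for x in instances])
        for instances in instances_by_level
    ])

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def l2_distance(a, b):
    """Euclidean distance, standing in for any metric under test."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Toy sweep (assumption): feature vectors that drift with the parameter level,
# with several random instances per level to capture intra-level variability.
rng = np.random.default_rng(0)
levels = np.linspace(0.0, 1.0, 6)
reference = np.zeros(16)
instances_by_level = [
    [level * np.ones(16) + 0.05 * rng.standard_normal(16) for _ in range(8)]
    for level in levels
]

curve = sensitivity_curve(reference, instances_by_level, l2_distance)
r = pearson_r(levels, curve)
print("distance curve:", np.round(curve, 3))
print("Pearson r vs. parameter level:", round(r, 3))
```

In practice the distance function would be replaced by the metric being assessed, and the parameter levels by human judgments where those serve as the reference.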

Practical Interpretation

GMcos and AGM display the highest sensitivity to rate-like, impulsive changes (event frequency).
CPM excels in tracking spectral or envelope shifts (e.g., pitch, resonance).
No single metric captures all physical transformations: APST therefore recommends a complementary "suite" of metrics selected according to the physical parameter under investigation.

4. Feature-Space Probing in Foundation Audio Models

APST has been extended to probe the geometry of feature space in large pre-trained audio models (OpenL3, PANNs, CLAP) (Deng et al., 27 Jan 2025):

Parameterized Effect Application: For an input $x[n]$ , effects such as gain ( $\alpha$ in dB), low-pass filtering (cutoff frequency), reverberation (room size), and bitcrushing (bit depth) are systematically modulated.
Embedding Trajectories: Embeddings of the original and effected signals are computed: displacement vectors $\mathbf{d}(\alpha) = \mathbf{e}(\alpha) - \mathbf{e}_0$ are tracked over effect strength.
Canonical Correlation Analysis (CCA): The direction in embedding space most aligned with variation in $\alpha$ is identified via CCA, which also yields an intrinsic dimensionality spectrum. Key findings:
- Embedding deformation induced by effects shows a dominant (monotonic) axis but is globally high-dimensional.
- Projecting out this CCA-derived direction changes downstream robustness only marginally; linear desensitization is insufficient to neutralize effect sensitivity in embeddings.
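The displacement tracking and direction removal described above can be sketched with NumPy. The sketch rests on one simplification: for a single scalar effect parameter, CCA reduces to a least-squares regression of $\alpha$ on the centered embeddings, so `np.linalg.lstsq` is used in place of a full CCA routine. The function names and the synthetic embeddings are hypothetical stand-ins; real embeddings would come from a model such as OpenL3, PANNs, or CLAP.

```python
import numpy as np

def displacement_vectors(embeddings, e0):
    """d(alpha) = e(alpha) - e_0 for every effect strength."""
    return np.asarray(embeddings, dtype=float) - np.asarray(e0, dtype=float)

def effect_direction(embeddings, alphas):
    """Unit direction in embedding space most correlated with alpha.

    For scalar alpha the canonical direction is proportional to the
    least-squares coefficients of alpha regressed on centered embeddings.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)
    y = np.asarray(alphas, dtype=float)
    y = y - y.mean()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)

def project_out(embeddings, direction):
    """Remove the component of each embedding along a unit direction."""
    X = np.asarray(embeddings, dtype=float)
    return X - np.outer(X @ direction, direction)

# Synthetic stand-in for model embeddings (assumption): the effect moves the
# embedding along one axis, plus isotropic noise in every dimension.
rng = np.random.default_rng(0)
dim, n = 8, 400
alphas = np.linspace(0.0, 1.0, n)
true_dir = np.zeros(dim)
true_dir[0] = 1.0
e0 = rng.standard_normal(dim)
embeddings = e0 + np.outer(alphas, true_dir) + 0.05 * rng.standard_normal((n, dim))

d = displacement_vectors(embeddings, e0)
w = effect_direction(embeddings, alphas)
cosine = float(w @ true_dir)          # alignment with the planted direction
residual = project_out(d, w) @ w      # component left along w after removal
print("cosine with planted direction:", round(cosine, 3))
```

Removing a single direction zeroes only the component along it, which is consistent with the finding that linear desensitization leaves much of the effect-induced deformation in place when that deformation is high-dimensional.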

5. APST in Human and Clinical Psychophysics

APST is deployed as a quantitative protocol for measuring perceptual thresholds or discrimination under physical or noise manipulations (Hudin et al., 2020, Pillonetto et al., 2024):

Cross-Modal Thresholds: APST normalizes auditory and tactile thresholds onto a common physical scale (displacement), enabling direct comparison via plane-wave (hearing) and baffled-piston (touch) models.
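- Key relationships:
- Auditory (plane wave): $\delta_{\rm aud}(f) = \frac{P_{\rm thr}(f)}{Z\,2\pi f}$ , where $P_{\rm thr}(f)$ is the threshold sound pressure at frequency $f$ and $Z$ is the acoustic impedance of the medium.
- Tactile (baffled piston): $P(r,f) = \frac{1}{2}\frac{\rho a^2}{r}(2\pi f)^2 U_p$ , where $\rho$ is the density of the medium, $a$ the piston radius, $r$ the distance from the piston, and $U_p$ the piston displacement amplitude.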
Clinical Sensitivity Functions: In clinical audiology applications (e.g., ASD diagnosis), APST operationalizes disturbance sensing as the SNR at which 50% correctness is achieved, tracked over levels of external noise as a Threshold-versus-Noise (TvN) curve. Both the slope and intercept (related to internal noise) serve as diagnostic markers (Pillonetto et al., 2024):
- ASD individuals display higher internal noise and steeper TvN slopes compared to controls.
- Signal-processing pipelines leveraging harmonic feature estimation and gammatone filtering further quantify interaction between noise sensitivity and speech harmonics.
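As a worked illustration of the two cross-modal relationships above, the sketch below evaluates them numerically. The function names are hypothetical, and the numerical values for air (impedance of about 413 Pa·s/m, density of about 1.2 kg/m³) and the piston geometry are assumptions chosen for the example, not values taken from the cited studies.

```python
import math

def auditory_displacement(p_thr, f, z):
    """Plane-wave model: delta_aud(f) = P_thr(f) / (Z * 2*pi*f), in meters."""
    return p_thr / (z * 2.0 * math.pi * f)

def piston_pressure(u_p, f, a, r, rho):
    """Baffled-piston model: P(r, f) = 0.5 * rho * a^2 / r * (2*pi*f)^2 * U_p, in Pa."""
    return 0.5 * rho * a ** 2 / r * (2.0 * math.pi * f) ** 2 * u_p

Z_AIR = 413.0    # assumed characteristic impedance of air, Pa*s/m
RHO_AIR = 1.2    # assumed density of air, kg/m^3

# Displacement equivalent of a 20 micropascal pressure threshold at 1 kHz.
delta = auditory_displacement(p_thr=20e-6, f=1000.0, z=Z_AIR)

# Pressure at 10 cm from a 5 mm radius piston vibrating with 1 micrometer
# displacement amplitude at 250 Hz (illustrative values only).
pressure = piston_pressure(u_p=1e-6, f=250.0, a=5e-3, r=0.1, rho=RHO_AIR)

print(f"auditory displacement: {delta:.3e} m")
print(f"piston pressure: {pressure:.3e} Pa")
```

Expressing both thresholds as displacements (or both as pressures) is what allows them to be placed on a common physical axis.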

6. Evaluation Metrics and Scoring

For foundation-model and generative text-to-audio-video (T2AV) evaluation, APST introduces the Contrastive Physical Response Score (CPRS) (Xie et al., 30 Dec 2025):

Given a prompt pair $P_a \to P_b$ :
1. Compute the ground-truth and generated mean embeddings for each prompt, and from them the ground-truth direction in embedding space $v_{\mathrm{GT}}$ (from $e_{a,\mathrm{GT}}$ to $e_{b,\mathrm{GT}}$ ) and the corresponding generated direction $v_{\mathrm{gen}}$ .
2. CPRS $= \frac{1}{2} \left[ \frac{v_{\mathrm{gen}} \cdot v_{\mathrm{GT}}}{\|v_{\mathrm{gen}}\|\|v_{\mathrm{GT}}\|} + 1 \right ]$ .
3. CPRS $=1$ indicates perfect physical alignment, $0.5$ no alignment (orthogonal directions), and $0$ anti-alignment.
Additional scores such as FGAS, CLAP, CLIPSIM, FAD, WER, and PR-MOS address synchronization, semantic consistency, audio quality, and human-rated physical rationality.
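The CPRS formula can be sketched in a few lines of NumPy. The sketch assumes that each direction is the difference between the per-prompt mean embeddings ( $v = e_b - e_a$ ), which is one natural reading of "direction in embedding space"; the function name `cprs` and the synthetic embeddings are hypothetical.

```python
import numpy as np

def cprs(gt_a, gt_b, gen_a, gen_b):
    """Contrastive Physical Response Score for one prompt pair P_a -> P_b.

    Each argument is an array of shape (n_clips, dim) holding embeddings of
    ground-truth or generated clips for one prompt. Directions are taken as
    differences of mean embeddings (an assumption of this sketch).
    """
    v_gt = np.mean(gt_b, axis=0) - np.mean(gt_a, axis=0)
    v_gen = np.mean(gen_b, axis=0) - np.mean(gen_a, axis=0)
    norm = np.linalg.norm(v_gt) * np.linalg.norm(v_gen)
    if norm == 0.0:
        raise ValueError("direction vector has zero length")
    cosine = float(v_gen @ v_gt) / norm
    return 0.5 * (cosine + 1.0)

# Synthetic embeddings (assumption): the physical change shifts axis 0.
rng = np.random.default_rng(0)
base = rng.standard_normal((5, 4))
shift = np.array([1.0, 0.0, 0.0, 0.0])
sideways = np.array([0.0, 1.0, 0.0, 0.0])

score_aligned = cprs(base, base + shift, base, base + 2.0 * shift)
score_orthogonal = cprs(base, base + shift, base, base + sideways)
score_flipped = cprs(base, base + shift, base, base - shift)
print(score_aligned, score_orthogonal, score_flipped)
```

Because the score depends only on the cosine between the two directions, a model that changes the sound in the right direction but by the wrong amount still scores close to 1.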

7. Practical Implementation and Guidelines

A canonical APST protocol for arbitrary sound classes or model types involves:

Isolating and varying a single physically meaningful parameter over a defined range with dense intermediate levels.
Generating diverse random instances at each level (at least 8–10).
Applying multiple metrics or embedding analyses with complementary sensitivities (e.g., GMcos for temporal, CPM for spectral).
Quantifying consistency and sensitivity using monotonicity statistics with perceptual validation.
For human studies, implementing adaptive 2-AFC staircases or method of constant stimuli with detailed calibration and psychometric fitting.
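For the human-study step, one common adaptive procedure is the 2-down/1-up staircase, which converges near the 70.7%-correct point of the psychometric function. The sketch below simulates it against a hypothetical listener with a logistic psychometric function. The specific rule, the listener model, and all parameter values are assumptions for illustration; the source specifies only "adaptive 2-AFC staircases".

```python
import math
import random

def simulated_listener(level, threshold, slope, rng):
    """Return True if a simulated 2-AFC listener answers correctly."""
    p_detect = 1.0 / (1.0 + math.exp(-slope * (level - threshold)))
    p_correct = 0.5 + 0.5 * p_detect  # guessing floor of 50% in 2-AFC
    return rng.random() < p_correct

def two_down_one_up(respond, start, step, n_reversals=12, max_trials=2000):
    """Run a 2-down/1-up staircase; return (threshold estimate, reversal levels)."""
    level = start
    streak = 0        # consecutive correct responses at the current level
    direction = 0     # last move: -1 down, +1 up, 0 before the first move
    reversals = []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if respond(level):
            streak += 1
            if streak < 2:
                continue          # need two correct in a row before stepping down
            move = -1
        else:
            move = +1             # a single error steps the level up
        streak = 0
        if direction != 0 and move != direction:
            reversals.append(level)
        direction = move
        level += move * step
    tail = reversals[-8:]         # average the last reversals as the estimate
    if not tail:
        raise RuntimeError("no reversals recorded")
    return sum(tail) / len(tail), reversals

rng = random.Random(0)
estimate, reversals = two_down_one_up(
    lambda level: simulated_listener(level, threshold=10.0, slope=1.0, rng=rng),
    start=20.0,
    step=1.0,
)
print("estimated threshold level:", estimate)
```

A full protocol would add stimulus calibration and a psychometric-function fit to the trial data rather than relying on the reversal average alone.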

For benchmarking generative models, APST ensures that only physically grounded, contrastive phenomena are assessed, forcing models to move beyond replicating data priors and instead demonstrate genuine sensitivity to physical law (Xie et al., 30 Dec 2025).

References:

(Gupta et al., 2022, Pillonetto et al., 2024, Deng et al., 27 Jan 2025, Xie et al., 30 Dec 2025, Hudin et al., 2020)