
PhyAVBench: T2AV Audio-Physics Evaluation

Updated 7 January 2026
  • PhyAVBench is a large-scale, fine-grained benchmark designed to assess audio-physics sensitivity in T2AV models by isolating single controlled physical variables.
  • The evaluation method uses 1,000 paired prompts with 20+ curated videos per prompt and introduces the Contrastive Physical Response Score (CPRS) to measure physical response directionality.
  • Results indicate that while current T2AV systems achieve high semantic alignment, they often struggle with physically faithful audio generation, highlighting the need for physics-aware architectures.

PhyAVBench is a large-scale, fine-grained benchmark specifically designed to evaluate the audio-physics grounding capabilities of text-to-audio-video (T2AV) generation models. Unlike prior benchmarks that focus on semantic alignment or audiovisual synchronization, PhyAVBench rigorously probes a model’s understanding of the physical processes underlying sound generation and propagation, targeting deficiencies such as failure to correctly render material-dependent timbre, distance attenuation, Doppler shifts, and other acoustically grounded phenomena. The benchmark provides an extensive suite of 1,000 pairs of minimally differing prompts, each paired with at least 20 newly recorded or carefully curated real-world videos, and employs dedicated metrics such as the Contrastive Physical Response Score (CPRS) to isolate and evaluate models’ “audio-physics sensitivity” (Xie et al., 30 Dec 2025).

1. Motivation and Evaluation Objectives

The creation of PhyAVBench is motivated by the persistent limitations of contemporary T2AV systems, which often generate implausible or physically inconsistent audio given matched video and textual context. For instance, sounds generated for interactions with wood and metal may be indistinguishable, distant sources may be rendered at unrealistically high volumes, and distinctive signatures such as Doppler shifts are frequently ignored. Most existing evaluation frameworks emphasize semantic and temporal alignment, neglecting explicit assessment of whether underlying physical laws—such as the inverse-square distance law, wavelength-dependent diffraction, Helmholtz resonance, or Doppler effects—are learned or correctly internalized.

PhyAVBench aims to:

  • Enforce response sensitivity to single controlled physical variables, in contrast to reliance on spurious dataset biases or semantic co-occurrences.
  • Diagnose which physical phenomena current T2AV architectures are capable of modeling, and identify blind spots.
  • Establish a rigorous, standardized testbed for research on physically grounded audio-visual generation.

The benchmark’s paradigm is encapsulated in the concept of “audio-physics sensitivity”: a model’s ability to produce the correct directional change in audio output in response to an explicit, physically relevant alteration in the prompt (Xie et al., 30 Dec 2025).

2. Audio-Physics Sensitivity Test (APST) and Key Metrics

At the core of PhyAVBench is the Audio-Physics Sensitivity Test (APST), an evaluation framework that isolates a single physical variable between paired prompt input texts (PITs):

  • Prompt Pairing: Each pair differs by exactly one physically controlled variable (e.g., “knock on wood” vs. “knock on metal,” “fan at low speed” vs. “fan at high speed”).
  • Ground-Truth Direction: For each prompt, multiple real recordings are encoded with a frozen audio encoder \Phi, such as that from CAV-MAE Sync. The mean feature representations \mathbf{e}_{a,GT} and \mathbf{e}_{b,GT} are computed for prompts P_a and P_b; their difference \mathbf{v}_{GT} = \mathbf{e}_{b,GT} - \mathbf{e}_{a,GT} captures the “true” feature shift induced by the physical manipulation.
  • Generated Direction: The model generates one sample per prompt; the samples are encoded to form \mathbf{v}_{gen} = \Phi(b_{gen}) - \Phi(a_{gen}).
  • Contrastive Physical Response Score (CPRS): The cosine similarity between \mathbf{v}_{gen} and \mathbf{v}_{GT}, rescaled to [0, 1]:

\mathrm{CPRS} = \frac{1}{2} \left( \frac{\mathbf{v}_{gen} \cdot \mathbf{v}_{GT}}{\|\mathbf{v}_{gen}\|\,\|\mathbf{v}_{GT}\|} + 1 \right)

A CPRS of 1 implies perfect directional agreement, 0.5 indicates orthogonality (no sensitivity), and 0 indicates complete anti-alignment.
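In code, the score reduces to a rescaled cosine similarity over encoder-feature differences. A minimal NumPy sketch, with toy vectors standing in for real frozen-encoder embeddings (all names are illustrative, not the benchmark's actual implementation):

```python
import numpy as np

def cprs(v_gen: np.ndarray, v_gt: np.ndarray) -> float:
    """Contrastive Physical Response Score: cosine similarity rescaled to [0, 1]."""
    cos = float(v_gen @ v_gt / (np.linalg.norm(v_gen) * np.linalg.norm(v_gt)))
    return 0.5 * (cos + 1.0)

# Toy stand-ins for averaged ground-truth embeddings of the two prompts.
e_a_gt = np.array([1.0, 0.0])
e_b_gt = np.array([0.0, 1.0])
v_gt = e_b_gt - e_a_gt                 # ground-truth feature shift
v_gen = np.array([-0.8, 0.9])          # generated shift, roughly aligned
score = cprs(v_gen, v_gt)              # near 1: correct directional response
```

A generated shift pointing opposite to the ground-truth shift would score near 0, and an unrelated (orthogonal) shift near 0.5, matching the interpretation above.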

Complementary metrics incorporated in the benchmark include:

  • Fine-Grained Alignment Score (FGAS): Measures audiovisual synchronization by aligning per-frame audio and visual tokens, with

s_{i,j} = \frac{\mathbf{f}_i^v \cdot \mathbf{f}_j^a}{\|\mathbf{f}_i^v\|\,\|\mathbf{f}_j^a\|}, \qquad \mathrm{FGAS} = \frac{1}{T} \sum_{t=1}^{T} s_{t,t}

  • CLAP Score: Semantic alignment between text and audio.
  • CLIP-SIM: Semantic alignment between text and video.
  • Signal Fidelity (FAD, FVD, KVD): Standard signal matching and fidelity metrics.
  • WER (Whisper): Speech transcription accuracy.
  • Physical Rationality MOS (PR-MOS): Human 5-point mean opinion score specifically rating physical plausibility under each prompt condition (Xie et al., 30 Dec 2025).
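The FGAS computation, the mean over the diagonal s_{t,t} of the per-frame cosine-similarity matrix, can be sketched in NumPy, assuming visual and audio token matrices of shape (T, d) from temporally aligned encoders (shapes and names are assumptions for illustration):

```python
import numpy as np

def fgas(f_v: np.ndarray, f_a: np.ndarray) -> float:
    """Fine-Grained Alignment Score: mean cosine similarity over
    temporally matched visual/audio token pairs (the s_{t,t} diagonal)."""
    f_v = f_v / np.linalg.norm(f_v, axis=1, keepdims=True)
    f_a = f_a / np.linalg.norm(f_a, axis=1, keepdims=True)
    return float(np.mean(np.sum(f_v * f_a, axis=1)))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))     # T=16 frames, 64-dim features
perfect = fgas(tokens, tokens)         # identical streams give 1.0
```

Only diagonal pairs contribute, so the score rewards frame-accurate synchronization rather than global audio/video similarity.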

3. Benchmark Taxonomy and Structure

PhyAVBench is organized along three structural axes: audio-physics dimensions, real-world scenarios, and fine-grained test points.

Audio-Physics Dimensions:

  1. Sound Source Mechanics: Material hardness, geometry, impact velocity.
  2. Fluid and Aerodynamics: Flow rate, Helmholtz resonance, viscosity effects, aerodynamic “whoosh.”
  3. Sound Propagation Environment: Reverberation, diffraction, reflection, occlusion.
  4. Observer Physics: Distance law, air absorption, Doppler shift, binaural cues.
  5. Time and Causality: AV delay, transient behavior, periodicity.
  6. Complex Coupling & Extreme Physics: Phase transitions, shock waves, electromagnetic-acoustic interactions.

Real-World Scenarios:

  • Music
  • Sound effects (SFX)
  • Speech
  • Mixed content

Scenarios span indoor/outdoor, natural/man-made, static/dynamic sources.

Fine-Grained Test Points:

  • 50 test points (3–4 per dimension), e.g., “hard vs. soft surface knock,” “water vs. honey pour,” “source behind a corner (diffraction),” “supersonic whip crack,” “binaural vertical localization.”

Overall, 1,000 paired prompts are constructed (≈20 per test point), each isolating a unique, controllable physical factor (Xie et al., 30 Dec 2025).
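A benchmark item of this kind can be pictured as a small record pairing two minimally differing prompts with their reference videos. The sketch below is a hypothetical schema for illustration, not PhyAVBench's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class PromptPair:
    """One paired test item: two prompts differing in a single physical
    variable, each backed by at least 20 reference recordings.
    Field names are illustrative, not the benchmark's real schema."""
    test_point: str                 # e.g. "hard vs. soft surface knock"
    dimension: str                  # one of the six audio-physics dimensions
    prompt_a: str
    prompt_b: str
    videos_a: list = field(default_factory=list)  # paths to reference videos
    videos_b: list = field(default_factory=list)

pair = PromptPair(
    test_point="material hardness (knock)",
    dimension="Sound Source Mechanics",
    prompt_a="A person knocks on a wooden door.",
    prompt_b="A person knocks on a metal door.",
)
```

The single-variable constraint lives in the prompt text itself: everything except the manipulated factor (here, the material) is held fixed between prompt_a and prompt_b.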

| Axis | Examples | Role |
|---|---|---|
| Audio-Physics Dim. | Material, fluid flow, Doppler effect | Dissects acoustic mechanisms |
| Scenarios | SFX, music, speech | Contextual diversity |
| Test Points | Knock, pour, occlusion, resonance | Isolates specific variables |

4. Data Collection and Quality Assurance

To eliminate bias from existing pretraining corpora and ensure experimental integrity, PhyAVBench exclusively comprises videos that are newly recorded or meticulously curated:

  • Per prompt: A minimum of 20 independent real-world video recordings, spanning distinct performers, recording devices, and environments, each tightly controlled for the single-variable difference articulated in the prompt.
  • Pipeline:
  1. Physics Knowledge Survey: LLM-assisted brainstorming and expert curation to identify relevant phenomena.
  2. Taxonomy Construction: Hierarchical organization by underlying physics.
  3. Prompt Design: LLM-generated templates and manual refinement; ensures linguistic minimalism and single-factor isolation.
  4. Recording and Collection: Laboratory and in-situ capture under strict protocols.
  5. Iterative LLM + Human QC: Screening for confounders, ambiguities, and AV misalignment. Items failing QC are re-recorded or excluded.

Both prompt-wordings and recorded videos are subjected to multiple cycles of human review and error correction to guarantee dataset fidelity and high quality (Xie et al., 30 Dec 2025).

5. Acoustic Formulas and Theoretical References

While PhyAVBench does not introduce new acoustic theory, canonical relationships underpin both data selection and evaluation:

  • Inverse-square distance law:

\Delta L_p = 20 \log_{10}( r_1 / r_2 )

where r_1 and r_2 denote source–observer distances and \Delta L_p is the sound pressure level difference in dB.

  • Helmholtz resonance:

f = \frac{c}{2\pi} \sqrt{\frac{A}{V L}}

with c the speed of sound, A the neck cross-sectional area, L the effective neck length, and V the cavity volume.

  • Doppler shift for source motion:

f_{obs} = f_s \frac{c}{c \mp v_s}

for source speed v_s and sound speed c, with the upper (minus) sign for an approaching source, raising the observed frequency, and the lower (plus) sign for a receding one (Xie et al., 30 Dec 2025).

These relationships define the expected outcome and directionality for physical variable manipulations in the benchmark.
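These expected directions can be checked numerically. The short sketch below evaluates the three formulas with illustrative values (the cavity dimensions and source speed are invented for the example):

```python
import math

C = 343.0  # speed of sound in air, m/s (approx., at 20 °C)

# Inverse-square distance law: doubling the distance drops the level ~6 dB.
delta_db = 20 * math.log10(1.0 / 2.0)        # r1 = 1 m, r2 = 2 m → about -6.02 dB

# Helmholtz resonance of a bottle-like cavity (illustrative dimensions).
A, L, V = 3e-4, 0.05, 1e-3                   # neck area m^2, neck length m, volume m^3
f_res = C / (2 * math.pi) * math.sqrt(A / (V * L))

# Doppler shift: a 440 Hz source moving at 30 m/s.
f_approach = 440.0 * C / (C - 30.0)          # approaching: pitch rises
f_recede = 440.0 * C / (C + 30.0)            # receding: pitch falls
```

A model with genuine audio-physics sensitivity should reproduce exactly these directional changes: quieter when farther, higher-pitched on approach, and a resonance that shifts with cavity geometry.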

6. Experimental Results and Model Limitations

Benchmark experiments on six representative T2AV systems (two commercial, four open-source) reveal:

  • High semantic and temporal alignment: CLAP and CLIP-SIM > 0.75; FGAS ≈ 0.7–0.8 across all models.
  • Low physics sensitivity: Average CPRS ≈ 0.55, barely exceeding the chance level (0.5). Correct directional responses are obtained for simple mechanics (material, object size; CPRS ≈ 0.6), but models manifest almost no sensitivity to fluid/aero effects, resonance, or more complex physical couplings (CPRS < 0.5).
  • Systematic failures: Distance law and Doppler shifts are almost never realized.
  • Human ratings: PR-MOS ≈ 3/5 for overall plausibility, but 2.2/5 for complex physics points.

These findings indicate that current T2AV systems, despite producing “plausible” audio in the generic sense, do not generally encode or apply the physical laws necessary for realistic, contextually accurate sound generation (Xie et al., 30 Dec 2025).

7. Implications and Prospects for Physically Grounded T2AV Generation

The results of PhyAVBench establish that semantic pretraining and cross-modal alignment do not suffice for enforcing physical consistency in generated audio-visual content. The benchmark provides an actionable foundation for the development of physics-aware T2AV architectures and targeted pretraining protocols. Potential research directions include:

  • Incorporation of differentiable wave-propagation simulators or physics-informed neural modules into T2AV systems.
  • Task-specific pretraining on datasets designed to stress particular physics phenomena (e.g., explicit resonance prediction, Doppler shift categorization).
  • Benchmark expansion to encompass extreme phenomena (high-energy shock waves, electromagnetic effects), longer-duration events (room impulse responses), and multi-source interaction.
  • Closer integration between visual branch geometry/material estimation and audio synthesis for end-to-end physically grounded scene understanding.

A plausible implication is that further progress in T2AV modeling will require architectural innovations and data strategies that explicitly encode or supervise with respect to physical laws, as semantic alignment metrics alone have proven insufficient in establishing true audio-physics sensitivity (Xie et al., 30 Dec 2025).
