
BSANN: Adaptive Binaural Audio Network

Updated 13 January 2026
  • BSANN is a neural framework that generates spatial binaural audio by adaptively conditioning on scene, source, and listener attributes, accurately reproducing ITD and ILD cues.
  • It integrates multimodal inputs like visual, positional, and depth data through object detection and Fourier embeddings to create a dynamic 3D audio map.
  • Physically informed modeling with diffusion-based architectures and adaptive filter banks ensures high-fidelity spatial rendering and efficient real-time deployment.

A Binaural Spatially Adaptive Neural Network (BSANN) is a neural framework designed to generate or render spatially precise binaural audio signals by adaptively conditioning on scene, source, and listener attributes. The architecture leverages deep learning to produce audio with accurate interaural cues, including interaural time difference (ITD) and interaural level difference (ILD), thereby yielding immersive, spatially consistent sound fields in real or synthesized environments. BSANN methods support dynamic source movement, head tracking, and per-ear acoustic control, integrating physically informed modeling—including Head-Related Transfer Functions (HRTFs) and loudspeaker directivity—within neural network training and inference workflows (Zhao et al., 18 Aug 2025, Jiang et al., 10 Jan 2026).

1. Architectural Fundamentals and Spatial Adaptivity

BSANN models comprise modular architectures reflecting the task domain:

  • Vision-Aligned Binaural Generation: FoleySpace implements BSANN as three submodules—(a) sound-source estimation via an object detector with depth inference, (b) analytic 2D-to-3D mapping, and (c) a diffusion-based waveform generator modulated by spatial embeddings (Zhao et al., 18 Aug 2025).
  • Personal Sound Zones (PSZs) Rendering: BSANN for PSZ uses a shared multilayer perceptron (MLP) that receives listener head pose vectors and produces frequency-domain loudspeaker filter coefficients for each ear of each tracked listener. Inputs are Fourier positional-encoded poses; ear locations are computed analytically (Jiang et al., 10 Jan 2026).

A characteristic feature is dynamic spatial adaptivity: neural parameters are modulated at runtime—via FiLM-style embedding layers (Zhao et al., 18 Aug 2025) or adaptive filter banks (Jiang et al., 10 Jan 2026)—by time-varying source or listener state, permitting sample-wise control over earwise audio properties.
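
A minimal PyTorch sketch of such FiLM-style modulation is shown below; the layer types and dimensions are illustrative assumptions rather than either paper's exact configuration.

```python
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    """Conv block whose features are scaled/shifted by a condition embedding."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Project the condition embedding to per-channel gamma and beta.
        self.to_gamma = nn.Linear(cond_dim, channels)
        self.to_beta = nn.Linear(cond_dim, channels)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, time), cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)   # (batch, channels, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return self.conv((1 + gamma) * h + beta)    # h_{l+1} = Conv[(1+gamma) * h + beta]

# Example: modulate 64-channel features with a 128-dim spatial condition.
block = FiLMConvBlock(channels=64, cond_dim=128)
out = block(torch.randn(2, 64, 1024), torch.randn(2, 128))   # (2, 64, 1024)
```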

2. Multimodal Input Processing and Coordinate Mapping

BSANN systems integrate visual, positional, and/or listener data:

  • In FoleySpace, video frames \{I_k\}_{k=1}^K, optionally accompanied by text queries, are processed by an open-vocabulary object detector (YOLO-World backbone), yielding 2D coordinates (w_k, h_k) and depth d_k for candidate sound sources. Cross-attention layers allow text-guided detection (Zhao et al., 18 Aug 2025).
  • Depth maps are generated by a monocular depth model (DepthMaster), with spatial values sampled at detected source locations: d_k = D_k[h_k, w_k].
  • A normalized mapping function translates image-derived (w_k, h_k, d_k) to physical 3D coordinates (x_k, y_k, z_k), smoothed by outlier rejection and interpolation:

x_k = \delta\,\tilde{d}_k,\quad y_k = \delta\,(w_k - W/2),\quad z_k = -\delta\,(h_k - H/2)

with \tilde{d}_k a normalized, scaled depth.
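
A minimal NumPy sketch of this image-to-world mapping follows; the scale factor delta and the depth normalization are placeholder choices, and the papers' outlier rejection and interpolation are omitted.

```python
import numpy as np

def image_to_world(w, h, d, img_w, img_h, delta=0.01, d_max=10.0):
    """Map a detected 2D source location plus depth to 3D listener-centric coordinates.

    w, h   : pixel coordinates of the detected source
    d      : monocular depth sampled at (h, w)
    delta  : global scale factor (illustrative value)
    d_max  : depth normalization constant (illustrative value)
    """
    d_norm = np.clip(d / d_max, 0.0, 1.0)     # normalized, scaled depth ~ tilde{d}_k
    x = delta * d_norm                         # forward axis from depth
    y = delta * (w - img_w / 2.0)              # left/right from horizontal offset
    z = -delta * (h - img_h / 2.0)             # up/down (image rows grow downward)
    return np.array([x, y, z])

# Example: a source detected at pixel (640, 360) in a 1280x720 frame, 3 m deep.
print(image_to_world(640, 360, 3.0, 1280, 720))
```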

In multi-listener PSZ tasks, the system encodes per-listener pose vectors (\mathbf{x}_i, \mathbf{q}_i) through a Fourier positional embedding, informing the adaptive filter design for each ear.
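
A minimal sketch of such a Fourier positional embedding is given below; the number of frequency bands and their spacing are assumptions, not the paper's exact choice.

```python
import numpy as np

def fourier_encode(pose: np.ndarray, num_bands: int = 8) -> np.ndarray:
    """Encode a pose vector (position + orientation) with sin/cos features
    at geometrically spaced frequencies, as in standard positional encodings."""
    freqs = 2.0 ** np.arange(num_bands)               # 1, 2, 4, ...
    angles = pose[:, None] * freqs[None, :] * np.pi   # (dim, num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

# Example: 3-D position plus 4-D quaternion -> 7 * 2 * 8 = 112 features.
pose = np.array([0.5, -0.2, 1.1, 1.0, 0.0, 0.0, 0.0])
print(fourier_encode(pose).shape)   # (112,)
```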

3. Diffusion-Based and Filter-Bank Binaural Audio Generation

FoleySpace Diffusion Model

The BSANN in FoleySpace uses pretrained video-to-monaural models to generate a monaural signal s^\mathrm{mono}(t), synchronizing it with the 3D trajectory \tilde{\mathcal{T}}. A modified DiffWave diffusion model receives the concatenated trajectory and monaural signals in a condition embedding C, modulating each convolutional block's output:

h_{\ell+1} = \mathrm{Conv}_\ell\big[(1+\gamma_\ell(C))\odot h_\ell + \beta_\ell(C)\big]

This mechanism enables real-time adaptation to moving sources, adjusting binaural cues in response to spatial information. The diffusion process optimizes the denoising loss:

L_\mathrm{diff} = \mathbb{E}_{z_0,\,\epsilon \sim \mathcal{N}(0,I),\,t}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\right]

(Zhao et al., 18 Aug 2025).
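
A minimal training-step sketch of this objective, assuming a hypothetical noise-prediction network eps_model(z_t, t, cond) and a simple placeholder noise schedule (neither is the paper's exact DiffWave setup):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, z0, cond, num_steps=1000):
    """One stochastic estimate of L_diff = E[||eps - eps_theta(z_t, t, c)||^2]."""
    b = z0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=z0.device)
    # Illustrative linear alpha-bar schedule (not the paper's exact schedule).
    alpha_bar = 1.0 - (t.float() + 1.0) / num_steps
    alpha_bar = alpha_bar.view(b, *([1] * (z0.dim() - 1)))
    eps = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * eps  # noised sample
    eps_hat = eps_model(z_t, t, cond)                             # noise prediction
    return F.mse_loss(eps_hat, eps)
```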

Ear-Optimized Filter Banks for PSZ

BSANN for PSZ predicts a filter bank \mathbf{g}(\omega_n) \in \mathbb{C}^{L \times 4} that determines the driving signals for the L loudspeakers and the 2×2 earwise stereo channels. The model outputs real and imaginary parts per frequency bin, with time-domain compactness and gain-limiting regularization.

The generated filters reconstruct target acoustic fields at designated ear control points, using a spatial mapping from input pose vectors and program channels (Jiang et al., 10 Jan 2026).
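
As a loose illustration of this output parameterization, the sketch below predicts real and imaginary filter parts per frequency bin with an MLP head and assembles them into complex filters. The feature dimension, loudspeaker count, bin count, and the reading of the four channels as 2 ears × 2 stereo program channels are all assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FilterBankHead(nn.Module):
    """MLP head predicting complex loudspeaker filters per frequency bin.

    Shapes are illustrative: L loudspeakers, C = 4 channels (read here as
    2 ears x 2 stereo program channels), N frequency bins.
    """
    def __init__(self, feat_dim: int, num_speakers: int, num_bins: int, num_ch: int = 4):
        super().__init__()
        self.L, self.C, self.N = num_speakers, num_ch, num_bins
        # Real and imaginary part for every (speaker, channel, bin) triple.
        self.proj = nn.Linear(feat_dim, 2 * num_speakers * num_ch * num_bins)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        out = self.proj(feat).view(-1, 2, self.L, self.C, self.N)
        return torch.complex(out[:, 0], out[:, 1])   # (batch, L, C, N), complex

# Loose usage example: combine filters with per-channel program spectra
# to obtain loudspeaker driving spectra.
head = FilterBankHead(feat_dim=256, num_speakers=8, num_bins=257)
g = head(torch.randn(1, 256))                              # (1, 8, 4, 257)
program = torch.randn(1, 4, 257, dtype=torch.complex64)    # placeholder program spectra
drive = (g * program.unsqueeze(1)).sum(dim=2)              # (1, 8, 257)
```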

4. Physically Informed Acoustic Modeling and HRTF Integration

Robust binaural rendering requires physically grounded modeling:

  • FoleySpace: Training data are constructed by convolving random mono waveforms with measured HRIRs h_L(\phi, \theta, d), h_R(\phi, \theta, d) from the HUTUBS database. Distance scaling is achieved via time-domain resampling. Moving sources are synthesized by segment-wise HRIR convolution, with cross-fades smoothing transitions (Zhao et al., 18 Aug 2025).
  • PSZ: Loudspeaker-to-ear transfer functions are constructed by merging simulated direct/reflected RIRs with anechoic speaker measurements, analytic directivity, and normalized rigid-sphere HRTFs:

\tilde{H}_{e,m,\ell}(\omega) = H^\mathrm{dir}_{e,m,\ell}(\omega)\,A_\ell(\omega)\,D_\ell(\omega,\theta_{e,m,\ell})\,H_{\mathrm{HRTF},e,m}(\omega) + H^\mathrm{refl}_{e,m,\ell}(\omega)\,A_\ell(\omega)

(Jiang et al., 10 Jan 2026).
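
A small NumPy sketch of this per-bin composition for one (ear, listener, loudspeaker) triple; the random arrays below are placeholders standing in for the simulated and measured responses.

```python
import numpy as np

num_bins = 257
rng = np.random.default_rng(0)

def rand_response():
    """Placeholder complex frequency response."""
    return rng.standard_normal(num_bins) + 1j * rng.standard_normal(num_bins)

H_dir  = rand_response()   # direct-path RIR spectrum
H_refl = rand_response()   # reflected-path RIR spectrum
A      = rand_response()   # anechoic loudspeaker response
D      = rand_response()   # analytic directivity at the ear's angle
H_hrtf = rand_response()   # normalized rigid-sphere HRTF

# Direct path is shaped by directivity and HRTF; the reflected path only by the speaker response.
H_total = H_dir * A * D * H_hrtf + H_refl * A
```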

This multimodal integration enables the network to learn accurate ITD/ILD and spatial cues across listener positions and dynamic scenes.

5. Training Strategies and Loss Functions

BSANN optimization incorporates multiple loss objectives:

  • Diffusion Loss: The denoising loss L_\mathrm{diff} aligns the generated waveform with HRIR-convolved binaural ground truth. Optional reconstruction and spatial regularizations (e.g., ITD/ILD targets) are described but not used in FoleySpace (Zhao et al., 18 Aug 2025).
  • PSZ Multi-Term Loss: The pretraining loss weights bright-zone matching, dark-zone suppression, gain limiting, and filter-tap compactness (see the sketch after this list):

\mathcal{L}_\mathrm{PSZ} = \alpha\,\mathcal{L}_\mathrm{BZ} + (1-\alpha)\,\mathcal{L}_\mathrm{DZ} + \beta\,\mathcal{L}_\mathrm{gain} + \gamma\,\mathcal{L}_\mathrm{compact}

  • Active XTC Stage: Fine-tuning with a crosstalk cancellation loss \mathcal{L}_\mathrm{XTC} reduces inter-ear leakage via off-diagonal, diagonal-matching, and regularizer penalties, with additional teacher-anchored terms preserving perceptual fidelity (Jiang et al., 10 Jan 2026).
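
A minimal sketch of combining the pretraining terms above; the individual penalties are rough proxies for the paper's bright/dark-zone matching, gain-limiting, and compactness terms, and the weights are illustrative.

```python
import torch

def psz_loss(pred_bz, target_bz, pred_dz, filters_time,
             alpha=0.8, beta=1e-2, gamma=1e-3, gain_limit=1.0):
    """Sketch of L_PSZ = alpha*L_BZ + (1-alpha)*L_DZ + beta*L_gain + gamma*L_compact."""
    l_bz = torch.mean(torch.abs(pred_bz - target_bz) ** 2)   # bright-zone matching
    l_dz = torch.mean(torch.abs(pred_dz) ** 2)               # dark-zone suppression
    # Rough gain-limiting proxy: hinge on total filter magnitude.
    l_gain = torch.relu(filters_time.abs().sum() - gain_limit)
    # Compactness proxy: penalize energy in late filter taps.
    tap_weight = torch.linspace(0.0, 1.0, filters_time.shape[-1],
                                device=filters_time.device)
    l_compact = torch.mean(tap_weight * filters_time.abs() ** 2)
    return alpha * l_bz + (1 - alpha) * l_dz + beta * l_gain + gamma * l_compact
```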

6. Objective Metrics and Empirical Performance

BSANN-based systems are assessed using frequency-weighted variants of:

  • Inter-Zone Isolation (IZI): Ratio of intended program energy at listener 1 vs. leakage to listener 2.
  • Inter-Program Isolation (IPI): Ratio of energy between intended and non-intended programs at the same listener.
  • Crosstalk Cancellation (XTC): Ratio of ipsilateral to contralateral energy in binaural rendering.
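
Each of these metrics reduces to an energy ratio expressed in decibels; a minimal sketch, with frequency weighting omitted and illustrative values:

```python
import numpy as np

def ratio_db(num_energy: float, den_energy: float) -> float:
    """10*log10 energy ratio, the common form of IZI / IPI / XTC metrics."""
    return 10.0 * np.log10(num_energy / den_energy)

# Example: intended-program energy at listener 1 vs. its leakage into listener 2's zone.
izi_1 = ratio_db(num_energy=1.0, den_energy=0.095)   # ~ 10.2 dB
```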

Reported metrics for BSANN-PSZ over 100–20,000 Hz:

| IZI₁ / IZI₂ (dB) | IPI₁ / IPI₂ (dB) | XTC₁ / XTC₂ (dB) |
|------------------|------------------|------------------|
| 10.23 / 10.03    | 11.11 / 9.16     | 10.55 / 11.13    |

Compared to SANN-PSZ, BSANN yields substantial increases, especially in XTC (+2.62/+2.94 dB), demonstrating improved robustness and stereo fidelity across listeners, even in room-asymmetric conditions (Jiang et al., 10 Jan 2026).

7. Deployment, Complexity, and Practical Considerations

BSANN deployment benefits from efficient architecture and physically grounded preprocessing:

  • Training Data: For PSZ, approximately 100,000 simulated RIRs across head/ear pose variations, with GPU-accelerated synthesis; FoleySpace trains on HRIR-convolved random mono clips (Zhao et al., 18 Aug 2025, Jiang et al., 10 Jan 2026).
  • Computation: Real-time operation is feasible (single MLP pass per pose/frequency bin, fast convolutional encode-decode for diffusion), contingent on low-latency head tracking and GPU/fast CPU support.
  • Filter Conversion: Binaural filters can be translated to IIR structure for embedded audio rendering.
  • Physical Calibration: Only anechoic loudspeaker measurements and analytic directivity are required—no site-specific in-room calibration necessary for PSZ deployment (Jiang et al., 10 Jan 2026).

A plausible implication is that BSANN architectures, by unifying spatial adaptivity with physically informed modeling and efficient runtime, have enabled practical, high-fidelity spatial audio rendering in both synthesized (V2A) and real acoustic scenarios.
