
OnlineSpatialNet: Real-Time Spatial Noise Estimation

Updated 22 April 2026
  • OnlineSpatialNet is a lightweight, causal neural architecture designed for estimating the scale-normalized Cholesky factor of spatial noise covariance in multichannel speech enhancement.
  • It employs an interleaved structure of narrow-band and cross-band blocks to process STFT signals efficiently, supporting direction-preserving MIMO Wiener filtering.
  • Its frame-by-frame, online operation and reduced computational footprint enable real-time applications such as beamforming, binaural rendering, and direction-of-arrival estimation.

OnlineSpatialNet is a lightweight neural architecture for the online, causal estimation of spatial noise covariance in multichannel speech enhancement. It operates as a front-end for direction-preserving multiple-input multiple-output (MIMO) speech enhancement, directly estimating a scale-normalized Cholesky factor of the frequency-domain noise covariance. This enables direction-preserving MIMO Wiener filtering, which not only enhances the target speech signal but also preserves the spatial characteristics essential for downstream applications such as beamforming, binaural rendering, and direction-of-arrival estimation (Deppisch, 13 Apr 2026).

1. Network Architecture and Processing

OnlineSpatialNet processes multichannel short-time Fourier transform (STFT) signals with a streamlined architecture that emphasizes causal, frame-by-frame operation:

  • Input: M = 6 complex STFT channels x(t, f) ∈ ℂ^M per time-frequency bin.
  • Front-end: A 1 × 1 real-valued convolution projects the concatenated real and imaginary STFT components to a 64-dimensional latent space.
  • Interleaved Structure: Four narrow-band blocks (operating per frequency ff) alternate with four cross-band blocks:
    • Narrow-band blocks employ a causal retention mechanism—a causal alternative to self-attention—coupled with a temporal-convolutional feed-forward network (T-ConvFFN) with hidden size 128 and GELU activation.
    • Cross-band blocks mix frequency channels via 1 × 1 convolutions (hidden size 64, ReLU activation).
  • Output Head: A complex-valued linear layer produces M(M+1)/2 real parameters encoding the lower-triangular Cholesky factor L(t, f) of the spatial covariance:
    • Off-diagonal elements are unconstrained complex values.
    • Diagonal elements are strictly positive, enforced by a softplus activation plus ε ≈ 10⁻⁶.
  • Parameter and Computational Complexity:
    • Total parameters: approximately 0.82 M.
    • Measured complexity: 23.23 GFLOPs/s on 32 kHz audio.
  • Causality: All convolutions and retention blocks are strictly causal, requiring only past and present frames.
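As an illustration of the output head's parameterization, the following NumPy sketch maps raw head outputs to a valid lower-triangular Cholesky factor. The layout assumption (one complex value per lower-triangular entry) and the helper name `build_cholesky` are illustrative, not taken from the paper.

```python
import numpy as np

def build_cholesky(params, M=6, eps=1e-6):
    """Map raw output-head values to a lower-triangular Cholesky factor L
    whose diagonal is strictly positive (softplus + eps), so that
    L @ L^H is always a valid Hermitian PSD covariance."""
    assert params.shape[-1] == M * (M + 1) // 2
    L = np.zeros((M, M), dtype=complex)
    rows, cols = np.tril_indices(M)
    L[rows, cols] = params                     # fill lower triangle
    # Diagonal: softplus of the real part, plus eps, -> strictly positive.
    diag = np.log1p(np.exp(np.real(np.diag(L)))) + eps
    L[np.diag_indices(M)] = diag
    return L

M = 6
rng = np.random.default_rng(0)
raw = rng.standard_normal(M * (M + 1) // 2) + 1j * rng.standard_normal(M * (M + 1) // 2)
L = build_cholesky(raw, M)
Phi = L @ L.conj().T                           # Hermitian PSD by construction
print(np.allclose(Phi, Phi.conj().T))          # True
```

Parameterizing the factor rather than the covariance itself guarantees positive semi-definiteness without any projection step at inference time.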

2. Covariance Estimation via Scale-Normalized Cholesky Factor

At each frequency bin, OnlineSpatialNet employs a sequence of normalization and estimation steps to ensure scale-invariant and causal covariance estimation:

  1. Sample Covariance Estimation: The short-term sample mixture covariance is causally averaged over a trailing 100 ms window of N frames:

\hat{R}_{xx}(t, f) = \frac{1}{N} \sum_{\tau = t-N+1}^{t} x(\tau, f)\, x(\tau, f)^{H}

  2. Scale Computation: The global frequency-dependent energy scale is computed:

\gamma(f) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{M} \operatorname{tr}\{\hat{R}_{xx}(t, f)\}

  3. Mixture Normalization: The mixture STFT is normalized by the energy scale:

\tilde{x}(t, f) = \frac{x(t, f)}{\sqrt{\gamma(f)}}

  4. Cholesky Factor Prediction: OnlineSpatialNet maps the normalized mixture to the Cholesky factor:

\hat{L}(t, f) = \mathrm{OnlineSpatialNet}_{\theta}\big(\tilde{x}(t, f)\big)

  5. Noise Covariance Formation: The scale-normalized noise covariance is constructed from the predicted factor:

\hat{\Phi}_{nn}(t, f) = \hat{L}(t, f)\, \hat{L}(t, f)^{H}

  6. Normalization Consistency: The mixture covariance is analogously normalized:

\tilde{\Phi}_{xx}(t, f) = \frac{\hat{R}_{xx}(t, f)}{\gamma(f)}

This normalization renders the MIMO filter invariant to channel scaling.
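The normalization steps above can be sketched for a single frequency bin. In this sketch the trailing window is simplified to an average over all available frames, and the function name `normalize_and_form` is illustrative.

```python
import numpy as np

def normalize_and_form(X, L_hat):
    """Sketch of the scale normalization for one frequency bin.
    X: (T, M) STFT frames of the bin; L_hat: (M, M) lower-triangular
    Cholesky factor predicted for the current frame."""
    T, M = X.shape
    R_xx = (X.T @ X.conj()) / T          # sample mixture covariance
    gamma = np.trace(R_xx).real / M      # energy scale gamma(f)
    X_norm = X / np.sqrt(gamma)          # normalized network input
    Phi_nn = L_hat @ L_hat.conj().T      # noise covariance from Cholesky factor
    Phi_xx = R_xx / gamma                # consistently normalized mixture covariance
    return X_norm, Phi_xx, Phi_nn

rng = np.random.default_rng(1)
T, M = 400, 6
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
L_hat = np.tril(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
X_norm, Phi_xx, Phi_nn = normalize_and_form(X, L_hat)
# After normalization the mixture covariance has unit mean channel power:
print(np.trace(Phi_xx).real / M)   # -> 1.0 (up to floating point)
```

Because the same scale γ(f) divides both the network input and the mixture covariance, the resulting filter is invariant to an overall channel gain.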

3. End-to-End Training with Composite Loss

OnlineSpatialNet is trained in conjunction with a differentiable direction-preserving multi-channel Wiener filter (DP-MWF), optimizing both signal quality and covariance estimation:

  • Composite Loss: a weighted sum of a signal-domain loss on the DP-MWF output and a covariance loss on the predicted Cholesky factor,

    \mathcal{L} = \mathcal{L}_{\mathrm{signal}} + \lambda\, \mathcal{L}_{\mathrm{Chol}}

  • Normalized Frobenius Loss (Cholesky):

    \mathcal{L}_{\mathrm{Chol}} = \frac{\lVert \hat{L}(t, f) - L(t, f) \rVert_F}{\lVert L(t, f) \rVert_F}

    where L(t, f) is the oracle Cholesky factor of the true noise covariance.

Training is fully causal and online, supporting real-time deployment.
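The Cholesky term of the composite loss can be sketched as follows; normalizing by the Frobenius norm of the oracle factor is an assumption consistent with the loss name, and `cholesky_frobenius_loss` is a hypothetical helper.

```python
import numpy as np

def cholesky_frobenius_loss(L_hat, L_oracle, eps=1e-8):
    """Normalized Frobenius loss between predicted and oracle Cholesky
    factors for one time-frequency bin (illustrative formulation)."""
    num = np.linalg.norm(L_hat - L_oracle)   # Frobenius norm by default
    den = np.linalg.norm(L_oracle) + eps     # eps guards against silent bins
    return num / den

M = 6
rng = np.random.default_rng(2)
L_oracle = np.tril(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
print(cholesky_frobenius_loss(L_oracle, L_oracle))  # -> 0.0 for a perfect estimate
```

The normalization makes the covariance term insensitive to the absolute energy of each bin, matching the scale-normalized estimation target.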

4. Direction-Preserving MIMO Wiener Filtering

OnlineSpatialNet outputs parameterize the DP-MWF, a direction-preserving extension of the multi-channel Wiener filter [Herzog & Habets, 2021]:

  • Classical MIMO Wiener Filter:

W_{\mathrm{MWF}}(t, f) = \tilde{\Phi}_{xx}(t, f)^{-1}\,\big(\tilde{\Phi}_{xx}(t, f) - \hat{\Phi}_{nn}(t, f)\big)

  • Parametric DP-MWF: the DP-MWF generalizes this filter with tunable trade-off parameters that balance noise reduction against preservation of the spatial characteristics of the residual noise, applied per time-frequency bin using the estimated covariances.

In the reported experiments, the DP-MWF trade-off parameters are set to fixed values.
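The classical MIMO Wiener filter above can be sketched per time-frequency bin as follows. Only the classical filter is shown (the parametric DP-MWF adds trade-off terms), and the diagonal loading is an illustrative safeguard, not part of the paper's formulation.

```python
import numpy as np

def mimo_wiener(Phi_xx, Phi_nn, diag_load=1e-6):
    """Classical MIMO Wiener filter W = Phi_xx^{-1} (Phi_xx - Phi_nn)
    for one time-frequency bin. Diagonal loading regularizes
    ill-conditioned mixture covariances."""
    M = Phi_xx.shape[0]
    Phi_xx_reg = Phi_xx + diag_load * np.eye(M)
    # Solve Phi_xx W = (Phi_xx - Phi_nn) instead of forming the inverse.
    return np.linalg.solve(Phi_xx_reg, Phi_xx - Phi_nn)

M = 6
rng = np.random.default_rng(3)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_xx = A @ A.conj().T + np.eye(M)          # well-conditioned mixture covariance
W = mimo_wiener(Phi_xx, np.zeros((M, M)))    # no noise -> filter passes input through
print(np.allclose(W, np.eye(M), atol=1e-4))  # True
```

Since Φ_ss = Φ_xx − Φ_nn, the filter needs only the mixture statistics and the network's noise covariance estimate.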

5. Online, Real-Time Operation

OnlineSpatialNet is designed for low-latency streaming scenarios:

  • Causal Processing: No look-ahead; only past and present frames are used.

  • Covariance Window: Mixture statistics are averaged over a trailing 100 ms window (400 STFT frames at 32 kHz).

  • Single-Frame Inference: Each output Cholesky factor requires only the current normalized input frame and the retention block states.

This supports real-time operation on continuous multichannel audio.
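The trailing-window mixture statistics can be maintained frame by frame with a ring buffer; a minimal sketch (the class name is illustrative, N = 400 frames follows the window size above):

```python
import numpy as np

class TrailingCovariance:
    """Causal running estimate of the sample covariance over the last
    N frames of one frequency bin, updated in O(M^2) per frame."""

    def __init__(self, M, N=400):
        self.N = N
        self.buf = np.zeros((N, M), dtype=complex)  # ring buffer of past frames
        self.acc = np.zeros((M, M), dtype=complex)  # running sum of outer products
        self.t = 0

    def update(self, x):
        idx = self.t % self.N
        old = self.buf[idx]                         # frame leaving the window
        self.acc += np.outer(x, x.conj()) - np.outer(old, old.conj())
        self.buf[idx] = x
        self.t += 1
        return self.acc / min(self.t, self.N)       # average over filled window

M, N = 6, 400
cov = TrailingCovariance(M, N)
rng = np.random.default_rng(4)
for _ in range(1000):
    R = cov.update(rng.standard_normal(M) + 1j * rng.standard_normal(M))
# R now averages only the most recent N frames.
```

Each update touches one incoming and one outgoing frame, so the cost per frame is independent of the window length, which is what makes the 100 ms statistics affordable in streaming operation.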

6. Comparative Complexity and Efficiency

Compared to a mask-based MIMO baseline (NICE), OnlineSpatialNet exhibits significantly reduced computational and parameter cost, while matching or exceeding performance in covariance estimation and target metrics:

Model              Params [M]   GFLOPs/s
OnlineSpatialNet   0.82         23.23
NICE               2.54         59.71

OnlineSpatialNet achieves efficient direct estimation of the MIMO covariance structure, as opposed to channel mask inference plus averaging used in mask-based methods.

7. Quantitative Performance and Application Impact

On a six-microphone circular array, with data from the DNS 4 dataset and simulated acoustic environments, OnlineSpatialNet demonstrates:

Metric          Unprocessed   NICE    OnlineSpatialNet   Oracle DP-MWF
SI-SDR (dB) ↑   0.02          8.50    9.37               11.01
… ↓             n/a           0.38    0.32               n/a
NR (dB) ↑       n/a           12.11   11.72              15.61
CovSim ↑        n/a           0.92    0.93               n/a
SpeechSim ↑     0.79          0.82    0.83               0.90
NoiseSim ↑      n/a           0.90    0.89               0.88
  • Downstream Task Performance:
    • Beamforming (DS SI-SDR, dB): OnlineSpatialNet = 5.61, NICE = 5.27, Oracle = 6.46.
    • Binaural rendering (ΔILD error, dB): OnlineSpatialNet = 0.28, NICE = 0.37, Oracle = 0.20.

These results show that OnlineSpatialNet closely approaches oracle DP-MWF in speech enhancement and downstream spatial tasks, with a markedly lower computational footprint (Deppisch, 13 Apr 2026).

References (1)
