OnlineSpatialNet: Real-Time Spatial Noise Estimation
- OnlineSpatialNet is a lightweight, causal neural architecture designed for estimating the scale-normalized Cholesky factor of spatial noise covariance in multichannel speech enhancement.
- It employs an interleaved structure of narrow-band and cross-band blocks to process STFT signals efficiently, supporting direction-preserving MIMO Wiener filtering.
- Its frame-by-frame, online operation and reduced computational footprint enable real-time applications such as beamforming, binaural rendering, and direction-of-arrival estimation.
OnlineSpatialNet is a lightweight neural architecture for the online, causal estimation of spatial noise covariance in multichannel speech enhancement. It operates as a front-end for direction-preserving multiple-input multiple-output (MIMO) speech enhancement, directly estimating a scale-normalized Cholesky factor of the frequency-domain noise covariance. This enables direction-preserving MIMO Wiener filtering, which not only enhances the target speech signal but also preserves the spatial characteristics essential for downstream applications such as beamforming, binaural rendering, and direction-of-arrival estimation (Deppisch, 13 Apr 2026).
1. Network Architecture and Processing
OnlineSpatialNet processes multichannel short-time Fourier transform (STFT) signals with a streamlined architecture that emphasizes causal, frame-by-frame operation:
- Input: the complex STFT coefficients of all microphone channels at each time-frequency bin.
- Front-end: A real-valued convolution projects the concatenated real and imaginary STFT components to a 64-dimensional latent space.
- Interleaved Structure: Four narrow-band blocks (operating independently per frequency bin) alternate with four cross-band blocks:
- Narrow-band blocks employ a causal retention mechanism—a causal alternative to self-attention—coupled with a temporal-convolutional feed-forward network (T-ConvFFN) with hidden size 128 and GELU activation.
- Cross-band blocks mix frequency channels via convolutions (hidden size 64, ReLU activation).
- Output Head: A complex-valued linear layer produces real parameters encoding the lower-triangular Cholesky factor of the spatial covariance:
- Off-diagonal elements are unconstrained complex values.
- Diagonal elements are strictly positive, enforced by a softplus activation plus a small positive offset.
- Parameter and Computational Complexity:
- Total parameters: approximately 0.82 M.
- Measured complexity: 23.23 GFLOPs/s on 32 kHz audio.
- Causality: All convolutions and retention blocks are strictly causal, requiring only past and present frames.
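The output-head parameterization described above can be sketched as follows. The packing order of the real parameters, the function name, and the offset value `eps` are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def cholesky_from_params(raw, M=6, eps=1e-4):
    """Map unconstrained network outputs to a lower-triangular
    Cholesky factor for an M-channel covariance.

    raw holds 2 * M*(M+1)//2 real values per TF bin: real and
    imaginary parts for each lower-triangular entry (assumed packing).
    """
    n_tri = M * (M + 1) // 2
    re, im = raw[:n_tri], raw[n_tri:]
    vals = re + 1j * im
    L = np.zeros((M, M), dtype=complex)
    rows, cols = np.tril_indices(M)
    L[rows, cols] = vals                      # off-diagonals stay unconstrained complex
    # Diagonal must be strictly positive: softplus plus a small offset,
    # applied to the real part of the predicted diagonal entries.
    diag_raw = np.real(np.diag(L))
    L[np.diag_indices(M)] = np.log1p(np.exp(diag_raw)) + eps
    return L
```

The resulting product `L @ L.conj().T` is Hermitian positive semi-definite by construction, which is the point of predicting a Cholesky factor rather than covariance entries directly.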
2. Covariance Estimation via Scale-Normalized Cholesky Factor
At each frequency bin, OnlineSpatialNet employs a sequence of normalization and estimation steps to ensure scale-invariant and causal covariance estimation:
- Sample Covariance Estimation: The short-term sample mixture covariance is causally averaged over a trailing 100 ms window, $\hat{\mathbf{\Phi}}_{\mathbf{y}}(t,f) = \frac{1}{|\mathcal{T}_t|}\sum_{\tau\in\mathcal{T}_t}\mathbf{y}(\tau,f)\,\mathbf{y}^{\mathsf{H}}(\tau,f)$, where $\mathcal{T}_t$ indexes the frames inside the window.
- Scale Computation: The frequency-dependent energy scale is obtained from the average channel power, $\sigma^2(t,f) = \tfrac{1}{M}\operatorname{tr}\hat{\mathbf{\Phi}}_{\mathbf{y}}(t,f)$.
- Mixture Normalization: The mixture STFT is normalized as $\tilde{\mathbf{y}}(t,f) = \mathbf{y}(t,f)/\sigma(t,f)$.
- Cholesky Factor Prediction: OnlineSpatialNet maps the normalized mixture to a lower-triangular Cholesky factor $\tilde{\mathbf{L}}(t,f)$.
- Noise Covariance Formation: The scale-normalized noise covariance is constructed as $\hat{\mathbf{\Phi}}_{\mathbf{n}}(t,f) = \sigma^2(t,f)\,\tilde{\mathbf{L}}(t,f)\,\tilde{\mathbf{L}}^{\mathsf{H}}(t,f)$.
- Normalization Consistency: The mixture covariance is analogously normalized, $\tilde{\mathbf{\Phi}}_{\mathbf{y}}(t,f) = \hat{\mathbf{\Phi}}_{\mathbf{y}}(t,f)/\sigma^2(t,f)$.
This normalization renders the MIMO filter invariant to channel scaling.
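The normalization and covariance-formation steps above can be sketched per frequency bin in NumPy. The function name, the window length `win`, and the treatment of the predicted factor as a given input are illustrative assumptions:

```python
import numpy as np

def scale_and_covariances(y_frames, L_pred, win=13):
    """Scale-invariant covariance pipeline for one frequency bin.

    y_frames: (T, M) complex mixture STFT frames, past to present.
    L_pred:   (M, M) Cholesky factor predicted from the normalized
              mixture (taken as given here).
    win:      trailing averaging window in frames (~100 ms); the
              concrete value is arbitrary for this demo.
    """
    M = y_frames.shape[1]
    recent = y_frames[-win:]
    # Causal short-term sample mixture covariance over the window.
    Phi_y = recent.T @ recent.conj() / len(recent)
    # Frequency-dependent energy scale from the average channel power.
    sigma2 = np.trace(Phi_y).real / M
    # Normalized current frame (what the network would see).
    y_norm = y_frames[-1] / np.sqrt(sigma2)
    # Noise covariance reconstructed at the original scale.
    Phi_n = sigma2 * (L_pred @ L_pred.conj().T)
    # Consistently normalized mixture covariance.
    Phi_y_norm = Phi_y / sigma2
    return y_norm, Phi_n, Phi_y_norm
```

Note that by construction the normalized mixture covariance has trace $M$, so the downstream filter sees inputs whose overall level is fixed regardless of channel gain.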
3. End-to-End Training with Composite Loss
OnlineSpatialNet is trained in conjunction with a differentiable direction-preserving multi-channel Wiener filter (DP-MWF), optimizing both signal quality and covariance estimation:
- Composite Loss: $\mathcal{L} = \mathcal{L}_{\text{SI-SDR}} + \lambda\,\mathcal{L}_{\text{Chol}}$, with $\lambda$ a scalar weight balancing the two terms.
- Multichannel SI-SDR Loss: $\mathcal{L}_{\text{SI-SDR}} = -\tfrac{1}{M}\sum_{m=1}^{M}\operatorname{SI\text{-}SDR}(\hat{x}_m, x_m)$, averaging the scale-invariant SDR of the filtered output over all channels.
- Normalized Frobenius Loss (Cholesky): $\mathcal{L}_{\text{Chol}} = \|\tilde{\mathbf{L}} - \mathbf{L}\|_F^2 \,/\, \|\mathbf{L}\|_F^2$, where $\mathbf{L}$ is the oracle Cholesky factor of the true normalized noise covariance.
Training is fully causal and online, supporting real-time deployment.
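The composite loss can be sketched as follows; the weighting `lam`, the function names, and the single-frequency treatment of the Cholesky term are illustrative assumptions (the paper's exact value of the weight is not reproduced here):

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for one channel (time-domain signals)."""
    alpha = np.dot(ref, est) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    return 10 * np.log10(np.sum(target**2) / (np.sum((est - target)**2) + eps))

def composite_loss(est, ref, L_hat, L_oracle, lam=1.0):
    """Negative multichannel SI-SDR plus normalized Frobenius distance
    between predicted and oracle Cholesky factors."""
    sdr = np.mean([si_sdr(est[m], ref[m]) for m in range(est.shape[0])])
    frob = np.linalg.norm(L_hat - L_oracle) ** 2 / (np.linalg.norm(L_oracle) ** 2 + 1e-8)
    return -sdr + lam * frob
```

Both terms are differentiable, so gradients can flow through the Wiener filter back into the covariance estimator during end-to-end training.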
4. Direction-Preserving MIMO Wiener Filtering
OnlineSpatialNet outputs parameterize the DP-MWF, a direction-preserving extension of the multi-channel Wiener filter [Herzog & Habets, 2021]:
- Classical MIMO Wiener Filter: $\mathbf{W}(t,f) = \hat{\mathbf{\Phi}}_{\mathbf{x}}(t,f)\,\hat{\mathbf{\Phi}}_{\mathbf{y}}^{-1}(t,f)$, with the speech covariance estimated as $\hat{\mathbf{\Phi}}_{\mathbf{x}} = \hat{\mathbf{\Phi}}_{\mathbf{y}} - \hat{\mathbf{\Phi}}_{\mathbf{n}}$.
- Parametric DP-MWF: The direction-preserving variant introduces tunable trade-off parameters into this filter so that noise reduction can be balanced against speech distortion while the spatial image of the target is preserved; the three filter parameters are fixed to constant values in the experiments.
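A minimal sketch of the classical MIMO Wiener filter applied per time-frequency bin is given below; the parametric DP-MWF adds trade-off parameters on top of this form, which are not modeled here:

```python
import numpy as np

def mimo_wiener(Phi_y, Phi_n):
    """Classical MIMO Wiener filter W = Phi_x Phi_y^{-1} for one TF bin,
    with the speech covariance estimated as Phi_x = Phi_y - Phi_n."""
    Phi_x = Phi_y - Phi_n
    return Phi_x @ np.linalg.inv(Phi_y)

# Per TF bin, the enhanced multichannel frame is x_hat = W @ y.
```

In the noise-free limit (`Phi_n = 0`) the filter reduces to the identity, i.e., the mixture passes through unchanged.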
5. Online, Real-Time Operation
OnlineSpatialNet is designed for low-latency streaming scenarios:
- Causal Processing: No look-ahead; only past and present frames are used.
- Covariance Window: Mixture statistics are averaged over a trailing 100 ms window (400 STFT frames at 32 kHz).
- Single-Frame Inference: Each output Cholesky factor requires only the current normalized input frame and the retention block states.
This supports real-time operation on continuous multichannel audio.
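The trailing-window mixture statistics can be maintained frame by frame with a bounded buffer, so each update costs O(M²) regardless of the window length. This is a sketch of one possible streaming implementation, not the paper's code:

```python
import numpy as np
from collections import deque

class StreamingCovariance:
    """Trailing-window sample covariance over the last `win` frames,
    updated one STFT frame at a time for a single frequency bin."""

    def __init__(self, M, win):
        self.win = win
        self.buf = deque()                     # per-frame outer products in the window
        self.acc = np.zeros((M, M), dtype=complex)

    def update(self, y):
        """y: (M,) complex STFT frame. Returns the current covariance."""
        outer = np.outer(y, y.conj())
        self.buf.append(outer)
        self.acc += outer
        if len(self.buf) > self.win:           # drop the frame leaving the window
            self.acc -= self.buf.popleft()
        return self.acc / len(self.buf)
```

Keeping the per-frame outer products in the buffer avoids recomputing the sum over the whole window at every step, which matters at streaming frame rates.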
6. Comparative Complexity and Efficiency
Compared to a mask-based MIMO baseline (NICE), OnlineSpatialNet exhibits significantly reduced computational and parameter cost, while matching or exceeding performance in covariance estimation and target metrics:
| Model | Params [M] | GFLOPs/s |
|---|---|---|
| OnlineSpatialNet | 0.82 | 23.23 |
| NICE | 2.54 | 59.71 |
OnlineSpatialNet achieves efficient direct estimation of the MIMO covariance structure, as opposed to channel mask inference plus averaging used in mask-based methods.
7. Quantitative Performance and Application Impact
On a six-microphone circular array, using the DNS 4 dataset and simulated acoustic environments, OnlineSpatialNet demonstrates:
| Metric | Unprocessed | NICE | OnlineSpatialNet | Oracle DP-MWF |
|---|---|---|---|---|
| SI-SDR (dB) ↑ | 0.02 | 8.50 | 9.37 | 11.01 |
| Cholesky error (norm. Frobenius) ↓ | n/a | 0.38 | 0.32 | n/a |
| NR (dB) ↑ | n/a | 12.11 | 11.72 | 15.61 |
| CovSim ↑ | n/a | 0.92 | 0.93 | n/a |
| SpeechSim ↑ | 0.79 | 0.82 | 0.83 | 0.90 |
| NoiseSim ↑ | n/a | 0.90 | 0.89 | 0.88 |
- Downstream Task Performance:
- Beamforming (DS SI-SDR, dB): OnlineSpatialNet = 5.61, NICE = 5.27, Oracle = 6.46.
- Binaural rendering (ΔILD error, dB): OnlineSpatialNet = 0.28, NICE = 0.37, Oracle = 0.20.
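A simplified stand-in for the ΔILD metric reported above, assuming it compares the interaural level differences of estimated and reference binaural STFTs (the paper's exact definition, weighting, and averaging may differ):

```python
import numpy as np

def ild_error_db(est_lr, ref_lr, eps=1e-8):
    """Mean absolute interaural level difference (ILD) error in dB.

    est_lr, ref_lr: (2, T, F) complex binaural STFTs (left, right).
    """
    def ild(x):
        # Per-TF-bin level difference between left and right channels.
        return 10 * np.log10((np.abs(x[0]) ** 2 + eps) / (np.abs(x[1]) ** 2 + eps))
    return np.mean(np.abs(ild(est_lr) - ild(ref_lr)))
```

Since the metric depends only on the ratio of channel powers, a filter that applies the same gain to both ears leaves it unchanged, which is exactly the spatial-preservation property the DP-MWF targets.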
These results show that OnlineSpatialNet closely approaches oracle DP-MWF in speech enhancement and downstream spatial tasks, with a markedly lower computational footprint (Deppisch, 13 Apr 2026).