OnlineSpatialNet: Real-Time Spatial Noise Estimation
- OnlineSpatialNet is a lightweight, causal neural architecture designed for estimating the scale-normalized Cholesky factor of spatial noise covariance in multichannel speech enhancement.
- It employs an interleaved structure of narrow-band and cross-band blocks to process STFT signals efficiently, supporting direction-preserving MIMO Wiener filtering.
- Its frame-by-frame, online operation and reduced computational footprint enable real-time applications such as beamforming, binaural rendering, and direction-of-arrival estimation.
OnlineSpatialNet is a lightweight neural architecture for the online, causal estimation of spatial noise covariance in multichannel speech enhancement. It operates as a front-end for direction-preserving multiple-input multiple-output (MIMO) speech enhancement, directly estimating a scale-normalized Cholesky factor of the frequency-domain noise covariance. This enables direction-preserving MIMO Wiener filtering, which not only enhances the target speech signal but also preserves the spatial characteristics essential for downstream applications such as beamforming, binaural rendering, and direction-of-arrival estimation (Deppisch, 13 Apr 2026).
1. Network Architecture and Processing
OnlineSpatialNet processes multichannel short-time Fourier transform (STFT) signals with a streamlined architecture that emphasizes causal, frame-by-frame operation:
- Input: the complex STFT coefficients of all microphone channels at each time-frequency bin.
- Front-end: A real-valued convolution projects the concatenated real and imaginary STFT components to a 64-dimensional latent space.
- Interleaved Structure: Four narrow-band blocks (operating independently per frequency bin) alternate with four cross-band blocks:
- Narrow-band blocks employ a causal retention mechanism—a causal alternative to self-attention—coupled with a temporal-convolutional feed-forward network (T-ConvFFN) with hidden size 128 and GELU activation.
- Cross-band blocks mix frequency channels via convolutions (hidden size 64, ReLU activation).
- Output Head: A complex-valued linear layer produces real parameters encoding the lower-triangular Cholesky factor of the spatial covariance:
- Off-diagonal elements are unconstrained complex values.
- Diagonal elements are strictly positive, enforced by a softplus activation plus a small positive offset.
- Parameter and Computational Complexity:
- Total parameters: approximately 0.82 M.
- Measured complexity: 23.23 GFLOPs/s on 32 kHz audio.
- Causality: All convolutions and retention blocks are strictly causal, requiring only past and present frames.
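The output-head parameterization described above can be sketched as follows. The packing order of the real parameters, the function name, and the offset value `eps` are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def cholesky_from_params(raw, M=6, eps=1e-4):
    """Map unconstrained network outputs to a lower-triangular
    Cholesky factor for an M-channel covariance.

    raw holds 2 * M*(M+1)//2 real values per TF bin: real and
    imaginary parts for each lower-triangular entry (assumed packing).
    """
    n_tri = M * (M + 1) // 2
    re, im = raw[:n_tri], raw[n_tri:]
    vals = re + 1j * im
    L = np.zeros((M, M), dtype=complex)
    rows, cols = np.tril_indices(M)
    L[rows, cols] = vals                      # off-diagonals stay unconstrained complex
    # Diagonal must be strictly positive: softplus plus a small offset,
    # applied to the real part of the predicted diagonal entries.
    diag_raw = np.real(np.diag(L))
    L[np.diag_indices(M)] = np.log1p(np.exp(diag_raw)) + eps
    return L
```

The resulting product `L @ L.conj().T` is Hermitian positive semi-definite by construction, which is the point of predicting a Cholesky factor rather than covariance entries directly.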
2. Covariance Estimation via Scale-Normalized Cholesky Factor
At each frequency bin, OnlineSpatialNet employs a sequence of normalization and estimation steps to ensure scale-invariant and causal covariance estimation:
- Sample Covariance Estimation: The short-term sample mixture covariance is causally averaged over a trailing 100 ms window, $\hat{\mathbf{\Phi}}_{\mathbf{y}}(t,f) = \frac{1}{|\mathcal{T}_t|}\sum_{\tau\in\mathcal{T}_t}\mathbf{y}(\tau,f)\,\mathbf{y}^{\mathsf{H}}(\tau,f)$, where $\mathcal{T}_t$ indexes the frames inside the window.
- Scale Computation: The frequency-dependent energy scale is obtained from the average channel power, $\sigma^2(t,f) = \tfrac{1}{M}\operatorname{tr}\hat{\mathbf{\Phi}}_{\mathbf{y}}(t,f)$.
- Mixture Normalization: The mixture STFT is normalized as $\tilde{\mathbf{y}}(t,f) = \mathbf{y}(t,f)/\sigma(t,f)$.
- Cholesky Factor Prediction: OnlineSpatialNet maps the normalized mixture to a lower-triangular Cholesky factor $\tilde{\mathbf{L}}(t,f)$.
- Noise Covariance Formation: The scale-normalized noise covariance is constructed as $\hat{\mathbf{\Phi}}_{\mathbf{n}}(t,f) = \sigma^2(t,f)\,\tilde{\mathbf{L}}(t,f)\,\tilde{\mathbf{L}}^{\mathsf{H}}(t,f)$.
- Normalization Consistency: The mixture covariance is analogously normalized, $\tilde{\mathbf{\Phi}}_{\mathbf{y}}(t,f) = \hat{\mathbf{\Phi}}_{\mathbf{y}}(t,f)/\sigma^2(t,f)$.
This normalization renders the MIMO filter invariant to channel scaling.
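The normalization and covariance-formation steps above can be sketched per frequency bin in NumPy. The function name, the window length `win`, and the treatment of the predicted factor as a given input are illustrative assumptions:

```python
import numpy as np

def scale_and_covariances(y_frames, L_pred, win=13):
    """Scale-invariant covariance pipeline for one frequency bin.

    y_frames: (T, M) complex mixture STFT frames, past to present.
    L_pred:   (M, M) Cholesky factor predicted from the normalized
              mixture (taken as given here).
    win:      trailing averaging window in frames (~100 ms); the
              concrete value is arbitrary for this demo.
    """
    M = y_frames.shape[1]
    recent = y_frames[-win:]
    # Causal short-term sample mixture covariance over the window.
    Phi_y = recent.T @ recent.conj() / len(recent)
    # Frequency-dependent energy scale from the average channel power.
    sigma2 = np.trace(Phi_y).real / M
    # Normalized current frame (what the network would see).
    y_norm = y_frames[-1] / np.sqrt(sigma2)
    # Noise covariance reconstructed at the original scale.
    Phi_n = sigma2 * (L_pred @ L_pred.conj().T)
    # Consistently normalized mixture covariance.
    Phi_y_norm = Phi_y / sigma2
    return y_norm, Phi_n, Phi_y_norm
```

Note that by construction the normalized mixture covariance has trace $M$, so the downstream filter sees inputs whose overall level is fixed regardless of channel gain.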
3. End-to-End Training with Composite Loss
OnlineSpatialNet is trained in conjunction with a differentiable direction-preserving multi-channel Wiener filter (DP-MWF), optimizing both signal quality and covariance estimation:
- Composite Loss: $\mathcal{L} = \mathcal{L}_{\text{SI-SDR}} + \lambda\,\mathcal{L}_{\text{Chol}}$, with $\lambda$ a scalar weight balancing the two terms.
- Multichannel SI-SDR Loss: $\mathcal{L}_{\text{SI-SDR}} = -\tfrac{1}{M}\sum_{m=1}^{M}\operatorname{SI\text{-}SDR}(\hat{x}_m, x_m)$, averaging the scale-invariant SDR of the filtered output over all channels.
- Normalized Frobenius Loss (Cholesky): $\mathcal{L}_{\text{Chol}} = \|\tilde{\mathbf{L}} - \mathbf{L}\|_F^2 \,/\, \|\mathbf{L}\|_F^2$, where $\mathbf{L}$ is the oracle Cholesky factor of the true normalized noise covariance.
Training is fully causal and online, supporting real-time deployment.
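The composite loss can be sketched as follows; the weighting `lam`, the function names, and the single-frequency treatment of the Cholesky term are illustrative assumptions (the paper's exact value of the weight is not reproduced here):

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for one channel (time-domain signals)."""
    alpha = np.dot(ref, est) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    return 10 * np.log10(np.sum(target**2) / (np.sum((est - target)**2) + eps))

def composite_loss(est, ref, L_hat, L_oracle, lam=1.0):
    """Negative multichannel SI-SDR plus normalized Frobenius distance
    between predicted and oracle Cholesky factors."""
    sdr = np.mean([si_sdr(est[m], ref[m]) for m in range(est.shape[0])])
    frob = np.linalg.norm(L_hat - L_oracle) ** 2 / (np.linalg.norm(L_oracle) ** 2 + 1e-8)
    return -sdr + lam * frob
```

Both terms are differentiable, so gradients can flow through the Wiener filter back into the covariance estimator during end-to-end training.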
4. Direction-Preserving MIMO Wiener Filtering
OnlineSpatialNet outputs parameterize the DP-MWF, a direction-preserving extension of the multi-channel Wiener filter [Herzog & Habets, 2021]:
- Classical MIMO Wiener Filter: $\mathbf{W}(t,f) = \hat{\mathbf{\Phi}}_{\mathbf{x}}(t,f)\,\hat{\mathbf{\Phi}}_{\mathbf{y}}^{-1}(t,f)$, with the speech covariance estimated as $\hat{\mathbf{\Phi}}_{\mathbf{x}} = \hat{\mathbf{\Phi}}_{\mathbf{y}} - \hat{\mathbf{\Phi}}_{\mathbf{n}}$.
- Parametric DP-MWF: The direction-preserving variant introduces tunable trade-off parameters into this filter so that noise reduction can be balanced against speech distortion while the spatial image of the target is preserved; the three filter parameters are fixed to constant values in the experiments.
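A minimal sketch of the classical MIMO Wiener filter applied per time-frequency bin is given below; the parametric DP-MWF adds trade-off parameters on top of this form, which are not modeled here:

```python
import numpy as np

def mimo_wiener(Phi_y, Phi_n):
    """Classical MIMO Wiener filter W = Phi_x Phi_y^{-1} for one TF bin,
    with the speech covariance estimated as Phi_x = Phi_y - Phi_n."""
    Phi_x = Phi_y - Phi_n
    return Phi_x @ np.linalg.inv(Phi_y)

# Per TF bin, the enhanced multichannel frame is x_hat = W @ y.
```

In the noise-free limit (`Phi_n = 0`) the filter reduces to the identity, i.e., the mixture passes through unchanged.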
5. Online, Real-Time Operation
OnlineSpatialNet is designed for low-latency streaming scenarios:
- Causal Processing: No look-ahead; only past and present frames are used.
- Covariance Window: Mixture statistics are averaged over a trailing 100 ms window (400 STFT frames at 32 kHz).
- Single-Frame Inference: Each output Cholesky factor requires only the current normalized input frame and the retention block states.
This supports real-time operation on continuous multichannel audio.
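The trailing-window mixture statistics can be maintained frame by frame with a bounded buffer, so each update costs O(M²) regardless of the window length. This is a sketch of one possible streaming implementation, not the paper's code:

```python
import numpy as np
from collections import deque

class StreamingCovariance:
    """Trailing-window sample covariance over the last `win` frames,
    updated one STFT frame at a time for a single frequency bin."""

    def __init__(self, M, win):
        self.win = win
        self.buf = deque()                     # per-frame outer products in the window
        self.acc = np.zeros((M, M), dtype=complex)

    def update(self, y):
        """y: (M,) complex STFT frame. Returns the current covariance."""
        outer = np.outer(y, y.conj())
        self.buf.append(outer)
        self.acc += outer
        if len(self.buf) > self.win:           # drop the frame leaving the window
            self.acc -= self.buf.popleft()
        return self.acc / len(self.buf)
```

Keeping the per-frame outer products in the buffer avoids recomputing the sum over the whole window at every step, which matters at streaming frame rates.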
6. Comparative Complexity and Efficiency
Compared to a mask-based MIMO baseline (NICE), OnlineSpatialNet exhibits significantly reduced computational and parameter cost, while matching or exceeding performance in covariance estimation and target metrics:
| Model | Params [M] | GFLOPs/s |
|---|---|---|
| OnlineSpatialNet | 0.82 | 23.23 |
| NICE | 2.54 | 59.71 |
OnlineSpatialNet achieves efficient direct estimation of the MIMO covariance structure, as opposed to channel mask inference plus averaging used in mask-based methods.
7. Quantitative Performance and Application Impact
On a six-microphone circular array, using the DNS 4 dataset and simulated acoustic environments, OnlineSpatialNet demonstrates:
| Metric | Unprocessed | NICE | OnlineSpatialNet | Oracle DP-MWF |
|---|---|---|---|---|
| SI-SDR (dB) ↑ | 0.02 | 8.50 | 9.37 | 11.01 |
| Cholesky error (norm. Frobenius) ↓ | n/a | 0.38 | 0.32 | n/a |
| NR (dB) ↑ | n/a | 12.11 | 11.72 | 15.61 |
| CovSim ↑ | n/a | 0.92 | 0.93 | n/a |
| SpeechSim ↑ | 0.79 | 0.82 | 0.83 | 0.90 |
| NoiseSim ↑ | n/a | 0.90 | 0.89 | 0.88 |
- Downstream Task Performance:
- Beamforming (DS SI-SDR, dB): OnlineSpatialNet = 5.61, NICE = 5.27, Oracle = 6.46.
- Binaural rendering (ΔILD error, dB): OnlineSpatialNet = 0.28, NICE = 0.37, Oracle = 0.20.
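A simplified stand-in for the ΔILD metric reported above, assuming it compares the interaural level differences of estimated and reference binaural STFTs (the paper's exact definition, weighting, and averaging may differ):

```python
import numpy as np

def ild_error_db(est_lr, ref_lr, eps=1e-8):
    """Mean absolute interaural level difference (ILD) error in dB.

    est_lr, ref_lr: (2, T, F) complex binaural STFTs (left, right).
    """
    def ild(x):
        # Per-TF-bin level difference between left and right channels.
        return 10 * np.log10((np.abs(x[0]) ** 2 + eps) / (np.abs(x[1]) ** 2 + eps))
    return np.mean(np.abs(ild(est_lr) - ild(ref_lr)))
```

Since the metric depends only on the ratio of channel powers, a filter that applies the same gain to both ears leaves it unchanged, which is exactly the spatial-preservation property the DP-MWF targets.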
These results show that OnlineSpatialNet closely approaches oracle DP-MWF in speech enhancement and downstream spatial tasks, with a markedly lower computational footprint (Deppisch, 13 Apr 2026).