Convolutive-Invariant SDR for Speech Separation
- CI-SDR is a metric that estimates an optimal FIR filter to align clean and estimated speech signals, avoiding penalties for benign reverberation.
- It leverages a Wiener–Hopf least-squares approach to compute filter coefficients, enabling differentiable loss computation within neural networks.
- Empirical evaluations show significant improvements in SDR, PESQ, and WER, demonstrating its effectiveness in realistic, reverberant multi-microphone setups.
Convolutive-Invariant Signal-to-Distortion Ratio (CI-SDR) is a training criterion and evaluation metric designed for robust performance assessment and optimization in multi-channel, reverberant speech separation problems. Unlike traditional Signal-to-Distortion Ratio (SDR) metrics, CI-SDR accounts for and is invariant to finite-length linear convolutive distortions—such as short room impulse responses (RIRs)—between the clean reference and the estimated signal. This property enables CI-SDR to provide a faithful measure of speech fidelity in scenarios where differences arising from benign channel filters or early reverberation should not be penalized.
1. Formal Definition
Let $s[t]$ denote the reference (clean) source signal and $\hat d[t]$ the estimated output, both of length $T$. The key assumption is that the dominant mismatch between $s$ and $\hat d$ can be described by an unknown finite impulse response (FIR) filter $\mathbf a = (a_0, \dots, a_{K-1})^\top$ of length $K$:
$\hat d[t] \approx \sum_{k=0}^{K-1} a_k \, s[t-k]$
Collecting delayed copies of $s$ into the Toeplitz (convolution) matrix $\mathbf S \in \mathbb{R}^{T \times K}$, whose $k$-th column is $s$ delayed by $k$ samples, the "best-fitting" filter coefficients are obtained by the minimization
$\hat{\mathbf a} = \arg\min_{\mathbf a} \|\mathbf S\,\mathbf a - \hat{\mathbf d}\|^2$
This least-squares problem admits the Wiener–Hopf closed-form solution:
$\hat{\mathbf a} = \left(\mathbf S^\top \mathbf S\right)^{-1} \mathbf S^\top \hat{\mathbf d}$
The CI-SDR (in dB) is then
$\mathrm{CI\mbox{-}SDR} = 10\log_{10} \left( \frac{ \|\mathbf S\,\hat{\mathbf a}\|^2 }{ \|\mathbf S\,\hat{\mathbf a} - \hat{\mathbf d}\|^2 } \right )$
For multi-source problems, CI-SDR is computed for each source and averaged, optionally using permutation-invariant training (PIT) to resolve source assignments.
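The permutation search can be done by exhaustive enumeration over source assignments. A minimal sketch, assuming the pairwise CI-SDR values have already been computed (the function name `pit_ci_sdr` and the score-matrix interface are illustrative, not from the paper):

```python
from itertools import permutations

def pit_ci_sdr(score):
    """Given score[i][j], the CI-SDR between reference i and estimate j,
    return the permutation maximizing the mean CI-SDR and that mean."""
    n = len(score)
    best = max(permutations(range(n)),
               key=lambda p: sum(score[i][p[i]] for i in range(n)))
    return best, sum(score[i][best[i]] for i in range(n)) / n
```

Exhaustive search is factorial in the number of sources, which is unproblematic for the two- or three-speaker settings considered here.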
2. Invariance Properties and Relationship to SDR Variants
Standard SDR (as in BSS Eval v2) penalizes both additive and convolutive errors. Scale-Invariant SDR (SI-SDR) relaxes the metric to allow for optimal scalar gain alignment, but still penalizes spectral (i.e., convolutional) mismatches:
$\mathrm{SI\mbox{-}SDR} = 10\log_{10} \frac{ \|a^\star s\|^2 }{ \|a^\star s - \hat d\|^2 }, \quad a^\star = \frac{\langle \hat d, s \rangle}{\|s\|^2}$
CI-SDR generalizes SI-SDR by admitting an optimal length-$K$ FIR filter, thus achieving invariance to all distortions that can be perfectly captured by such a filter. In settings such as speech separation with microphone arrays in reverberant rooms, this property ensures that channel- or speaker-dependent early reverberation, which is benign or unavoidable in the application context, is not counted as error.
| Metric | Invariant To | Penalizes |
|---|---|---|
| SDR | none | additive & convolutive |
| SI-SDR | scalar gain | convolutional mismatch |
| CI-SDR | FIR filtering (length-$K$) | residual error not explainable by an FIR filter |
This systematic relaxation enables the design of loss functions that are well-matched to the separation task and the physical realities of speech acquisition.
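The difference between the two invariances is easy to verify numerically: an estimate that is merely an FIR-filtered copy of the reference scores poorly under SI-SDR but near-perfectly under CI-SDR. A self-contained sketch (filter length, regularizer, and the example filter are illustrative choices):

```python
import numpy as np
from scipy.linalg import toeplitz

def si_sdr(s, d):
    a = np.dot(d, s) / np.dot(s, s)                 # optimal scalar gain
    return 10 * np.log10(np.sum((a * s)**2) / np.sum((a * s - d)**2))

def ci_sdr(s, d, K=32):
    S = toeplitz(s, np.r_[s[0], np.zeros(K - 1)])   # (T, K) convolution matrix
    a, *_ = np.linalg.lstsq(S, d, rcond=None)       # Wiener-Hopf least-squares fit
    proj = S @ a
    return 10 * np.log10(np.sum(proj**2) / (np.sum((proj - d)**2) + 1e-12))

rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
d = np.convolve(s, [1.0, -0.5, 0.25], mode="full")[:4000]  # benign FIR "distortion"
# si_sdr(s, d) is low: the convolutive mismatch is penalized.
# ci_sdr(s, d) is very high: the FIR filter is absorbed by the metric.
```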
3. Algorithmic Implementation
The computation of CI-SDR for one source proceeds as follows:
```python
# Inputs: reference s (length T), estimate d_hat (length T), filter length K, regularizer eps
import numpy as np
from scipy.linalg import toeplitz

S = toeplitz(s, np.r_[s[0], np.zeros(K - 1)])          # (T, K) convolution matrix
R_ss = S.T @ S                                         # (K, K) autocorrelation matrix
p_sd = S.T @ d_hat                                     # (K,) cross-correlation vector
a_opt = np.linalg.solve(R_ss + eps * np.eye(K), p_sd)  # e.g. via Cholesky or torch solve
s_proj = S @ a_opt                                     # length-T filtered reference
num = np.sum(s_proj**2)
den = np.sum((s_proj - d_hat)**2) + eps
ci_sdr = 10 * np.log10(num / den)
```
Practical considerations:
- The regularization term $\varepsilon \mathbf I$ is essential for numerical stability, especially during silence or when the Toeplitz matrix is near-singular.
- For multi-source systems, the above is repeated for each source and outputs are averaged or selected per PIT.
- Toeplitz matrix construction and correlation products can be implemented efficiently and with streaming if memory constraints preclude explicit storage.
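As a concrete illustration of the last point, the normal-equation statistics can be accumulated from dot products over shifted signal segments, without ever materializing the $(T \times K)$ Toeplitz matrix. A sketch under the zero-initial-condition convention used above (the function name is an illustrative assumption):

```python
import numpy as np

def normal_equations_streaming(s, d_hat, K):
    """Accumulate R_ss = S^T S and p_sd = S^T d_hat from correlations,
    where S[t, k] = s[t - k] (zero for t < k), without forming S."""
    T = len(s)
    R = np.empty((K, K))
    p = np.empty(K)
    for i in range(K):
        p[i] = s[: T - i] @ d_hat[i:]                  # cross-correlation at lag i
        for j in range(i, K):
            R[i, j] = R[j, i] = s[j - i : T - i] @ s[: T - j]
    return R, p
```

The double loop costs $O(K^2)$ short dot products; in practice the same quantities can be obtained with FFT-based correlations, and the per-lag products can be accumulated chunk by chunk for streaming input.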
4. Use as a Differentiable Loss in Neural Network Training
CI-SDR can be used directly as a neural network training loss. The loss for $I$ sources is
$\mathcal{L}_{\mathrm{CI\mbox{-}SDR}} = -\frac{1}{I} \sum_{i=1}^{I} \mathrm{CI\mbox{-}SDR}_i$
All operations—matrix construction, multiplication, least-squares solve, norm, and logarithm—are differentiable and compatible with automatic differentiation frameworks (e.g., PyTorch autograd). The loss landscape is shaped such that the network predicts outputs for which a short FIR filter can optimally project clean sources onto the estimated signals, minimizing true perceptual and ASR-relevant distortions.
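A minimal single-source sketch in PyTorch, assuming 1-D tensors and default values for $K$ and $\varepsilon$ chosen here for illustration (this is not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def ci_sdr_loss(s, d_hat, K=512, eps=1e-6):
    """Negative CI-SDR in dB for one source; every step is differentiable,
    so gradients flow back into d_hat (the network output)."""
    T = s.shape[-1]
    # (T, K) Toeplitz convolution matrix: column k is s delayed by k samples
    S = torch.stack([F.pad(s, (k, 0))[:T] for k in range(K)], dim=1)
    R = S.T @ S + eps * torch.eye(K, dtype=s.dtype)   # regularized normal equations
    a = torch.linalg.solve(R, S.T @ d_hat)            # differentiable Wiener-Hopf solve
    proj = S @ a
    return -10 * torch.log10(proj.pow(2).sum() / ((proj - d_hat).pow(2).sum() + eps))
```

For multiple sources, this would be evaluated per source under the chosen PIT assignment and averaged, matching the loss definition above.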
Key points for implementation:
- Use `torch.linalg.solve` (the successor of the deprecated `torch.solve`) or `torch.cholesky_solve` to ensure differentiability.
- Maintain a small positive regularization $\varepsilon$ for conditioning.
- No approximation or surrogate loss is necessary; true CI-SDR can be computed directly.
This allows seamless integration into end-to-end deep learning pipelines for audio source separation, given sufficient computational resources.
5. Experimental Findings and Performance
On a two-speaker, seven-microphone circular-array speech separation task using LibriSpeech utterances in reverberant environments (SNR 10–20 dB), systems were trained using identical mask-estimator architectures but different loss functions (frequency-domain SDR, time-domain SDR, SI-SDR, CI-SDR), with test-time MVDR beamforming.
Key evaluation metrics for each configuration are summarized below:
| Loss | PESQ | SDR (dB) | STOI | WER (%) |
|---|---|---|---|---|
| F-SDR | 1.99 | 15.4 | 0.893 | 7.9 |
| SDR | 1.98 | 15.1 | 0.893 | 6.7 |
| SI-SDR | 2.01 | 15.6 | 0.895 | 6.9 |
| CI-SDR | 2.46 | 20.4 | 0.930 | 4.4 |
Replacing the standard power-iteration RTF estimator with a full eigen-decomposition further reduced WER to 4.2%. In comparison, a conventional PIT system using a frequency-domain loss yielded WER 45.6%. An oracle mask + MVDR combination achieved WER 3.8%. Thus, CI-SDR:
- Improved PESQ by 0.45 over SI-SDR.
- Increased BSS Eval SDR by approximately 4.8 dB.
- Reduced WER from 6.9% (SI-SDR) to 4.4%.
- Approached oracle-mask baseline performance.
These results indicate that CI-SDR provides a more meaningful optimization signal for systems in realistic multi-microphone, reverberant speech separation scenarios.
6. Limitations, Practical Notes, and Perspectives
Computational Complexity and Memory
- Forming $\mathbf S^\top \mathbf S$ and $\mathbf S^\top \hat{\mathbf d}$ requires $O(TK^2)$ and $O(TK)$ operations, respectively; for typical utterance lengths $T$ and short filter lengths $K$, this is tractable on contemporary GPUs.
- Solving the regularized normal equations via Cholesky is $O(K^3)$ per utterance, acceptable for moderate filter lengths $K$.
- Storing the full Toeplitz matrix can be memory-intensive; accumulating correlation products in streamed fashion avoids this bottleneck.
- Regularization ($\varepsilon > 0$) is necessary for stability, especially in low-energy regions.
Implementation Considerations
- Numerical stability may be challenged in silent portions or under extreme ill-conditioning; regularization mitigates this.
- Chunk-wise or streaming estimation is feasible to reduce memory demands or for online applications.
Future Directions
- Extension to end-to-end tasks beyond source separation such as dereverberation or echo cancellation, whenever the relation between target and reference can be expressed as a convolution.
- Joint ASR + CI-SDR optimization, integrating the loss function with sequence-level ASR criteria for further Word Error Rate minimization.
- Variable-length or adaptive FIR filters per utterance—enabling the system to learn effective filter lengths in response to environmental variability.
- Direct integration of CI-SDR loss into multi-channel, time-domain architectures such as TasNet variants.
CI-SDR thus formalizes an effective, differentiable measure of distance between waveforms in settings where unknown, benign convolutional filtering should not degrade the assessed signal quality or learning objectives, providing a metric and optimization target closely aligned with the realities of real-world acoustic environments (Boeddeker et al., 2020).