Scale-Invariant Signal-to-Artifacts Ratio (SI-SAR)

Updated 3 December 2025

SI-SAR is a metric for quantifying artifacts in music source separation by isolating residual errors independent of global gain.
It improves alignment with human ratings, particularly for drums and bass, by focusing solely on artifact distortions while omitting interference errors.
SI-SAR employs optimal scaling and residual projection to achieve a scale-invariant assessment of audio quality in mono signals.

The Scale-Invariant Signal-to-Artifacts Ratio (SI-SAR) is an objective performance metric for music source separation systems. It quantifies the artifacts present in an estimated audio signal after gain differences and interference have been removed by optimal scaling and projection. SI-SAR is most relevant where artifactual distortions, such as unnatural noises introduced by processing, are perceptually salient, and it omits interference error by construction. Empirical results on the MUSDB18 dataset indicate closer alignment with human listener ratings for certain instrumental stems, especially drums and bass. SI-SAR belongs to a family of energy-ratio metrics and differs from source-to-distortion ratio (SDR) and scale-invariant SDR (SI-SDR) in which error terms it counts and in its invariance properties (Jaffe et al., 9 Jul 2025).

1. Mathematical Foundation

Let $s[n]$ denote the clean reference source for stem $j$ and $\hat{y}[n]$ its estimate. The other ground-truth stems are indexed as $\{s_k[n]\}_{k \ne j}$. All signals are downmixed to mono and strictly time-aligned.

The SI-SAR computation proceeds as follows:

Optimal Scaling (to remove gain differences):

$\alpha = \frac{\sum_n \hat{y}[n]s[n]}{\sum_n s[n]^2}$

Target and Residual Components:

$e_{\text{target}}[n] = \alpha s[n]$

$e_{\text{residual}}[n] = \hat{y}[n] - e_{\text{target}}[n]$

Interference Projection:

$e_{\text{interf}}[n] = \sum_{k \ne j} \left[\frac{\sum_n e_{\text{residual}}[n] s_k[n]}{\sum_n s_k[n]^2}\right] s_k[n]$

Artifact Component:

$e_{\text{artifact}}[n] = e_{\text{residual}}[n] - e_{\text{interf}}[n]$

SI-SAR Metric (in dB):

$\text{SI-SAR} = 10 \log_{10} \left( \frac{\|e_{\text{target}}\|^2}{\|e_{\text{artifact}}\|^2} \right)$

with $\|\cdot\|^2 = \sum_n (\cdot)^2$ .

Term Clarification

$e_{\text{target}}$ : optimally scaled reference signal (true source energy).
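$e_{\text{artifact}}$ : error residual left after removing the projections onto the target and onto each interfering source (musical artifacts, noise). Because each interferer is projected separately, exact orthogonality to every stem is guaranteed only when the stems are mutually orthogonal.
This construction makes the metric invariant to global gain changes of $\hat{y}[n]$.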
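The gain invariance can be checked directly from the definitions above. The short derivation below is a sketch that introduces one symbol not in the source, a nonzero gain $c$ applied to the estimate, and marks the resulting quantities with a prime.

```latex
% Assumption: c \neq 0 is an arbitrary global gain applied to the estimate.
\hat{y}'[n] = c\,\hat{y}[n]
\;\Rightarrow\;
\alpha' = \frac{\sum_n c\,\hat{y}[n]\,s[n]}{\sum_n s[n]^2} = c\,\alpha

e'_{\text{target}} = c\,e_{\text{target}}, \qquad
e'_{\text{residual}} = c\,e_{\text{residual}}

% The interference projection is linear in e_residual, so it also scales by c:
e'_{\text{interf}} = c\,e_{\text{interf}}, \qquad
e'_{\text{artifact}} = c\,e_{\text{artifact}}

\text{SI-SAR}' = 10 \log_{10} \frac{c^2\,\|e_{\text{target}}\|^2}{c^2\,\|e_{\text{artifact}}\|^2} = \text{SI-SAR}
```

Numerator and denominator both pick up the factor $c^2$, so the ratio, and therefore the dB value, is unchanged.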

2. Distinction from SDR and SI-SDR Metrics

SI-SAR operates within the same conceptual domain as SDR and SI-SDR but differs in its decomposition and focus:

Metric	Denominator Error Terms	Scale Invariance	Artifact Sensitivity
SDR	$e_{\text{spatial}} + e_{\text{interf}} + e_{\text{artifact}}$	No (gain sensitive)	Low
SI-SDR	$e_{\text{interf}} + e_{\text{artifact}}$	Yes	Moderate
SI-SAR	$e_{\text{artifact}}$	Yes	High

BSS Eval v4 SDR retains spatial filtering (512 taps), is sensitive to global gain, and may mask underlying distortions.
SI-SDR removes scale sensitivity using a scalar gain and merges all errors except the target into one composite term.
SI-SAR isolates artifact error, disregarding interference, making it particularly sensitive to artifact distortions (Jaffe et al., 9 Jul 2025).
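The difference between the SI-SDR and SI-SAR denominators can be illustrated numerically. The sketch below is not taken from the referenced paper: the helper names (`decompose`, `ratio_db`, `si_sdr_and_si_sar`) are hypothetical, and the signals are synthetic sinusoids plus noise rather than MUSDB18 audio. It uses the denominators listed in the table above and varies only the amount of leaked interference.

```python
import numpy as np

def decompose(est, ref, interferers):
    """Split a mono estimate into target, interference, and artifact parts."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)      # optimal scalar gain
    e_target = alpha * ref
    e_res = est - e_target
    e_interf = np.zeros_like(e_res)
    for s_k in interferers:                          # project residual onto each stem
        e_interf += (np.dot(e_res, s_k) / np.dot(s_k, s_k)) * s_k
    return e_target, e_interf, e_res - e_interf

def ratio_db(num, den, eps=1e-12):
    """Energy ratio of two signals in dB; eps guards against division by zero."""
    return 10.0 * np.log10(np.sum(num**2) / (np.sum(den**2) + eps))

def si_sdr_and_si_sar(est, ref, interferers):
    e_target, e_interf, e_art = decompose(est, ref, interferers)
    si_sdr = ratio_db(e_target, e_interf + e_art)    # all non-target error
    si_sar = ratio_db(e_target, e_art)               # artifact error only
    return si_sdr, si_sar

# Synthetic, hypothetical signals (one second at 16 kHz).
rng = np.random.default_rng(0)
n = np.arange(16000)
ref = np.sin(2 * np.pi * 110 * n / 16000)            # stand-in target stem
other = np.sin(2 * np.pi * 313 * n / 16000)          # stand-in interfering stem
noise = 0.01 * rng.standard_normal(n.size)           # stand-in artifact

low_leak = ref + 0.05 * other + noise
high_leak = ref + 0.50 * other + noise

sdr_low, sar_low = si_sdr_and_si_sar(low_leak, ref, [other])
sdr_high, sar_high = si_sdr_and_si_sar(high_leak, ref, [other])
print(f"low leak:  SI-SDR {sdr_low:.1f} dB, SI-SAR {sar_low:.1f} dB")
print(f"high leak: SI-SDR {sdr_high:.1f} dB, SI-SAR {sar_high:.1f} dB")
```

In this toy setup the extra leakage lowers SI-SDR substantially while SI-SAR stays essentially the same, which is the behavior the table attributes to an artifact-only denominator.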

3. Empirical Evaluation Protocol: MUSDB18

The MUSDB18 dataset served as the evaluation corpus. The procedure iterated over reference stems (vocals, drums, bass, other) and their corresponding estimates as follows:

Segmentation: Extracted 10-second clips per SiSEC2018 protocol.
Estimation: Generated multiple ( $N$ ) system outputs for each stem.
Signal Preparation: Downmixed all signals to mono, enforced exact temporal alignment.
Metric Computation: Calculated SI-SAR for each estimate using the definition above, on the full 10-second segment.
Correlation Analysis: For each listener, stem, and track, ranked $N$ estimates by SI-SAR; calculated Kendall’s τ between rank order and listener ratings ($0$–$100$ scale).
Aggregation: Averaged τ values over listeners and tracks; reported by stem.

This evaluation explicitly avoided windowing beyond the basic segmentation, relying on direct metric computation across entire clips for consistency.
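The correlation and aggregation steps can be sketched as follows. This is a minimal illustration, not the paper's code: the source does not state which Kendall's τ variant or tie handling was used, so the sketch assumes tau-a with tied pairs counted as neither concordant nor discordant. The function name `kendall_tau_a`, the scores, and the listener ratings are all hypothetical.

```python
from itertools import combinations
from statistics import mean

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumption: tied pairs contribute zero; the source does not specify
    the tau variant, so this choice is illustrative only.
    """
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant
    return score / len(pairs)

# Hypothetical data: SI-SAR (dB) for N = 4 estimates of one drums clip,
# and 0-100 ratings of the same four estimates from two listeners.
si_sar_scores = [12.1, 7.4, 9.8, 3.2]
ratings_by_listener = {
    "listener_a": [80, 55, 60, 20],
    "listener_b": [70, 75, 60, 30],
}

# One tau per listener for this stem and track, then the average.
taus = {name: kendall_tau_a(si_sar_scores, r)
        for name, r in ratings_by_listener.items()}
mean_tau = mean(taus.values())
print(taus, mean_tau)
```

In a full evaluation the same per-listener τ would be computed for every track and then averaged per stem, as described in the aggregation step.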

4. Comparative Performance: Correlation with Human Judgments

Kendall’s τ correlations between objective metrics and human rankings on MUSDB18 are summarized:

Stem Type	BSSEval SDR	BSSEval SAR	SI-SDR	SI-SAR
Vocals	0.316	0.258	0.197	0.246
Drums	0.165	0.124	0.203	0.240
Bass	0.086	0.181	0.084	0.116
Other	0.273	0.199	0.277	0.271
Avg	0.211	0.190	0.190	0.218

SI-SAR achieved the highest τ for drums (0.240). For bass it exceeded SI-SDR (0.084) and BSSEval SDR (0.086) with 0.116, although the tabulated BSSEval SAR value (0.181) is higher still. For vocals, BSSEval SDR remained the best single metric (0.316), and for the "other" stem SI-SDR led narrowly (0.277 versus 0.271). SI-SAR achieved the highest average τ over all stems (0.218), indicating the best overall alignment with listener rankings among the four metrics, with the clearest advantage where artifact errors dominate perceptual judgments (Jaffe et al., 9 Jul 2025).
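The per-stem and average comparisons can be recomputed from the tabulated values. The snippet below is a small sketch that only re-reads the table above; the variable names are illustrative, and no data beyond the table is used.

```python
# Kendall's tau values copied from the table above (rows: stems, columns: metrics).
TAU = {
    "Vocals": {"BSSEval SDR": 0.316, "BSSEval SAR": 0.258, "SI-SDR": 0.197, "SI-SAR": 0.246},
    "Drums":  {"BSSEval SDR": 0.165, "BSSEval SAR": 0.124, "SI-SDR": 0.203, "SI-SAR": 0.240},
    "Bass":   {"BSSEval SDR": 0.086, "BSSEval SAR": 0.181, "SI-SDR": 0.084, "SI-SAR": 0.116},
    "Other":  {"BSSEval SDR": 0.273, "BSSEval SAR": 0.199, "SI-SDR": 0.277, "SI-SAR": 0.271},
}

metrics = list(TAU["Vocals"])

# Metric with the highest tau for each stem.
best_per_stem = {stem: max(row, key=row.get) for stem, row in TAU.items()}

# Unweighted mean over the four stems for each metric.
mean_per_metric = {m: sum(row[m] for row in TAU.values()) / len(TAU) for m in metrics}
best_overall = max(mean_per_metric, key=mean_per_metric.get)

for stem, metric in best_per_stem.items():
    print(f"{stem}: {metric} ({TAU[stem][metric]:.3f})")
for m in metrics:
    print(f"{m}: mean tau {mean_per_metric[m]:.3f}")
```

Means recomputed from the rounded table entries can differ from the Avg row in the last digit, since the published averages were presumably taken before rounding.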

5. Interpretive Context and Author Recommendations

Results suggest SI-SAR is especially effective for stems where artifact errors significantly affect perceived quality, notably drums and bass. This is attributed to its denominator isolating artifact residuals after projections onto both target and interfering sources.

For vocals, however, traditional SDR remains the strongest predictor, likely due to human sensitivity to spatial fidelity and gain mismatches. The authors explicitly advise against a universal metric: stem-specific selection is recommended, contingent on the perceptual characteristics most relevant to each source type.

Limitations of SI-SAR include:

Applicability restricted to mono signals; spatial imaging errors are ignored.
Excludes perceptual aspects beyond energy ratios, such as timbral deviations.
May underweight interference errors, especially in scenarios dominated by masking phenomena.

6. Technical Implementation

The following Python function implements SI-SAR according to the definition in Section 1:

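import numpy as np

def compute_si_sar(est, ref, interferers=(), eps=1e-12):
    """Return SI-SAR in dB for mono signals.

    est, ref: 1-D arrays of equal length (estimate and clean reference).
    interferers: iterable of 1-D arrays, the other ground-truth stems.
    """
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # 1) optimal scaling removes any global gain difference
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    e_target = alpha * ref
    # 2) residual after removing the scaled target
    e_res = est - e_target
    # 3) project the residual onto each interfering stem
    e_if = np.zeros_like(e_res)
    for s_k in interferers:
        coeff = np.dot(e_res, s_k) / np.dot(s_k, s_k)
        e_if += coeff * s_k
    # 4) artifact component
    e_art = e_res - e_if
    # 5) energy ratio in dB; eps guards against division by zero
    num = np.sum(e_target**2)
    den = np.sum(e_art**2) + eps
    return 10 * np.log10(num / den)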

In evaluation loops, reference and interfering stems are loaded, system estimates processed, and resulting SI-SAR scores subjected to correlation analysis against listener rankings, as detailed above (Jaffe et al., 9 Jul 2025).
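A minimal sketch of such a loop for a single stem is shown below. It is illustrative only: the system names, the synthetic sinusoid-plus-noise signals, and the compact `si_sar_db` helper (a restatement of the steps above so the snippet runs on its own) are assumptions, not material from the referenced paper.

```python
import numpy as np

def si_sar_db(est, ref, interferers=(), eps=1e-12):
    """Compact restatement of the SI-SAR steps for mono 1-D arrays."""
    e_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_res = est - e_target
    e_art = e_res.copy()
    for s_k in interferers:
        # each projection uses the original residual, as in the definition
        e_art = e_art - (np.dot(e_res, s_k) / np.dot(s_k, s_k)) * s_k
    return 10.0 * np.log10(np.sum(e_target**2) / (np.sum(e_art**2) + eps))

rng = np.random.default_rng(1)
t = np.arange(8000) / 8000.0
ref = np.sin(2 * np.pi * 60 * t)                  # stand-in reference stem
interferers = [np.sin(2 * np.pi * 200 * t)]       # stand-in for the other stems
noise = rng.standard_normal(t.size)               # stand-in artifact source

# Hypothetical systems with different gains and artifact levels.
estimates = {
    "system_a": 1.0 * ref + 0.02 * noise,
    "system_b": 0.5 * ref + 0.05 * noise + 0.2 * interferers[0],
    "system_c": 2.0 * ref + 0.60 * noise,
}

scores = {name: si_sar_db(est, ref, interferers) for name, est in estimates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)   # best first
print(ranking)

# Rescaling an estimate should leave its score unchanged.
rescaled = si_sar_db(3.0 * estimates["system_b"], ref, interferers)
print(abs(rescaled - scores["system_b"]))
```

The resulting ranking is what would be compared against listener ratings with Kendall's τ. The final check illustrates that a change in output gain alone does not move an estimate within the ranking.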

7. Practical and Methodological Implications

SI-SAR provides a robust, energy-based, and scale-invariant mechanism for quantifying artifactual degradations in source separation outputs. Its design supports transparent error attribution, improved interpretability for instrument stem evaluation, and more consistent scores when global gain and interference levels vary unpredictably.

A plausible implication is that future studies on music source separation may benefit from metric ensembles where SI-SAR ranks artifacts and SDR evaluates target fidelity, within a stem-adaptive evaluation protocol. However, the mono-only constraint and the inability to capture complex perceptual transformations indicate potential areas for further methodological refinement and metric innovation (Jaffe et al., 9 Jul 2025).

References

Jaffe et al. (2025). Musical Source Separation Bake-Off: Comparing Objective Metrics with Human Perception.
