Scale-Invariant Signal-to-Artifacts Ratio (SI-SAR)
- SI-SAR is a metric for quantifying artifacts in music source separation by isolating residual errors independent of global gain.
- It improves alignment with human ratings, particularly for drums and bass, by focusing solely on artifact distortions while omitting interference errors.
- SI-SAR employs optimal scaling and residual projection to achieve a scale-invariant assessment of audio quality in mono signals.
The Scale-Invariant Signal-to-Artifacts Ratio (SI-SAR) is an objective performance metric for music source separation systems. It quantifies the artifacts in an estimated audio signal after optimally removing gain and interference effects. SI-SAR is particularly relevant where artifactual distortions, such as unnatural noises or processing artifacts, are perceptually salient, and it omits interference error contributions by construction. Empirical results on the MUSDB18 dataset show closer alignment with human listener ratings for certain instrumental stems, especially drums and bass. SI-SAR belongs to the family of energy-ratio metrics and differs from source-to-distortion ratio (SDR) and scale-invariant SDR (SI-SDR) in which error terms it measures and in its invariance properties (Jaffe et al., 9 Jul 2025).
1. Mathematical Foundation
Let $s$ denote the clean reference source and $\hat{s}$ the estimated source. Other ground-truth stems are indexed as $s_k$, $k = 1, \dots, K$. All signals are downmixed to mono and strictly time-aligned.
The SI-SAR computation proceeds as follows:
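- Optimal Scaling (to remove gain differences between the estimate $\hat{s}$ and the reference $s$): $\alpha = \langle \hat{s}, s \rangle / \langle s, s \rangle$
- Target and Residual Components: $e_{\text{target}} = \alpha s$ and $e_{\text{res}} = \hat{s} - e_{\text{target}}$
- Interference Projection (over the other stems $s_k$): $e_{\text{interf}} = \sum_{k} \frac{\langle e_{\text{res}}, s_k \rangle}{\langle s_k, s_k \rangle} s_k$
- Artifact Component: $e_{\text{artif}} = e_{\text{res}} - e_{\text{interf}}$
- SI-SAR Metric (in dB): $\text{SI-SAR} = 10 \log_{10} \left( \|e_{\text{target}}\|^2 / (\|e_{\text{artif}}\|^2 + \epsilon) \right)$
with $\|x\|^2 = \langle x, x \rangle$ and a small constant $\epsilon$ ($10^{-12}$ in the implementation in Section 6) to avoid division by zero.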
Term Clarification
- $e_{\text{target}}$: optimally scaled reference signal (true source energy).
- $e_{\text{artif}}$: error residual that remains after projecting out the target and all interfering sources (music artifacts, noise).
- This construction confers complete invariance to global gain changes of the estimate $\hat{s}$.
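The decomposition and its gain invariance can be checked numerically. The following is a minimal pure-Python sketch, not the authors' code: the helper names (`dot`, `decompose`, `si_sar_db`) and the six-sample toy signals are hypothetical, chosen only so that the interferer is orthogonal to the reference.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def decompose(est, ref, interferers=()):
    """Split `est` into target, interference, and artifact parts."""
    alpha = dot(est, ref) / dot(ref, ref)            # optimal scaling
    e_target = [alpha * r for r in ref]
    e_res = [e - t for e, t in zip(est, e_target)]   # residual
    e_interf = [0.0] * len(ref)
    for s_k in interferers:                          # per-stem projection
        coeff = dot(e_res, s_k) / dot(s_k, s_k)
        e_interf = [p + coeff * v for p, v in zip(e_interf, s_k)]
    e_artif = [r - p for r, p in zip(e_res, e_interf)]
    return e_target, e_interf, e_artif


def si_sar_db(est, ref, interferers=(), eps=1e-12):
    """Energy ratio of target to artifact component, in dB."""
    e_target, _, e_artif = decompose(est, ref, interferers)
    return 10.0 * math.log10(dot(e_target, e_target) / (dot(e_artif, e_artif) + eps))


# Hypothetical toy signals (illustrative only, not from the paper).
ref = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
interf = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
artifact = [0.1] * 6
est = [0.8 * r + 0.3 * i + a for r, i, a in zip(ref, interf, artifact)]

base = si_sar_db(est, ref, [interf])
louder = si_sar_db([3.0 * x for x in est], ref, [interf])
print(f"SI-SAR: {base:.2f} dB, after 3x gain on the estimate: {louder:.2f} dB")
```

Scaling the estimate scales $\alpha$, the residual, and the artifact component by the same factor, so the ratio is unchanged up to the negligible effect of $\epsilon$.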
2. Distinction from SDR and SI-SDR Metrics
SI-SAR operates within the same conceptual domain as SDR and SI-SDR but differs in its decomposition and focus:
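| Metric | Denominator Error Terms | Scale Invariance | Artifact Sensitivity |
|---|---|---|---|
| SDR | Total distortion (interference, noise, and artifacts combined) | No (gain sensitive) | Low |
| SI-SDR | Full residual after optimal scaling (all non-target error) | Yes | Moderate |
| SI-SAR | Artifact residual only (interference projected out) | Yes | High |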
- BSS Eval v4 SDR retains spatial filtering (512 taps), is sensitive to global gain, and may mask underlying distortions.
- SI-SDR removes scale sensitivity using a scalar gain and merges all errors except the target into one composite term.
- SI-SAR isolates artifact error, disregarding interference, making it particularly sensitive to artifact distortions (Jaffe et al., 9 Jul 2025).
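The practical difference between the SI-SDR and SI-SAR denominators shows up when only the amount of interference leakage changes. The sketch below is illustrative: `si_sdr_and_si_sar`, `make_estimate`, and the toy signals are hypothetical. The interferer is deliberately orthogonal to the reference, which is the condition under which leakage leaves SI-SAR exactly unchanged.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def ratio_db(num_energy, den_energy, eps=1e-12):
    """Energy ratio in dB with a small guard against division by zero."""
    return 10.0 * math.log10(num_energy / (den_energy + eps))


def si_sdr_and_si_sar(est, ref, interferers=()):
    """Return (SI-SDR, SI-SAR) in dB from one shared decomposition."""
    alpha = dot(est, ref) / dot(ref, ref)
    e_target = [alpha * r for r in ref]
    e_res = [e - t for e, t in zip(est, e_target)]
    # Remove the projection of the residual onto each interfering stem.
    e_artif = list(e_res)
    for s_k in interferers:
        coeff = dot(e_res, s_k) / dot(s_k, s_k)
        e_artif = [a - coeff * v for a, v in zip(e_artif, s_k)]
    target_energy = dot(e_target, e_target)
    si_sdr = ratio_db(target_energy, dot(e_res, e_res))      # all non-target error
    si_sar = ratio_db(target_energy, dot(e_artif, e_artif))  # artifacts only
    return si_sdr, si_sar


# Hypothetical toy signals; `interf` is orthogonal to `ref`.
ref = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
interf = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
artifact = [0.1, 0.0, -0.1, 0.0, 0.1, 0.0]


def make_estimate(leak):
    """Reference plus `leak` times the interferer plus a fixed artifact."""
    return [r + leak * i + a for r, i, a in zip(ref, interf, artifact)]


light = si_sdr_and_si_sar(make_estimate(0.1), ref, [interf])
heavy = si_sdr_and_si_sar(make_estimate(0.5), ref, [interf])
print("leak=0.1 -> SI-SDR %.2f dB, SI-SAR %.2f dB" % light)
print("leak=0.5 -> SI-SDR %.2f dB, SI-SAR %.2f dB" % heavy)
```

In this toy case SI-SDR drops as leakage grows while SI-SAR stays put. That is the behaviour the list above describes, and it is also why SI-SAR can underweight interference.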
3. Empirical Evaluation Protocol: MUSDB18
The MUSDB18 dataset served as the evaluation corpus. The procedure iterated over reference stems (vocals, drums, bass, other) and their corresponding estimates as follows:
- Segmentation: Extracted 10-second clips per SiSEC2018 protocol.
- Estimation: Generated multiple () system outputs for each stem.
- Signal Preparation: Downmixed all signals to mono, enforced exact temporal alignment.
- Metric Computation: Calculated SI-SAR for each estimate using the definition above, on the full 10-second segment.
- Correlation Analysis: For each listener, stem, and track, ranked estimates by SI-SAR; calculated Kendall’s τ between rank order and listener ratings ($0$–$100$ scale).
- Aggregation: Averaged τ values over listeners and tracks; reported by stem.
This evaluation explicitly avoided windowing beyond the basic segmentation, relying on direct metric computation across entire clips for consistency.
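The correlation and aggregation steps can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the source does not say which Kendall variant was used, so the tie-corrected tau-b form is an assumption here, and `kendall_tau_b`, `mean_tau`, and the example scores and ratings are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")


def mean_tau(trials):
    """Average tau over (metric_scores, listener_ratings) pairs, skipping NaNs."""
    taus = [kendall_tau_b(scores, ratings) for scores, ratings in trials]
    valid = [t for t in taus if not math.isnan(t)]
    return sum(valid) / len(valid) if valid else float("nan")


# Hypothetical data: SI-SAR scores (dB) for four system estimates of one stem,
# and two listeners' 0-100 ratings of the same four estimates.
si_sar_scores = [12.1, 8.4, 15.0, 5.2]
listener_a = [70, 40, 85, 55]
listener_b = [60, 30, 90, 20]

tau_a = kendall_tau_b(si_sar_scores, listener_a)
tau_b = kendall_tau_b(si_sar_scores, listener_b)
print(f"tau (listener A) = {tau_a:.3f}, tau (listener B) = {tau_b:.3f}")
print(f"mean tau = {mean_tau([(si_sar_scores, listener_a), (si_sar_scores, listener_b)]):.3f}")
```

Because τ depends only on rank order, it needs no calibration between the dB scale of the metric and the 0–100 rating scale.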
4. Comparative Performance: Correlation with Human Judgments
Kendall’s τ correlations between objective metrics and human rankings on MUSDB18 are summarized:
| Stem Type | BSSEval SDR | BSSEval SAR | SI-SDR | SI-SAR |
|---|---|---|---|---|
| Vocals | 0.316 | 0.258 | 0.197 | 0.246 |
| Drums | 0.165 | 0.124 | 0.203 | 0.240 |
| Bass | 0.086 | 0.181 | 0.084 | 0.116 |
| Other | 0.273 | 0.199 | 0.277 | 0.271 |
| Avg | 0.211 | 0.190 | 0.190 | 0.218 |
SI-SAR achieved the highest τ for drums (0.240) and exceeded both BSSEval SDR and SI-SDR on bass (0.116 versus 0.086 and 0.084), although BSSEval SAR has the highest tabulated value for bass (0.181). For vocals, BSSEval SDR remained the best single metric (0.316), and for the "other" stem SI-SDR led narrowly (0.277 versus 0.271 for SI-SAR). SI-SAR achieved the highest average τ over all stems (0.218), indicating better overall alignment with listener rankings than the alternatives, particularly where artifact errors dominate perceptual judgments (Jaffe et al., 9 Jul 2025).
5. Interpretive Context and Author Recommendations
Results suggest SI-SAR is especially effective for stems where artifact errors significantly affect perceived quality, notably drums and bass. This is attributed to its denominator isolating artifact residuals after projections onto both target and interfering sources.
For vocals, however, BSSEval SDR remains the best-correlated metric, likely due to human sensitivity to spatial fidelity and gain mismatches. The authors explicitly advise against a universal metric: they recommend stem-specific selection, contingent on the perceptual characteristics most relevant to each source type.
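One simple way to act on stem-specific selection is to pick, for each stem, the metric with the highest tabulated τ. The sketch below does only that. It copies the τ values from the table in Section 4; the helper `best_metric_per_stem` is hypothetical and is not a procedure described by the authors.

```python
# Kendall's tau values copied from the Section 4 table.
TAU = {
    "vocals": {"BSSEval SDR": 0.316, "BSSEval SAR": 0.258, "SI-SDR": 0.197, "SI-SAR": 0.246},
    "drums":  {"BSSEval SDR": 0.165, "BSSEval SAR": 0.124, "SI-SDR": 0.203, "SI-SAR": 0.240},
    "bass":   {"BSSEval SDR": 0.086, "BSSEval SAR": 0.181, "SI-SDR": 0.084, "SI-SAR": 0.116},
    "other":  {"BSSEval SDR": 0.273, "BSSEval SAR": 0.199, "SI-SDR": 0.277, "SI-SAR": 0.271},
}


def best_metric_per_stem(tau_table):
    """Map each stem to the metric with the largest tau (hypothetical helper)."""
    return {stem: max(scores, key=scores.get) for stem, scores in tau_table.items()}


selection = best_metric_per_stem(TAU)
for stem, metric in selection.items():
    print(f"{stem:>6}: {metric} (tau = {TAU[stem][metric]:.3f})")
```

A pure argmax over these figures selects BSSEval SDR for vocals, SI-SAR for drums, BSSEval SAR for bass, and SI-SDR for other. Several margins are small (for example 0.277 versus 0.271 on other), so such a mapping is a starting point rather than a settled recommendation.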
Limitations of SI-SAR include:
- Applicability restricted to mono signals; spatial imaging errors are ignored.
- Excludes perceptual aspects beyond energy ratios, such as timbral deviations.
- May underweight interference errors, especially in scenarios dominated by masking phenomena.
6. Technical Implementation
The following Python function provides a concrete implementation of SI-SAR, as evaluated in the referenced paper:

```python
import numpy as np


def compute_si_sar(est, ref, interferers=()):
    """Return SI-SAR in dB.

    est, ref: 1D numpy arrays of the same length.
    interferers: iterable of 1D numpy arrays (other ground-truth stems).
    """
    # 1) optimal scaling
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    e_target = alpha * ref
    # 2) residual
    e_res = est - e_target
    # 3) project out interference, one stem at a time
    e_if = np.zeros_like(e_res)
    for s_k in interferers:
        coeff = np.dot(e_res, s_k) / np.dot(s_k, s_k)
        e_if += coeff * s_k
    # 4) artifact component
    e_art = e_res - e_if
    # 5) compute SI-SAR
    num = np.sum(e_target**2)
    den = np.sum(e_art**2) + 1e-12  # avoid division by zero
    si_sar_db = 10 * np.log10(num / den)
    return si_sar_db
```
In evaluation loops, reference and interfering stems are loaded, system estimates processed, and resulting SI-SAR scores subjected to correlation analysis against listener rankings, as detailed above (Jaffe et al., 9 Jul 2025).
7. Practical and Methodological Implications
SI-SAR provides a robust, energy-based, and scale-invariant mechanism for quantifying artifactual degradations in source separation outputs. Its design supports transparent error attribution, improved interpretability for instrument stem evaluation, and enhanced consistency when global gain and interference levels vary unpredictably.
A plausible implication is that future studies on music source separation may benefit from metric ensembles where SI-SAR ranks artifacts and SDR evaluates target fidelity, within a stem-adaptive evaluation protocol. However, the mono-only constraint and the inability to capture complex perceptual transformations indicate potential areas for further methodological refinement and metric innovation (Jaffe et al., 9 Jul 2025).