Complex Ratio Masks (CRM) in Audio Processing
- Complex Ratio Masks (CRMs) are time–frequency-domain masks defined on the STFT that predict both real and imaginary components to correct phase artifacts.
- They are estimated by complex-valued neural networks, such as U-Net and BLSTM models, which integrate spatial features and employ consistency constraints for effective training.
- CRMs yield measurable gains in SDR, PESQ, and SI-SDR compared to magnitude-only masks, significantly improving audio source separation and speech enhancement tasks.
A Complex Ratio Mask (CRM) is a time–frequency-domain mask defined on the complex short-time Fourier transform (STFT) of a signal that enables simultaneous magnitude and phase estimation. CRMs generalize conventional magnitude masks by directly predicting both real and imaginary components, allowing explicit correction of phase artifacts that limit the performance of magnitude-only approaches in speech enhancement, music source separation, and multi-channel spatial filtering. Recent research demonstrates that CRMs yield significant improvements in objective and perceptual metrics, notably signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ), when integrated into modern deep learning frameworks for audio source separation and enhancement tasks (Jansson et al., 2021, Goswami et al., 2020, Du et al., 2019, Gu et al., 2021).
1. Mathematical Formulation and Interpretation
Let $Y(t,f)$ denote the complex STFT of a mixture signal, and $S(t,f)$ the STFT of a target source. The CRM at each time–frequency (T–F) bin is defined as

$$M(t,f) = \frac{S(t,f)}{Y(t,f)}.$$

Expressed via real and imaginary parts,

$$M_r = \frac{S_r Y_r + S_i Y_i}{Y_r^2 + Y_i^2}, \qquad M_i = \frac{S_i Y_r - S_r Y_i}{Y_r^2 + Y_i^2}.$$

Thus, $M_r$ mixes and rescales the input's real and imaginary parts, while $M_i$ facilitates phase rotation.
When a CRM is applied to the mixture,

$$\hat{S}(t,f) = M(t,f)\,Y(t,f),$$

or, in polar form, if $M = |M|e^{j\theta_M}$ and $Y = |Y|e^{j\theta_Y}$,

$$\hat{S} = |M|\,|Y|\,e^{j(\theta_M + \theta_Y)},$$

so the CRM scales the magnitude and rotates the phase to approximate the target.
CRMs can be constrained by applying pointwise $\tanh$ nonlinearities to avoid unbounded mask values, such that $M_r = \tanh(O_r)$ and $M_i = \tanh(O_i)$ for neural outputs $O_r$ and $O_i$ (Jansson et al., 2021, Goswami et al., 2020).
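The mask definition and its bounded application can be sketched directly in NumPy. This is a minimal illustration of the formulas above; the function names (`ideal_crm`, `bounded_mask`, `apply_mask`) are illustrative, not from the cited papers:

```python
import numpy as np

def ideal_crm(Y, S, eps=1e-8):
    """Ideal complex ratio mask M = S / Y, computed per T-F bin.

    Y, S: complex STFT arrays of the mixture and target (freq x frames).
    The eps term guards against division by near-zero mixture bins.
    """
    denom = Y.real**2 + Y.imag**2 + eps
    M_real = (S.real * Y.real + S.imag * Y.imag) / denom
    M_imag = (S.imag * Y.real - S.real * Y.imag) / denom
    return M_real + 1j * M_imag

def bounded_mask(O_real, O_imag):
    """Bound unconstrained network outputs with pointwise tanh."""
    return np.tanh(O_real) + 1j * np.tanh(O_imag)

def apply_mask(M, Y):
    """S_hat = M * Y: scales |Y| by |M| and rotates the phase by angle(M)."""
    return M * Y
```

Applying the ideal mask to the mixture recovers the target spectrogram exactly (up to the eps regularization), which is what the bounded, learned mask approximates.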
2. Network Architectures and Training Paradigms
CRMs are typically estimated by neural networks operating in the T–F domain. Input representations include either the magnitude $|Y|$ (for magnitude-only models) or explicit real and imaginary components $(Y_r, Y_i)$ (for complex models).
Common architectural choices:
- U-Net: Six-layer encoder–decoder with skip connections and real-valued convolutions. The input is fed as two real channels (real/imaginary), and the network outputs two real-valued masks that are recombined to form $\hat{M} = \hat{M}_r + j\hat{M}_i$ (Jansson et al., 2021).
- Complex U-Net/BLSTM: All convolutions and batch norms generalized to the complex domain; complex arithmetic (using paired real convolutions), complex activations (Gu et al., 2021).
- Complex-valued LSTM (RCLSTM): Sequential models processing complex-valued inputs using paired real LSTMs whose outputs are recombined following the rule of complex multiplication, $(A + jB)(x_r + jx_i) = (Ax_r - Bx_i) + j(Bx_r + Ax_i)$ (Goswami et al., 2020).
Architectures may incorporate spatial features (e.g., inter-channel phase differences, directional features) for multi-channel scenarios, stacking these as complex-valued feature tensors (Gu et al., 2021).
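The paired-real recombination that underlies both complex convolutions and complex LSTMs can be illustrated with a single dense layer. A minimal sketch, assuming generic weight matrices rather than any specific architecture from the cited papers:

```python
import numpy as np

def complex_linear(x_real, x_imag, W_real, W_imag):
    """One complex-valued layer realized with two real weight matrices,
    following (W_r + jW_i)(x_r + jx_i) = (W_r x_r - W_i x_i) + j(W_i x_r + W_r x_i).
    The same recombination pattern is used for complex convolutions,
    where the matrix products become convolutions over real channel pairs."""
    y_real = x_real @ W_real.T - x_imag @ W_imag.T
    y_imag = x_real @ W_imag.T + x_imag @ W_real.T
    return y_real, y_imag
```

The design point is that standard real-valued toolkits suffice: two real sub-layers wired according to complex arithmetic reproduce a genuinely complex transformation.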
3. Supervised Learning Objectives
CRM estimation is driven by composite losses reflecting both spectral and time-domain criteria:
- Spectral Losses: A magnitude loss $\lVert\,|\hat{S}| - |S|\,\rVert$ penalizes deviations in estimated versus reference magnitudes (Jansson et al., 2021). Fully complex MSE losses, $\lVert \hat{S} - S \rVert^2$, can be used to jointly optimize real and imaginary components (Jansson et al., 2021, Goswami et al., 2020).
- Time-domain Losses: Scale-invariant source-to-distortion ratio (SI-SDR) is a common criterion, equivalent to maximizing the cosine similarity between estimated and reference time-domain signals recovered via inverse STFT (Jansson et al., 2021, Gu et al., 2021):

$$\mathcal{L}_{\text{SI-SDR}} = -10\log_{10}\frac{\lVert\alpha s\rVert^2}{\lVert\hat{s} - \alpha s\rVert^2}, \qquad \alpha = \frac{\langle \hat{s}, s\rangle}{\lVert s\rVert^2}.$$
- Hybrid Losses: Networks may be trained with a sum of spectral and SI-SDR losses for best convergence and generalization (Jansson et al., 2021).
- Consistent Spectrogram Masking: Losses can be computed after inverse STFT to enforce that the estimated spectrogram corresponds to an actual time-domain signal, reducing artifacts and improving convergence (Du et al., 2019).
Bounding the CRM's real and imaginary parts by $\tanh$ nonlinearities mitigates instability due to small mixture magnitudes $|Y|$ in the denominator of the ideal mask (Goswami et al., 2020).
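The SI-SDR criterion above admits a short, self-contained implementation; a NumPy sketch with an illustrative function name:

```python
import numpy as np

def si_sdr_loss(s_hat, s, eps=1e-8):
    """Negative SI-SDR between a time-domain estimate and reference.
    Projecting the reference onto the estimate via the optimal alpha
    makes the loss invariant to the overall gain of the estimate."""
    alpha = np.dot(s_hat, s) / (np.dot(s, s) + eps)
    target = alpha * s          # optimally scaled reference
    residual = s_hat - target   # everything not explained by the reference
    return -10.0 * np.log10(
        np.dot(target, target) / (np.dot(residual, residual) + eps) + eps
    )
```

A perfect (or merely rescaled) estimate drives the loss strongly negative, while additive error raises it, which is the behavior gradient descent exploits when this term is combined with spectral losses.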
4. Advantages Over Magnitude-Only Masking
Magnitude masking estimates only a real-valued gain per T–F bin and reuses the input mixture phase, leading to inherent phase artifacts, notably "phase smearing" and musical noise. In contrast, CRMs enable:
- Phase Correction: Rotates the input phase to match the target, reducing residual artifacts (Jansson et al., 2021).
- Interference Cancellation: Imaginary component of the CRM can cancel interference in phase, impossible for real-valued masks (Jansson et al., 2021).
- Sharper Transients: Improved reconstruction of percussive onsets when compared to magnitude-only separation (Jansson et al., 2021).
Quantitative gains are consistent across tasks and datasets:
- SDR improvements of up to 1.1 dB on bass and drums, and 0.3–0.7 dB for vocals and "other" (Jansson et al., 2021).
- Perceptual gains in PESQ (e.g., RCLSTM achieves PESQ = 2.62, a 4.3% improvement over real-valued counterparts) (Goswami et al., 2020).
- In multi-channel separation, SI-SDR gains of 1.1 dB (12.1% relative) and up to 33.1% relative reduction in word error rate versus baseline real-mask separation (Gu et al., 2021).
Subjective evaluation confirms perceptible reductions in artifacts and increased preference for CRM-based outputs (Jansson et al., 2021).
5. Extensions and Consistency Constraints
Application of an arbitrary complex mask can move a spectrogram out of the consistent subspace, yielding artifacts through STFT/ISTFT inconsistency (Du et al., 2019). Consistent Spectrogram Masking (CSM) addresses this by defining the loss over the reconstructed waveform after ISTFT, guaranteeing consistency and bringing two benefits:
- Reduced solution space (faster training convergence).
- Fewer artifacts due to frame misalignment.
CSM yields faster training (10–20% fewer epochs) and measurable PESQ/SNR improvements over standard CRM masking (Du et al., 2019).
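The CSM idea reduces to computing the loss after an ISTFT round trip, so the training target is always a spectrogram that corresponds to a real waveform. A minimal sketch using SciPy's STFT/ISTFT pair (the function name and framing parameters are assumptions, not the exact configuration of Du et al.):

```python
import numpy as np
from scipy.signal import stft, istft

def csm_loss(mask, noisy, clean, fs=16000, nperseg=512):
    """Consistent Spectrogram Masking loss sketch: apply a complex mask
    in the STFT domain, reconstruct a waveform via ISTFT (which projects
    the possibly inconsistent masked spectrogram onto the consistent
    subspace), and compute the loss on the time-domain signal."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    S_hat = mask * Y                                  # masked spectrogram
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)   # back to a real waveform
    n = min(len(s_hat), len(clean))
    return np.mean((s_hat[:n] - clean[:n]) ** 2)      # time-domain MSE
```

Because the loss is evaluated on the reconstructed waveform, any mask whose output spectrogram cannot be realized by a time-domain signal is penalized automatically, which is the mechanism behind the reduced solution space noted above.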
In multi-channel scenarios, estimated CRMs can be integrated with Minimum Variance Distortionless Response (MVDR) beamforming. Here, the CRM output informs spatial correlation matrix estimation and spatial filtering, further boosting SI-SDR and reducing recognition errors in ASR tasks (Gu et al., 2021).
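One common way to wire a mask into MVDR, sketched in NumPy under stated assumptions (per-frequency covariances from mask-weighted outer products, steering vector from the principal eigenvector of the speech covariance; this is a generic recipe, not the exact estimator of Gu et al.):

```python
import numpy as np

def mvdr_weights(Y, speech_mask, ref_mic=0, eps=1e-6):
    """Mask-informed MVDR beamformer.

    Y: (channels, freq, frames) complex multi-channel STFT.
    speech_mask: (freq, frames) values in [0, 1]; 1 - mask weights the noise.
    Returns per-frequency weight vectors W of shape (freq, channels).
    """
    C, F, T = Y.shape
    W = np.zeros((F, C), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]                               # (C, T)
        m_s = speech_mask[f]
        m_n = 1.0 - m_s
        # Mask-weighted spatial covariance estimates.
        R_s = (m_s * Yf) @ Yf.conj().T / (m_s.sum() + eps)
        R_n = (m_n * Yf) @ Yf.conj().T / (m_n.sum() + eps)
        R_n += eps * np.eye(C)                        # diagonal loading
        # Steering vector: principal eigenvector of the speech covariance.
        _, vecs = np.linalg.eigh(R_s)
        h = vecs[:, -1]
        h = h * np.exp(-1j * np.angle(h[ref_mic]))    # fix reference phase
        # MVDR: w = R_n^{-1} h / (h^H R_n^{-1} h).
        num = np.linalg.solve(R_n, h)
        W[f] = num / (h.conj() @ num + eps)
    return W

def apply_beamformer(W, Y):
    """S_hat[f, t] = w_f^H Y[:, f, t]."""
    return np.einsum('fc,cft->ft', W.conj(), Y)
```

The mask thus enters only through the covariance estimates; the beamformer itself remains a linear, distortionless spatial filter, which is why it can further improve SI-SDR and ASR robustness on top of the mask output.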
6. Experimental Benchmarks
Empirical evaluations corroborate the superiority of CRM approaches:
| Task/Dataset | Metric | Magnitude Mask | CRM (Best) |
|---|---|---|---|
| MUSDB18 (vocals) | SDR (dB) | 3.6 | 3.9 |
| MUSDB18 (bass) | SDR (dB) | 6.0 | 6.4 |
| MUSDB18 (drums) | SDR (dB) | 4.6 | 5.7 |
| VoiceBank+DEMAND | PESQ | 2.51–2.55 | 2.62 |
| Sim. multi-ch (speech) | SI-SDR (dB) | 10.6–11.3 | 12.0* |
| Sim. multi-ch (speech) | WER (%) | 25.96–18.04 | 17.03* |
*With MVDR post-filter integration. Source: (Jansson et al., 2021, Gu et al., 2021, Goswami et al., 2020).
Largest CRM gains are observed for low-frequency sources or high-SNR regimes, and in settings with significant phase-related interference.
7. Limitations and Future Directions
Although CRMs address key deficiencies of magnitude-only masking, challenges remain:
- Mask inconsistency can introduce artifacts if not constrained; CSM approaches offer a promising mitigation (Du et al., 2019).
- CRM estimation is more demanding computationally and architecturally, motivating research into efficient complex-valued networks and deeper integration of complex arithmetic (Gu et al., 2021, Goswami et al., 2020).
- Further advances are anticipated in fully complex-gated recurrent architectures, application to direct time-domain enhancement, and extension to more general spatial scenarios.
A plausible implication is that CRMs, combined with consistency constraints and complex architectures, will define new baselines for both single- and multi-channel audio enhancement and separation tasks.
References:
- (Jansson et al., 2021) Learned complex masks for multi-instrument source separation
- (Goswami et al., 2020) Phase Aware Speech Enhancement using Realisation of Complex-valued LSTM
- (Du et al., 2019) End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking
- (Gu et al., 2021) Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain