Distil-DCCRN: Efficient Speech Enhancement

Updated 1 November 2025
  • The paper introduces Distil-DCCRN, a compact speech enhancement model that reduces parameters by ~70% compared to DCCRN while preserving key performance metrics.
  • It employs the AT-KL framework, using attention transfer and KL divergence to robustly align intermediate features between mismatched teacher and student architectures.
  • The model achieves reduced computational cost and latency, making it suitable for embedded, real-time applications without sacrificing audio enhancement quality.

Distil-DCCRN is a small-footprint speech enhancement model designed to match or exceed the enhancement quality of the Deep Complex Convolution Recurrent Network (DCCRN) while substantially reducing model complexity. Its core contribution is a feature-centric knowledge distillation framework that enables a highly compressed student network to inherit the signal modeling capacity of a much larger and architecturally distinct teacher. Distil-DCCRN leverages both attention transfer and Kullback-Leibler divergence on intermediate features, enabling transfer across different model families and feature dimensionalities.

1. Architectural Design and Parameter Reduction

Distil-DCCRN is derived from the DCCRN architecture, a complex-valued encoder-LSTM-decoder network for time-frequency speech enhancement. Unlike the original DCCRN, which uses wider encoder/decoder channels and a larger recurrent (LSTM) bottleneck, Distil-DCCRN undertakes aggressive parameter reduction:

  • Encoder: 6 convolutional layers, channel progression [16, 32, 64, 64, 128, 256].
  • LSTM bottleneck: Single layer, 64 hidden units.
  • Decoder: Mirrored channel configuration, reduced to match the student encoder.
  • STFT configuration: 25ms window, 6.25ms hop, causal operation.

This configuration reduces parameter count by approximately 70% versus baseline DCCRN (from 3.74M to 1.1M parameters) and achieves a 52% reduction in FLOPs (15.7G to 7.4G), yielding benefits in latency, device memory/runtime, and suitability for embedded low-power deployment.
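As a quick illustration, the snippet below is a hypothetical configuration sketch (names are not taken from the paper's code) that records the student hyperparameters listed above and checks the reported complexity reductions against the baseline figures.

```python
# Hypothetical configuration sketch for the Distil-DCCRN student, plus a
# sanity check of the reported parameter/FLOPs reductions; identifiers are
# illustrative and not taken from the paper's implementation.
from dataclasses import dataclass

@dataclass
class StudentConfig:
    enc_channels: tuple = (16, 32, 64, 64, 128, 256)  # encoder channel progression
    lstm_layers: int = 1                               # single LSTM bottleneck layer
    lstm_hidden: int = 64                              # 64 hidden units
    stft_win_ms: float = 25.0                          # 25 ms analysis window
    stft_hop_ms: float = 6.25                          # 6.25 ms hop
    causal: bool = True                                # causal (streaming) operation

baseline_params, student_params = 3.74e6, 1.10e6       # reported parameter counts
baseline_flops, student_flops = 15.7e9, 7.4e9          # reported FLOPs

print(f"parameter reduction: {1 - student_params / baseline_params:.1%}")  # ~70.6%
print(f"FLOPs reduction:     {1 - student_flops / baseline_flops:.1%}")    # ~52.9%
```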

2. Teacher Model and Knowledge Distillation Framework

Unlike conventional distillation, in which student and teacher share the same architecture, Distil-DCCRN uses Uformer, an attention-augmented U-Net with 8.82M parameters and a distinct attention-heavy time-frequency design, as its teacher. Uformer operates on STFT features (25ms window, 10ms hop, encoder channel progression [16, 32, 64, 128, 256, 256]), introducing mismatches along both the time and channel axes between teacher and student intermediate representations.

The knowledge distillation approach in Distil-DCCRN is designed to handle these mismatches through feature-based methods rather than relying solely on output logits.
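To make the time-axis mismatch concrete, the short calculation below compares frame counts for the teacher and student STFT settings on a one-second clip. It assumes 16 kHz audio (the DNS Challenge sampling rate) and no padding; exact frame counts depend on the framing convention used.

```python
# Illustrative frame-count calculation (assumes 16 kHz audio, no padding):
# the teacher (Uformer, 10 ms hop) and the student (Distil-DCCRN, 6.25 ms hop)
# produce different numbers of STFT frames for the same clip, so intermediate
# features cannot simply be matched element-wise.
sr = 16_000
n_samples = sr  # one second of audio

def n_frames(win_ms: float, hop_ms: float) -> int:
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    return 1 + (n_samples - win) // hop

print("teacher frames:", n_frames(25.0, 10.0))   # -> 98
print("student frames:", n_frames(25.0, 6.25))   # -> 157
```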

3. AT-KL: Attention Transfer and Kullback-Leibler Divergence-Based Distillation

The central innovation is the AT-KL framework: a process for intermediate feature-based knowledge distillation that is robust to structural heterogeneity and resolution discrepancies between student and teacher.

3.1. Attention Transfer (AT)

Intermediate features from candidate encoder/decoder layers ($X^T$ for the teacher, $X^S$ for the student) are reduced along the time and channel axes using the following mappings:

  • Time attention:

$$F_t(X^T) = \sum_{i=1}^{T} \left| X^T_i \right|^{\lambda}, \quad Y^T = \frac{F_t(X^T)}{\|F_t(X^T)\|_2}$$

with $\lambda = 2$ and $X^T_i$ denoting the activation at time index $i$.

  • Channel attention:

$$F_n(Y^T) = \sum_{j=1}^{N} \left| Y^T_j \right|^{\lambda}, \quad Z^T = \frac{F_n(Y^T)}{\|F_n(Y^T)\|_2}$$

These attention maps are computed for both teacher and student; $\mathcal{L}_{AT}$ denotes the $L_\lambda$-norm distance between corresponding attention vectors:

$$\mathcal{L}_{AT} = \sum_{i=1}^{T} \left\| Y^T_i - Y^S_i \right\|_\lambda + \sum_{j=1}^{N} \left\| Z^T_j - Z^S_j \right\|_\lambda$$
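A minimal PyTorch sketch of these attention terms is given below. It assumes intermediate features are real-valued tensors of shape (batch, channel, time, frequency) and that, at the layers where AT-KL is applied, the reduced teacher and student maps end up with compatible shapes; the function names, the small epsilon for numerical stability, and the exact normalization details are assumptions, not the authors' code.

```python
# Sketch of the attention-transfer (AT) terms; assumes features shaped
# (batch, channel, time, frequency) and compatible reduced shapes between
# teacher and student at the distilled layers.
import torch

LAMBDA = 2  # exponent λ in the attention mappings

def time_attention(x: torch.Tensor) -> torch.Tensor:
    """Y: sum |x|^λ over the time axis, then L2-normalize (eps for stability)."""
    y = x.abs().pow(LAMBDA).sum(dim=2)                               # (B, C, F)
    return y / (y.flatten(1).norm(p=2, dim=1).view(-1, 1, 1) + 1e-8)

def channel_attention(y: torch.Tensor) -> torch.Tensor:
    """Z: sum |y|^λ over the channel axis of Y, then L2-normalize."""
    z = y.abs().pow(LAMBDA).sum(dim=1)                               # (B, F)
    return z / (z.norm(p=2, dim=1, keepdim=True) + 1e-8)

def at_loss(feat_teacher: torch.Tensor, feat_student: torch.Tensor) -> torch.Tensor:
    """λ-norm distance between teacher and student attention maps (L_AT)."""
    y_t, y_s = time_attention(feat_teacher), time_attention(feat_student)
    z_t, z_s = channel_attention(y_t), channel_attention(y_s)
    return (y_t - y_s).norm(p=LAMBDA) + (z_t - z_s).norm(p=LAMBDA)
```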

3.2. KL Divergence on Distributions

Attention maps are further normalized via softmax to obtain probability distributions. The KL divergence $\mathcal{L}_{AT-KL}$ matches these distributions:

$$\mathcal{L}_{AT-KL} = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{q_i}\right)$$

where $p_i$ and $q_i$ are the student and teacher distributions over frequency bins.
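The sketch below shows one way to implement this term with PyTorch primitives, treating the last axis of the attention maps as the frequency axis. As written above, the student distribution $p$ appears first in the divergence, so the teacher supplies the log-probabilities passed to `F.kl_div`; this is an interpretation of the formula, not the authors' code.

```python
# Sketch of the AT-KL term: softmax over frequency bins turns the attention
# maps into distributions, which are compared with KL(p || q), where p is the
# student distribution and q the teacher distribution, as in the formula above.
import torch
import torch.nn.functional as F

def at_kl_loss(att_student: torch.Tensor, att_teacher: torch.Tensor) -> torch.Tensor:
    p = F.softmax(att_student, dim=-1)            # student distribution p
    log_q = F.log_softmax(att_teacher, dim=-1)    # teacher log-distribution log q
    # F.kl_div(input=log_q, target=p) returns sum p * (log p - log q) = KL(p || q).
    return F.kl_div(log_q, p, reduction="batchmean")
```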

3.3. Combined Knowledge Distillation Loss

Student model training thus minimizes a composite loss:

$$\mathcal{L}_{KD} = \beta \mathcal{L}_{\mathrm{SI-SNR}} + \gamma \mathcal{L}_{AT} + \eta \mathcal{L}_{AT-KL}$$

with weighting coefficients $\beta = 1$, $\gamma = 1$, $\eta = 60$, and $\mathcal{L}_{\mathrm{SI-SNR}}$ a convex combination of hard (ground-truth) and soft (teacher output) SI-SNR objectives with $\alpha = 0.5$:

$$\mathcal{L}_{\mathrm{SI-SNR}} = \alpha\, \mathcal{L}_{\mathrm{SI-SNR(hard)}} + (1-\alpha)\, \mathcal{L}_{\mathrm{SI-SNR(soft)}}$$

The AT-KL framework is applied layer-wise at encoder/decoder stages, effectively transferring both feature magnitudes and their underlying statistical distributions irrespective of time/channel dimensional discrepancies.
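Putting the pieces together, the sketch below combines the reported weights with a standard SI-SNR implementation; `kd_loss` takes the already-accumulated feature-level terms (the AT and AT-KL losses summed over the selected layers, as sketched above). The negative SI-SNR is used so that minimizing the loss maximizes SI-SNR; this sign convention and the helper names are assumptions, not the authors' code.

```python
# Sketch of the composite distillation objective with the reported weights
# (beta=1, gamma=1, eta=60, alpha=0.5). si_snr() is a standard SI-SNR helper;
# l_at and l_at_kl are the layer-summed feature losses from the sketches above.
import torch

BETA, GAMMA, ETA, ALPHA = 1.0, 1.0, 60.0, 0.5

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, samples) waveforms, batch-averaged."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return (10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)).mean()

def kd_loss(student_out, teacher_out, clean, l_at, l_at_kl):
    # Hard target: clean speech; soft target: the frozen teacher's enhanced output.
    l_si_snr = -(ALPHA * si_snr(student_out, clean)
                 + (1 - ALPHA) * si_snr(student_out, teacher_out))
    return BETA * l_si_snr + GAMMA * l_at + ETA * l_at_kl
```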

4. Experimental Setup and Comparative Results

Distil-DCCRN and its teacher are trained on the DNS Challenge dataset using clean-noisy speech pairs without reverberation. Uformer weights are frozen during distillation. Performance is benchmarked against DCCRN and a range of lightweight and mid-sized baselines, including SubbandModel, FullSubnet, GRU-512, and ES-Gabor.

Summary Table: Model Complexity and Performance

Model          Params (M)  WB-PESQ  NB-PESQ  SI-SNR (dB)  DNSMOS
Uformer        8.82        3.02     3.50     19.09        3.41
DCCRN          3.74        2.74     3.26     17.71        3.22
SubbandModel   1.82        2.55     3.24     -            -
FullSubnet     5.60        2.78     3.30     -            -
GRU-512        3.41        2.46     -        -            -
ES-Gabor       1.43        2.83     -        -            -
Distil-DCCRN   1.10        2.80     3.31     17.80        3.21

Distil-DCCRN outperforms DCCRN on WB-PESQ (+0.06), NB-PESQ (+0.05), and SI-SNR (+0.09 dB), and is essentially on par on DNSMOS (3.21 vs. 3.22), despite a 70% parameter reduction. It is also competitive with the other lightweight baselines; ES-Gabor attains a slightly higher WB-PESQ (2.83) but uses more parameters (1.43M vs. 1.10M).

Ablation studies demonstrate that using both AT and KL distillation on all candidate layers yields optimal performance, while distillation only at output or using either AT or KL in isolation leads to lower enhancement scores.

5. Generalizability, Scalability, and Practical Implications

The AT-KL framework generalizes to any teacher-student pair with mismatched intermediate dimensions, making it broadly applicable beyond Uformer–DCCRN combinations. The causal, compact design of Distil-DCCRN is compatible with low-latency and embedded/real-time scenarios (e.g., hearing aids, mobile devices), with benefits amplified by the more than 50% reduction in FLOPs and memory footprint.

The architecture supports deployment in resource-constrained environments without sacrificing performance relative to the original DCCRN, and the distillation design is agnostic to the specific representation of intermediate features (i.e., it is not restricted by channel or time dimension alignment).

6. Significance and Context within Speech Enhancement

Distil-DCCRN represents a distinct advance in small-footprint speech enhancement by demonstrating how aggressive model compression using attention-based intermediate-feature distillation can yield student models that are both parameter- and computation-efficient, while achieving or surpassing the enhancement performance of the full-size teacher. This is critical as state-of-the-art speech enhancement models regularly exceed practical device limits in both size and computational complexity.

A plausible implication is that the AT-KL approach could be adapted to other enhancement architectures, particularly where structural or feature resolution mismatches have previously hindered knowledge distillation. The adoption of attention feature mapping and probability distribution matching sidesteps the need for direct one-to-one layer correspondence or uniform tensor shapes, which has traditionally limited the flexibility of feature-based distillation.

7. Methodological and Performance Summary

  • Distil-DCCRN achieves efficient, causal, high-quality speech enhancement by leveraging feature and distributional knowledge distilled from a superior, more complex teacher, despite architectural heterogeneity.
  • AT-KL distillation facilitates robust knowledge transfer on both compressed feature spaces and their distributions.
  • Empirical results indicate superior or on-par perceptual and signal fidelity (PESQ, SI-SNR, DNSMOS) as compared with larger or more computationally expensive baselines.
  • Applicability: The framework is broadly generalizable and suited for embedded, real-time speech processing tasks.

In conclusion, Distil-DCCRN exemplifies a high-performance, compression-friendly speech enhancement architecture enabled by robust, mismatched-structure intermediate feature distillation, and establishes a foundation for further research on principled, feature-based student-teacher transfer in speech and signal processing.
