- The paper introduces a deep-learning-based ANC method that preserves speech while canceling noise in reverberant settings, using a CRN architecture.
- It employs complex spectral mapping and a tailored supervised loss to ensure phase-consistent, high-fidelity speech retention in diverse noise scenarios.
- Experimental results demonstrate noise-reduction gains of up to +14.1 dB over traditional FxLMS methods, along with improved PESQ/STOI scores.
Speech-Preserving Active Noise Control via Deep Learning in Reverberant Environments
Introduction
The presented research systematically addresses the intrinsic limitations of traditional adaptive Active Noise Control (ANC) systems, most notably the linearity assumptions and lack of selectivity of FxLMS-based frameworks, by introducing a fully speech-preserving, deep-learning-based ANC solution built on a Convolutional Recurrent Network (CRN) architecture. The principal innovation is robust selective noise cancellation that maintains the fidelity and intelligibility of target speech in complex, reverberant environments modeled via physical room impulse responses (RIRs).
Limitations of Classical ANC and Motivations
Conventional ANC relies heavily on adaptive linear algorithms, with FxLMS serving as the de facto industry standard for decades. Fundamental drawbacks include poor tracking of broadband, non-stationary, or nonlinear noise due to explicit linearity assumptions, and a single-objective design that cancels all input signals indiscriminately. In scenarios ubiquitous in industrial, automotive, and personal audio settings, this leads to degradation of critical speech signals. The demand for a new ANC paradigm, capable of nonlinear acoustic-path modeling and task-driven selectivity, is thus acute.
Deep Learning for ANC: Architectural and Methodological Advances
Deep neural architectures, particularly those integrating CNN and RNN mechanisms, have demonstrated superior representational capacity for both spectral and temporal audio dependencies. This work adopts a CRN core, leveraging causal convolution in the encoder for local spectral feature extraction and stacked LSTM layers for temporal context modeling at the bottleneck. The decoder reconstructs the target complex spectrum with phase-aware skip connections, minimizing both perceptual and physical reconstruction error. This ensures causal inference for real-time deployment.
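The causality constraint underlying real-time inference can be illustrated with a minimal sketch (values and kernel weights are illustrative, not from the paper): a causal 1-D convolution left-pads its input so each output frame depends only on current and past frames, never future ones.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[0..t].
    Achieved by left-padding the input with kernel_size - 1 zeros."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # y[t] = sum_j kernel[j] * x[t - j], with x[<0] treated as zero.
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.25])   # hypothetical filter taps
y = causal_conv1d(x, kernel)     # [0.5, 1.25, 2.0, 2.75]
```

Changing a future input sample leaves all earlier outputs untouched, which is the property that lets the encoder run frame by frame in deployment.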
Complex Spectral Mapping (CSM) constitutes another critical advancement—joint estimation of real and imaginary spectral components (via STFT domain operation) enables fine-grained phase and amplitude modeling necessary for effective physical cancellation, as opposed to magnitude-only approaches typical of conventional SE pipelines.
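The difference from magnitude-only enhancement can be seen on a single toy frame (illustrative values, not the paper's data): joint real/imaginary targets preserve phase exactly, while a magnitude-only target cannot reconstruct the waveform a physical anti-noise signal must match.

```python
import numpy as np

# Hypothetical single STFT frame of a signal.
frame = np.array([0.0, 1.0, 0.0, -1.0])
spec = np.fft.rfft(frame)                   # complex spectrum

# CSM trains on real and imaginary parts jointly, so phase is explicit:
target = np.stack([spec.real, spec.imag])

# A magnitude-only pipeline discards phase:
mag_only = np.abs(spec)

# Reconstructing from magnitude alone (zero phase) is not the original frame:
recon = np.fft.irfft(mag_only, n=len(frame))
```

The complex target inverts back to the exact waveform; the magnitude-only reconstruction does not, which is why phase-blind approaches fail at superposition-based cancellation.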
The key contributor to speech selectivity is a supervised loss function formulated to enforce acoustic transparency: the ANC system is explicitly penalized for any residual error that deviates from the desired speech component at the error microphone, after propagation through the modeled physical paths. Formally, this is the mean squared error between the actual error-microphone signal and the pure, reference-propagated target speech. As a result, the network learns a mapping that phase-cancels only the background noise while faithfully reconstructing the target speech. The secondary path is embedded as a nontrainable convolution in the backpropagation framework.
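A minimal numpy sketch of this loss, with hypothetical 1-D signals and hand-picked path coefficients standing in for the measured primary and secondary paths:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative signals and paths (all values hypothetical).
speech = rng.standard_normal(256)
noise = rng.standard_normal(256)
primary_path = np.array([1.0, 0.4, 0.1])   # source -> error mic
secondary_path = np.array([0.8, 0.3])      # loudspeaker -> error mic (fixed, nontrainable)

def propagate(sig, path):
    """Propagate a signal through an impulse-response path (truncated)."""
    return np.convolve(sig, path)[: len(sig)]

# Signal at the error microphone: propagated mixture plus propagated anti-noise.
mixture_at_mic = propagate(speech + noise, primary_path)
anti_noise = -noise                        # stand-in for the network's output
error_mic = mixture_at_mic + propagate(anti_noise, secondary_path)

# Training target: the pure speech as it would arrive at the error mic.
speech_at_mic = propagate(speech, primary_path)

# Speech-preserving loss: penalize any residual that is not the target speech.
loss = np.mean((error_mic - speech_at_mic) ** 2)
```

With `anti_noise = 0` the residual reduces, by linearity of convolution, to the noise propagated through the primary path, so minimizing this loss drives the network to cancel exactly that component and nothing else.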
High-Fidelity Acoustic Simulation
To address the gap between simulated and operational realities, the work adopts the Image Source Method (ISM) for generating high-fidelity RIRs, simulating moderate room reverberation (RT60 = 0.3 s) in a single-channel feedforward geometry. This allows for the faithful evaluation of the model under physically plausible reverberant and multipath conditions.
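Where a full ISM implementation is unavailable, a common crude stand-in (an assumption here, not the paper's method) is exponentially decaying white noise whose envelope falls by 60 dB in amplitude over RT60 seconds:

```python
import numpy as np

def synthetic_rir(rt60=0.3, fs=16000, seed=0):
    """Crude stand-in for an ISM-generated RIR: white noise shaped by an
    exponential envelope that decays 60 dB (factor 1e-3) at t = rt60."""
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    envelope = np.exp(-t * (3 * np.log(10)) / rt60)  # exp(-3 ln 10) = 1e-3
    rng = np.random.default_rng(seed)
    return envelope * rng.standard_normal(n)

rir = synthetic_rir()   # 0.3 s tail at 16 kHz -> 4800 taps
```

The ISM used in the paper additionally models discrete early reflections from image sources, which matters for multipath evaluation; this sketch only reproduces the diffuse decay statistics.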
Experimental Evaluation: Quantitative and Qualitative Advances
Extensive benchmarking against FxLMS, under identical room acoustics and strictly controlled SNR conditions, is conducted over a diverse set of noise backgrounds, including stationary (Volvo, Engine), non-stationary (Babble, Factory1), and broadband (F16) noise. Deep ANC exhibits substantial gains over FxLMS in Noise Reduction (NR), with improvements ranging from +8.6 dB to +14.1 dB. Notably, in the Babble condition (non-stationary and highly speech-overlapping), Deep ANC achieves a +12.9 dB NR advantage.
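The NR metric itself reduces to an energy ratio in dB; a sketch with a synthetic tone (illustrative numbers, not the paper's results):

```python
import numpy as np

def noise_reduction_db(noise_at_mic, residual_noise):
    """NR in dB: noise energy at the error mic without ANC,
    relative to the residual noise energy with ANC active."""
    return 10 * np.log10(np.sum(noise_at_mic ** 2)
                         / np.sum(residual_noise ** 2))

noise = np.sin(np.linspace(0, 20 * np.pi, 1000))
residual = 0.1 * noise       # ANC removed 90% of the amplitude
nr = noise_reduction_db(noise, residual)   # 20 dB
```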
Perceptually, speech quality and intelligibility are robustly preserved, as quantified by elevated PESQ (up to +0.686) and STOI (up to +0.101) improvements across all backgrounds; this is achieved without introducing notable speech distortion even under low SNR mixtures. Qualitative spectrogram and waveform analyses confirm that the system executes selective suppression: fundamental and harmonic structures of noise are cancelled, while speech formants and temporal envelopes remain unaffected.
Macro-level robustness is demonstrated through uniform high NR capability (>17 dB) across all tested noise classes and immediate steady-state convergence in non-stationary contexts, far outpacing the adaptation lag intrinsic to online linear filters.
Fundamental Mechanistic Insights
Two core mechanisms undergird the observed efficacy:
- Temporal and Nonlinear Modeling: LSTM-based temporal encoding provides rapid adaptation to fluctuating noise profiles, overcoming the slow, statistics-bound adaptation of linear filters.
- Phase-Consistent Complex Mapping: CSM ensures that anti-noise signals match the amplitude and phase required for superposition-based cancellation in the physical sound field, a constraint ignored by naive amplitude-domain SE approaches.
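The second point can be demonstrated directly on a toy sine (illustrative example, not the paper's signals): anti-noise with the correct amplitude but a 90-degree phase error leaves a residual stronger than the original noise, while exact-opposite-phase anti-noise cancels it completely.

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)
noise = np.sin(2 * np.pi * 5 * t)

# Perfect anti-noise: same amplitude, exactly opposite phase.
anti_exact = -noise
residual_exact = noise + anti_exact          # zero everywhere

# Same amplitude but a 90-degree phase error: superposition fails.
anti_shifted = -np.sin(2 * np.pi * 5 * t + np.pi / 2)
residual_shifted = noise + anti_shifted      # sqrt(2) times the noise amplitude
```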
Theoretical and Practical Implications
The proposed method establishes that high-capacity, physically-informed deep learning is sufficient to bridge the selectivity and adaptation deficits inherent to legacy ANC. The realization of robust, real-time, and highly speech-transparent ANC in reverberant, multipath-laden environments validates deep models as viable controllers in practical deployments.
For applications in communications, augmented reality, automotive, and industrial environments, this enables ANC systems that no longer trade speech comprehension for noise attenuation, representing a substantive improvement over legacy hardware and DSP-centric solutions.
Future Directions
The findings highlight three principal axes for future development:
- Low-Latency and Time-Domain Processing: Migrating to time-domain architectures (e.g., Wave-U-Net) could enable sub-frame latency, which is essential for ultra-low-delay scenarios.
- Multi-Channel and Spatially Selective Control: Extension to spatially aware, multi-channel CRN-based controllers will allow simultaneous spectral and directional selectivity.
- Adaptive and Hybrid Architectures: Integrating dynamic, on-device meta-learning and hybrid (deep learning + adaptive filter) controllers can provide adaptability to nonstationary acoustic environments and varying path transfer functions with reduced computational overhead.
Conclusion
This study demonstrates that end-to-end, speech-preserving Deep ANC with CRN and CSM architectures is empirically and theoretically superior to classical ANC across a spectrum of real-world, reverberant, and complex noise environments. The approach achieves robust, selective noise suppression, high-fidelity speech retention, and generalization to varied acoustic conditions, establishing a foundation for next-generation intelligent audio control systems.