- The paper introduces a deep-learning-based ANC method that preserves speech while canceling noise in reverberant settings, using a CRN architecture.
- It employs complex spectral mapping and a tailored supervised loss to ensure phase-consistent, high-fidelity speech retention in diverse noise scenarios.
- Experimental results demonstrate noise-reduction gains of up to +14.1 dB over traditional FxLMS methods, along with improved PESQ/STOI scores.
Speech-Preserving Active Noise Control via Deep Learning in Reverberant Environments
Introduction
The presented research systematically addresses the intrinsic limitations of traditional adaptive Active Noise Control (ANC) systems, most notably the linearity assumptions and lack of selectivity of FxLMS-based frameworks, by introducing a fully speech-preserving, deep-learning-based ANC solution built on a Convolutional Recurrent Network (CRN) architecture. The principal innovation is robust selective noise cancellation that maintains the fidelity and intelligibility of target speech in complex, reverberant environments modeled via physical room impulse responses (RIRs).
Limitations of Classical ANC and Motivations
Conventional ANC relies heavily on adaptive linear algorithms, with FxLMS serving as the de facto industry standard for decades. Fundamental drawbacks include poor tracking of broadband, non-stationary, or nonlinear noise due to explicit linearity assumptions, and a single-objective design that cancels all input signals indiscriminately. In scenarios ubiquitous in industrial, automotive, and personal audio settings, this leads to degradation of critical speech signals. The demand for a new ANC paradigm, capable of nonlinear acoustic-path modeling and task-driven selectivity, is thus acute.
Deep Learning for ANC: Architectural and Methodological Advances
Deep neural architectures, particularly those integrating CNN and RNN mechanisms, have demonstrated superior representational capacity for both spectral and temporal audio dependencies. This work adopts a CRN core, leveraging causal convolution in the encoder for local spectral feature extraction and stacked LSTM layers for temporal context modeling at the bottleneck. The decoder reconstructs the target complex spectrum with phase-aware skip connections, minimizing both perceptual and physical reconstruction error. This ensures causal inference for real-time deployment.
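The causality constraint underlying real-time inference can be illustrated with a minimal sketch (values and kernel weights are illustrative, not from the paper): a causal 1-D convolution left-pads its input so each output frame depends only on current and past frames, never future ones.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[0..t].
    Achieved by left-padding the input with kernel_size - 1 zeros."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # y[t] = sum_j kernel[j] * x[t - j], with x[<0] treated as zero.
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.25])   # hypothetical filter taps
y = causal_conv1d(x, kernel)     # [0.5, 1.25, 2.0, 2.75]
```

Changing a future input sample leaves all earlier outputs untouched, which is the property that lets the encoder run frame by frame in deployment.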
Complex Spectral Mapping (CSM) constitutes another critical advancement—joint estimation of real and imaginary spectral components (via STFT domain operation) enables fine-grained phase and amplitude modeling necessary for effective physical cancellation, as opposed to magnitude-only approaches typical of conventional SE pipelines.
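The difference from magnitude-only enhancement can be seen on a single toy frame (illustrative values, not the paper's data): joint real/imaginary targets preserve phase exactly, while a magnitude-only target cannot reconstruct the waveform a physical anti-noise signal must match.

```python
import numpy as np

# Hypothetical single STFT frame of a signal.
frame = np.array([0.0, 1.0, 0.0, -1.0])
spec = np.fft.rfft(frame)                   # complex spectrum

# CSM trains on real and imaginary parts jointly, so phase is explicit:
target = np.stack([spec.real, spec.imag])

# A magnitude-only pipeline discards phase:
mag_only = np.abs(spec)

# Reconstructing from magnitude alone (zero phase) is not the original frame:
recon = np.fft.irfft(mag_only, n=len(frame))
```

The complex target inverts back to the exact waveform; the magnitude-only reconstruction does not, which is why phase-blind approaches fail at superposition-based cancellation.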
The key contributor to speech selectivity is a supervised loss function formulated to enforce acoustic transparency: the ANC system is explicitly penalized for any residual error that deviates from the desired speech component at the error microphone, after propagation through the modeled physical paths. Formally, this is the mean squared error between the actual error-microphone signal and the pure, reference-propagated target speech. As a result, the network learns a mapping that phase-cancels only the background noise while faithfully reconstructing the target speech. The secondary path is embedded as a nontrainable convolution in the backpropagation framework.
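A minimal numpy sketch of this loss, with hypothetical 1-D signals and hand-picked path coefficients standing in for the measured primary and secondary paths:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative signals and paths (all values hypothetical).
speech = rng.standard_normal(256)
noise = rng.standard_normal(256)
primary_path = np.array([1.0, 0.4, 0.1])   # source -> error mic
secondary_path = np.array([0.8, 0.3])      # loudspeaker -> error mic (fixed, nontrainable)

def propagate(sig, path):
    """Propagate a signal through an impulse-response path (truncated)."""
    return np.convolve(sig, path)[: len(sig)]

# Signal at the error microphone: propagated mixture plus propagated anti-noise.
mixture_at_mic = propagate(speech + noise, primary_path)
anti_noise = -noise                        # stand-in for the network's output
error_mic = mixture_at_mic + propagate(anti_noise, secondary_path)

# Training target: the pure speech as it would arrive at the error mic.
speech_at_mic = propagate(speech, primary_path)

# Speech-preserving loss: penalize any residual that is not the target speech.
loss = np.mean((error_mic - speech_at_mic) ** 2)
```

With `anti_noise = 0` the residual reduces, by linearity of convolution, to the noise propagated through the primary path, so minimizing this loss drives the network to cancel exactly that component and nothing else.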
High-Fidelity Acoustic Simulation
To address the gap between simulated and operational realities, the work adopts the Image Source Method (ISM) for generating high-fidelity RIRs, simulating moderate room reverberation (RT60 = 0.3 s) in a single-channel feedforward geometry. This allows for the faithful evaluation of the model under physically plausible reverberant and multipath conditions.
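Where a full ISM implementation is unavailable, a common crude stand-in (an assumption here, not the paper's method) is exponentially decaying white noise whose envelope falls by 60 dB in amplitude over RT60 seconds:

```python
import numpy as np

def synthetic_rir(rt60=0.3, fs=16000, seed=0):
    """Crude stand-in for an ISM-generated RIR: white noise shaped by an
    exponential envelope that decays 60 dB (factor 1e-3) at t = rt60."""
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    envelope = np.exp(-t * (3 * np.log(10)) / rt60)  # exp(-3 ln 10) = 1e-3
    rng = np.random.default_rng(seed)
    return envelope * rng.standard_normal(n)

rir = synthetic_rir()   # 0.3 s tail at 16 kHz -> 4800 taps
```

The ISM used in the paper additionally models discrete early reflections from image sources, which matters for multipath evaluation; this sketch only reproduces the diffuse decay statistics.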
Experimental Evaluation: Quantitative and Qualitative Advances
Extensive benchmarking against FxLMS, under identical room acoustics and strictly controlled SNR conditions, is conducted over a diverse set of noise backgrounds, including stationary (Volvo, Engine), non-stationary (Babble, Factory1), and broadband (F16) noise. Deep ANC exhibits substantial gains over FxLMS in Noise Reduction (NR), with improvements ranging from +8.6 dB to +14.1 dB. Notably, in the Babble condition (non-stationary and highly speech-overlapping), Deep ANC achieves a +12.9 dB NR advantage.
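The NR metric itself reduces to an energy ratio in dB; a sketch with a synthetic tone (illustrative numbers, not the paper's results):

```python
import numpy as np

def noise_reduction_db(noise_at_mic, residual_noise):
    """NR in dB: noise energy at the error mic without ANC,
    relative to the residual noise energy with ANC active."""
    return 10 * np.log10(np.sum(noise_at_mic ** 2)
                         / np.sum(residual_noise ** 2))

noise = np.sin(np.linspace(0, 20 * np.pi, 1000))
residual = 0.1 * noise       # ANC removed 90% of the amplitude
nr = noise_reduction_db(noise, residual)   # 20 dB
```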
Perceptually, speech quality and intelligibility are robustly preserved, as quantified by elevated PESQ (up to +0.686) and STOI (up to +0.101) improvements across all backgrounds; this is achieved without introducing notable speech distortion even under low SNR mixtures. Qualitative spectrogram and waveform analyses confirm that the system executes selective suppression: fundamental and harmonic structures of noise are cancelled, while speech formants and temporal envelopes remain unaffected.
Macro-level robustness is demonstrated through uniform high NR capability (>17 dB) across all tested noise classes and immediate steady-state convergence in non-stationary contexts, far outpacing the adaptation lag intrinsic to online linear filters.
Fundamental Mechanistic Insights
Two core mechanisms undergird the observed efficacy:
- Temporal and Nonlinear Modeling: LSTM-based temporal encoding provides rapid adaptation to fluctuating noise profiles, overcoming the slow, statistics-bound adaptation of linear filters.
- Phase-Consistent Complex Mapping: CSM ensures that anti-noise signals match the amplitude and phase required for superposition-based cancellation in the physical sound field, a constraint ignored by naive amplitude-domain SE approaches.
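The second point can be demonstrated directly on a toy sine (illustrative example, not the paper's signals): anti-noise with the correct amplitude but a 90-degree phase error leaves a residual stronger than the original noise, while exact-opposite-phase anti-noise cancels it completely.

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)
noise = np.sin(2 * np.pi * 5 * t)

# Perfect anti-noise: same amplitude, exactly opposite phase.
anti_exact = -noise
residual_exact = noise + anti_exact          # zero everywhere

# Same amplitude but a 90-degree phase error: superposition fails.
anti_shifted = -np.sin(2 * np.pi * 5 * t + np.pi / 2)
residual_shifted = noise + anti_shifted      # sqrt(2) times the noise amplitude
```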
Theoretical and Practical Implications
The proposed method establishes that high-capacity, physically-informed deep learning is sufficient to bridge the selectivity and adaptation deficits inherent to legacy ANC. The realization of robust, real-time, and highly speech-transparent ANC in reverberant, multipath-laden environments validates deep models as viable controllers in practical deployments.
For applications in communications, augmented reality, automotive, and industrial environments, this enables ANC systems that no longer trade speech comprehension for noise attenuation, representing a substantive improvement over legacy hardware and DSP-centric solutions.
Future Directions
The findings highlight three principal axes for future development:
- Low-Latency and Time-Domain Processing: Migrating to time-domain architectures (e.g., Wave-U-Net) could enable sub-frame latency, which is essential for ultra-low-delay scenarios.
- Multi-Channel and Spatially Selective Control: Extension to spatially aware, multi-channel CRN-based controllers will allow simultaneous spectral and directional selectivity.
- Adaptive and Hybrid Architectures: Integrating dynamic, on-device meta-learning and hybrid (deep learning + adaptive filter) controllers can provide adaptability to nonstationary acoustic environments and varying path transfer functions with reduced computational overhead.
Conclusion
This study demonstrates that end-to-end, speech-preserving Deep ANC with CRN and CSM architectures is empirically and theoretically superior to classical ANC across a spectrum of real-world, reverberant, and complex noise environments. The approach achieves robust, selective noise suppression, high-fidelity speech retention, and generalization to varied acoustic conditions, establishing a foundation for next-generation intelligent audio control systems.