
End-to-End Acoustic Echo Cancellation

Updated 30 January 2026
  • E2E-AEC is a unified neural framework that integrates time-frequency masking, attention-based alignment, and speech activity awareness for echo and noise suppression.
  • It employs diverse architectures—including time-frequency, time-domain, and hybrid schemes with adaptive filtering—to meet real-time, low-latency requirements.
  • Models are trained with joint loss functions like SI-SNR and perceptual metrics, ensuring robust performance and high speech quality in challenging acoustic environments.

End-to-End Acoustic Echo Cancellation (E2E-AEC) refers to neural architectures and training schemes that perform all acoustic echo suppression and, optionally, noise suppression and related enhancement tasks in a unified, data-driven pipeline. Such systems eschew traditional cascades of linear adaptive filters and post-processing in favor of joint, fully differentiable models optimized on global criteria. E2E-AEC methods typically integrate time-frequency masking, attention-based alignment, and speech activity awareness, and are designed to operate causally and in real time. The field encompasses both pure end-to-end neural AEC and hybrid approaches that tightly integrate neural modules with adaptive filtering or spatial processing.

1. Architectural Paradigms

E2E-AEC systems vary in their core architectural choices, with primary distinctions along the following axes:

  • End-to-end mask-based models operating in the time-frequency domain: These models, such as DTLN and complex-masking neural architectures, predict multiplicative masks applied to the STFT representations of the microphone and far-end signals. Core components are LSTM stacks or convolutional recurrent layers, with dual-branch or cascade module designs to refine both magnitude and phase, as in "Acoustic echo cancellation with the dual-signal transformation LSTM network" (Westhausen et al., 2020) and complex modular networks with mask optimization (Liu et al., 2022).
  • Fully time-domain neural architectures: These models (e.g., EchoFilter (Ma et al., 2021), waveform-domain conformer-based AEC (Panchapagesan et al., 2022)) process raw audio samples directly via temporal convolutional encoders and mask estimators, sometimes using LSTM or conformer-based sequence modeling with stride-optimized front ends for low latency.
  • Hybrid schemes with explicit adaptive filtering: Some approaches, such as NN3A (Wang et al., 2021) and MC-TCN (Shu et al., 2021), first perform classical adaptive filtering (e.g., weighted-RLS, multi-delay frequency-domain filters) to remove the linear echo component, then employ a neural module for residual echo and noise suppression. In real-time deployments, such separation increases robustness across a wide range of acoustic conditions.
  • Attention-based and time-delay compensating models: Attention blocks (e.g., TF-GridNet attention in E2E-AEC (Jiang et al., 23 Jan 2026), local attention in EchoFilter (Ma et al., 2021), or multi-scale attention (Ma et al., 2021)) are critical in managing the variable alignment between reference and microphone signals, handling unknown system delays and device nonlinearity.
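The mask-based paradigm above reduces to predicting a bounded multiplicative gain per TF bin. A minimal numpy sketch, not taken from any cited model: `stft` and the oracle mask below are illustrative stand-ins for the network's learned analysis front end and mask predictor.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive STFT: Hann-windowed frames followed by an rFFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def oracle_mask_suppression(mic, near, n_fft=256, hop=128):
    """Apply an ideal-ratio-style magnitude mask to the mic STFT.
    In a neural E2E-AEC system the mask would be predicted from
    (mic, far-end) features; here it is computed from the known
    near-end signal purely to illustrate the masking mechanism."""
    M = stft(mic, n_fft, hop)
    S = stft(near, n_fft, hop)
    mask = np.clip(np.abs(S) / (np.abs(M) + 1e-8), 0.0, 1.0)  # gain in [0, 1]
    return mask * M  # masked (echo-suppressed) STFT

# Toy signals: near-end speech and far-end echo at different frequencies
fs = 8000
t = np.arange(fs) / fs
near = np.sin(2 * np.pi * 440 * t)           # near-end talker
echo = 0.8 * np.sin(2 * np.pi * 1800 * t)    # far-end echo leakage
mic = near + echo
out = oracle_mask_suppression(mic, near)
# Residual echo energy in `out` is far below that of the raw mic STFT
```

A real system would estimate the mask with an LSTM or convolutional-recurrent stack; the oracle mask only shows how the multiplicative suppression is applied.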

2. Signal Representation, Input Features, and Alignment

E2E-AEC systems exploit differing input feature choices:

  • STFT-based features: Most models operate in the time-frequency domain, stacking real and imaginary parts, magnitude, and phase from both reference and mixture signals as inputs. Time-frequency features are often concatenated, sometimes augmented with auxiliary quantities (e.g., the error signal after initial echo suppression, or concatenations of P, Q, P+Q, P–Q as in cD3Net (Watcharasupat et al., 2021)).
  • Time-domain representations: Encoders map waveform frames to learned latent spaces (e.g., 1D convolutions in EchoFilter, small-stride analysis in conformer-based models), minimizing need for manual feature engineering.
  • Alignment and uncertainty compensation: Time-delay compensation, attention, or explicit alignment modules address the critical challenge of synchronizing the far-end reference and microphone signals. In (Jiang et al., 23 Jan 2026), a dedicated attention mechanism with a supervised delay-alignment loss enforces correct alignment, optimizing both echo removal and latency.
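The alignment problem that these attention and delay-compensation modules solve is classically handled with generalized cross-correlation. A GCC-PHAT sketch in numpy illustrates it on a toy delayed-copy signal; the function name is ours:

```python
import numpy as np

def gcc_phat_delay(mic, ref, max_delay):
    """Estimate the mic/far-end lag (in samples) with GCC-PHAT.
    Neural E2E-AEC systems replace this with attention or learned
    alignment; this classical estimator illustrates the task."""
    n = len(mic) + len(ref)
    X = np.fft.rfft(mic, n) * np.conj(np.fft.rfft(ref, n))
    X /= np.abs(X) + 1e-12                      # phase-transform weighting
    cc = np.fft.irfft(X, n)
    # Gather lags in [-max_delay, +max_delay] and pick the peak
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay

rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)                 # far-end reference
true_delay = 137                                # unknown loudspeaker-to-mic lag
mic = np.concatenate((np.zeros(true_delay), 0.7 * ref))[:4000]
est = gcc_phat_delay(mic, ref, max_delay=512)
# est recovers the true 137-sample delay
```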

3. Loss Functions and Training Objectives

E2E-AEC models are typically trained with global, task-driven loss criteria, including:

  • Scale-invariant SNR (SI-SNR) and SDR-based losses: These time-domain criteria directly measure echo suppression and are used in virtually all contemporary E2E-AEC systems, often combined with MSE or complex mask MSE in the frequency domain.
  • Perceptual and auxiliary losses: Integration of perceptually-driven losses, such as PMSQE or auxiliary ASR loss (in (Panchapagesan et al., 2022)), allows models to optimize for perceived speech quality and recognition relevance. Adaptive speech-quality weights (e.g., SDW and ESW in (Liu et al., 2022)) modulate loss terms based on signal characteristics in each TF bin.
  • Contrastive learning: Supervised InfoNCE contrastive terms encourage invariance to far-end playback variations while maintaining sensitivity to near-end speech, as in (Liu et al., 2022), which empirically leads to higher perceptual quality and more robust separation.
  • Multi-task objectives: Some frameworks add double-talk detection or VAD prediction losses (e.g., cross-entropy on speech-activity states in (Wang et al., 2021, Ma et al., 2021, Jiang et al., 23 Jan 2026)), integrating explicit speech presence awareness.
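SI-SNR, the most common of these criteria, has a compact closed form; its negative serves as the training loss. A minimal numpy sketch on toy signals:

```python
import numpy as np

def si_snr(est, target, eps=1e-8):
    """Scale-invariant SNR in dB: project the estimate onto the
    target, then compare projected energy to residual energy."""
    target = target - target.mean()
    est = est - est.mean()
    s = np.dot(est, target) / (np.dot(target, target) + eps) * target
    e = est - s
    return 10 * np.log10((np.dot(s, s) + eps) / (np.dot(e, e) + eps))

t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 300 * t)                 # near-end target
noisy = clean + 0.1 * np.sin(2 * np.pi * 2000 * t)  # with residual echo
score = si_snr(noisy, clean)
# Rescaling the estimate leaves the score unchanged (scale invariance)
```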

4. Progressive, Hybrid, and Multi-stage Strategies

Several E2E-AEC pipelines exploit multi-stage learning, curriculum, or cascaded refinement:

  • Progressive multi-targeting: E2E-AEC (Jiang et al., 23 Jan 2026) uses progressive learning, first targeting echo-suppressed but noisy speech, then clean speech, with distinct spectral loss terms at each stage.
  • Cascade and joint modules: MC-TCN (Shu et al., 2021) stacks magnitude and complex mask estimation cores to first resolve strong magnitude residuals, then enhance phase and subtle artifacts.
  • Multi-task and auxiliary branches: Integrated VAD, double-talk detection, or speech–interference collaboration modules (as in CMNet (Han et al., 2023)) ensure the model is aware of conversational state, minimizing speech distortion during overlapping far-end and near-end activity.
  • Hybrid adaptation control: End-to-end DNN adaptation of linear adaptive filters, as in (Haubner et al., 2023), shows that learned step-size control via light-weight per-frequency GRU models can outperform both fixed and analytic adaptive rules, especially in rapidly time-varying and double-talk scenarios.
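The hybrid-control idea rests on a classical NLMS front end whose step size the DNN learns to modulate. A minimal numpy NLMS sketch with a fixed step size, using a toy echo path (function name is ours), shows the linear stage whose error signal feeds the neural residual suppressor:

```python
import numpy as np

def nlms_echo_canceller(mic, far, taps=64, mu=0.5, eps=1e-6):
    """Sample-wise NLMS linear AEC. Hybrid E2E-AEC systems pass the
    error signal e (linear echo removed) to a neural residual-echo
    suppressor; learned-step-size variants replace the fixed mu
    with a per-frame (or per-frequency) DNN output."""
    w = np.zeros(taps)
    e = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far[n - taps + 1:n + 1][::-1]       # recent far-end samples
        y = w @ x                               # linear echo estimate
        e[n] = mic[n] - y                       # residual (error) signal
        w += mu * e[n] * x / (x @ x + eps)      # normalized LMS update
    return e

rng = np.random.default_rng(1)
far = rng.standard_normal(16000)                          # far-end signal
rir = rng.standard_normal(16) * np.exp(-np.arange(16) / 4.0)  # toy echo path
mic = np.convolve(far, rir)[:16000]                       # echo only, no near end
e = nlms_echo_canceller(mic, far)
# After convergence the residual echo is strongly attenuated (high ERLE)
```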

5. Quantitative Evaluation, Deployment, and Open Challenges

The performance of E2E-AEC systems is measured through:

  • Objective metrics: Echo Return Loss Enhancement (ERLE), Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), scale-invariant source-to-distortion ratio (SI-SDR), and word error rate (WER) for ASR-focused models.
  • Subjective MOS: Controlled listening tests (e.g., DECMOS, crowd-sourced ITU-T P.808) provide benchmarks of perceived quality and echo annoyance, with state-of-the-art models reaching or exceeding MOS 4.4–4.6 on AEC-Challenge datasets (Westhausen et al., 2020, Shu et al., 2021, Jiang et al., 23 Jan 2026).
  • Resource efficiency: Model sizes range from 354k parameters for the convolutional cD3Net (Watcharasupat et al., 2021) to ~10M parameters for LSTM/Conv-based STFT models, typically with algorithmic latency ≤ 32 ms and computational cost below 1 GFLOP/s.
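ERLE, the first metric above, is simply the dB ratio of microphone echo energy to post-cancellation residual energy, measured on far-end-only segments. A minimal sketch with an assumed 40 dB amplitude suppression factor:

```python
import numpy as np

def erle_db(mic, residual, eps=1e-12):
    """Echo Return Loss Enhancement in dB, evaluated on segments
    containing only echo (no near-end speech)."""
    return 10 * np.log10(np.mean(mic ** 2) / (np.mean(residual ** 2) + eps))

t = np.arange(8000) / 8000.0
echo = 0.5 * np.sin(2 * np.pi * 500 * t)   # echo-only mic segment
residual = 0.005 * echo                    # residual after suppression
# erle_db(echo, residual) ≈ 46 dB
```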

Open technical challenges include: (1) balancing echo suppression against near-end speech distortion, often tuned via mask-loss weighting (e.g., the α parameter in (Wang et al., 2021)); (2) compressing models for embedded and mobile deployment without loss of quality (Watcharasupat et al., 2021, Haubner et al., 2023); (3) generalizing to real, highly variable device and room conditions, which requires careful dataset synthesis and augmentation (Westhausen et al., 2020, Panchapagesan et al., 2022); and (4) robust operation under strong nonlinearities, background noise, and time-varying delays.

6. Integration with Multi-Channel and Joint Enhancement

E2E-AEC frameworks extend naturally to multi-microphone and joint processing tasks:

  • Joint beamforming and AEC: Fully end-to-end trainable models now integrate multi-channel filtering, beamforming (MVDR, RNN-based), and AEC with double-talk detection, optimizing for SI-SNR, ASR performance, and subjective quality in reverberant, nonlinear environments (Kothapally et al., 2021, Haubner et al., 2022).
  • Unified control with DNN controllers: A single neural network can coordinate not only the AEC filter adaptation but also beamforming weight estimation and postfilter application (as in (Haubner et al., 2022)), jointly trained to maximize speech extraction, minimize residual echo/noise, and accelerate convergence during double-talk.
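The beamforming component such DNN controllers drive is, in its classical narrowband form, the MVDR solution; a numpy sketch of the closed form the learned modules approximate, on an assumed two-mic toy covariance:

```python
import numpy as np

def mvdr_weights(noise_cov, steering, eps=1e-6):
    """Narrowband MVDR weights w = R^{-1} d / (d^H R^{-1} d).
    In joint E2E systems, neural modules supply the covariance
    estimates (or the weights directly); this is the closed form
    they approximate."""
    R = noise_cov + eps * np.eye(noise_cov.shape[0])  # diagonal loading
    Rinv_d = np.linalg.solve(R, steering)
    return Rinv_d / (steering.conj() @ Rinv_d)

# Two-mic toy: unit steering toward the target, diagonal noise covariance
d = np.array([1.0, 1.0]) / np.sqrt(2)
Rn = np.diag([1.0, 4.0])        # second mic is noisier
w = mvdr_weights(Rn, d)
# Distortionless constraint holds (w^H d = 1) while output noise power
# drops below that of plain delay-and-sum weighting
```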

7. Summary Table of Representative E2E-AEC Models

| Model/Approach | Core Architecture | Key Innovations | Notable Metrics |
|---|---|---|---|
| NN3A (Wang et al., 2021) | wRLS + DFSMN | Mask-MSE weighting, joint NS/VAD | ERLE 45 dB, PESQ 2.6 (DT) |
| DTLN-aec (Westhausen et al., 2020) | Stacked LSTM (TF, TD cores) | Dual-path, iLN norm, real-time system | ΔPESQ +0.78, ΔSI-SDR +14.2 dB |
| E2E-AEC (Jiang et al., 23 Jan 2026) | TF-GridNet GRU, attention align | Progressive learning, VAD masking | ERLE 78.7 dB, MOS_avg 4.51 |
| Complex modular net (Liu et al., 2022) | Conv+GRU modular, complex mask | Contrastive & SQ-aware weighting | PESQ 3.46, ESTOI 0.95 |
| Conformer TasNet (Panchapagesan et al., 2022) | Waveform-domain conformer | ASR-guided loss, small frame hop | WER reduction 56–59% vs linear |
| EchoFilter (Ma et al., 2021) | TCNN-LSTM, local attention | Double-talk classification auxiliary | ERLE 78 dB, ΔPESQ +1.6 |
| MC-TCN (Shu et al., 2021) | Cascade mag/complex TCN, adaptive | Magnitude → complex masking, dual core | DECMOS 4.41, ERLE > baseline |

These models demonstrate the breadth of methodological advances in E2E-AEC, from purely neural architectures to tightly integrated hybrid and control schemes. The current research trajectory continues toward highly efficient, streaming E2E-AEC systems robust to complex echo/noise conditions and suitable for deployment in real-world, resource-constrained devices.
