
End-to-End Waveform Processing

Updated 14 September 2025
  • End-to-end waveform processing is a deep learning approach that learns task-specific features directly from raw signal inputs without relying on fixed, hand-crafted transforms.
  • It employs multi-scale convolutional architectures that balance temporal and frequency resolution, yielding measurable gains in speech recognition, source separation, and synthesis.
  • Adaptive, interpretable filter learning and unified end-to-end optimization improve performance while introducing challenges such as increased computational cost and data demands.

End-to-end waveform processing refers to a class of deep learning methodologies in which neural architectures are designed to operate directly on raw waveform signals, learning hierarchical and task-specific features without recourse to traditional, hand-crafted intermediate representations such as spectrograms, Fourier transforms, or other signal-processing-derived features. This paradigm has been successfully adopted across diverse domains, including automatic speech recognition (ASR), source separation, text-to-speech (TTS) synthesis, music information retrieval, radar, and wireless communication, and is characterized by its capacity to jointly optimize all processing stages in a task-driven manner.

1. Architectural Foundations and Feature Extraction

Conventional signal processing pipelines apply fixed transformations (e.g., the short-time Fourier transform, mel-filterbanks, or MFCCs) early in the workflow, decoupling feature extraction from task-dependent learning. In contrast, end-to-end waveform processing architectures learn all transformations directly from the data. The foundational building block is typically a convolutional front end, which replaces fixed analytic filters with learnable kernels applied to the waveform:

$$y[t] = \sum_{n} w[n]\, x[t-n]$$

where $x[t]$ is the input signal and $w[n]$ denotes a learnable filter. Importantly, the stride (downsampling interval) and filter length can be decoupled, enabling independent control of temporal and frequency resolution (Zhu et al., 2016). For instance, reducing the stride (e.g., to sub-millisecond values) preserves high temporal resolution, while increasing the number or length of filters improves frequency selectivity.
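
A minimal sketch of such a front end, assuming PyTorch; the specific sizes (80 filters, a 25 ms kernel, a 1 ms stride at 16 kHz) are illustrative choices, not values taken from the cited work:

```python
import torch
import torch.nn as nn

class WaveformFrontEnd(nn.Module):
    """Learnable convolutional front end on raw waveforms.

    Kernel length (frequency selectivity) and stride (temporal
    resolution) are decoupled and can be set independently.
    """
    def __init__(self, n_filters=80, kernel_ms=25.0, stride_ms=1.0,
                 sample_rate=16000):
        super().__init__()
        kernel_size = int(sample_rate * kernel_ms / 1000)  # 400 samples
        stride = int(sample_rate * stride_ms / 1000)       # 16 samples
        self.conv = nn.Conv1d(1, n_filters, kernel_size,
                              stride=stride, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, samples) -> (batch, n_filters, frames)
        return torch.relu(self.conv(x.unsqueeze(1)))

# One second of 16 kHz audio yields ~1000 frames of 80-dim features.
feats = WaveformFrontEnd()(torch.randn(4, 16000))
print(feats.shape)  # torch.Size([4, 80, 1001])
```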

Modern variants include multi-scale processing where multiple banks of convolutional filters, each with distinct window sizes and strides, are applied in parallel to represent various temporal and spectral scales. This architecture allows, for example, dedicated high-frequency filters (short window/small stride) and low-frequency filters (long window/large stride), whose outputs are subsequently downsampled to a common frame rate for further processing. Multi-scale approaches have demonstrated substantial improvements over single-scale or spectrogram-based front ends, notably achieving a 20.7% relative reduction in word error rate (WER) on difficult speech tasks (Zhu et al., 2016).
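
A hedged sketch of the multi-scale idea, again in PyTorch; the two scales, filter counts, and pooling factors below are placeholders chosen for readability rather than settings from (Zhu et al., 2016):

```python
import torch
import torch.nn as nn

class MultiScaleFrontEnd(nn.Module):
    """Parallel convolutional banks at different temporal scales,
    pooled to a common frame rate and concatenated."""
    def __init__(self, frame_hop=160):  # common hop: 10 ms at 16 kHz
        super().__init__()
        # (kernel, stride): short window/small stride for high frequencies,
        # long window/large stride for low frequencies.
        scales = [(80, 40), (400, 160)]
        self.banks = nn.ModuleList(
            nn.Conv1d(1, 40, k, stride=s, padding=k // 2) for k, s in scales)
        # Pool each bank's output down to the shared hop of `frame_hop` samples.
        self.pools = nn.ModuleList(
            nn.MaxPool1d(frame_hop // s) for _, s in scales)

    def forward(self, x):
        x = x.unsqueeze(1)  # (batch, 1, samples)
        outs = [pool(torch.relu(bank(x)))
                for bank, pool in zip(self.banks, self.pools)]
        n = min(o.shape[-1] for o in outs)  # align frame counts
        return torch.cat([o[..., :n] for o in outs], dim=1)

feats = MultiScaleFrontEnd()(torch.randn(2, 16000))
print(feats.shape)  # torch.Size([2, 80, 100]): two 40-filter banks
```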

2. Domain-Specific Specialization and Task-Driven Architectures

The flexibility of end-to-end waveform models enables their adaptation to diverse tasks:

  • Speech and Music Recognition: In ASR and music auto-tagging, networks composed of deep stacks of short (e.g., 3×1) filters ("sample-level convolution") enable hierarchical feature learning directly from the waveform. The choice of filter shape and stack depth allows the network to capture both low-level periodicities and high-level structures (Pons et al., 2017).
  • Source Separation: Raw waveform separation networks have replaced STFT-based pipelines using convolutional (and transposed convolutional) layers trained to extract latent modulations and phase-like carriers directly. Masking approaches, in which the model predicts and applies time-frequency masks to disentangle source signals in the latent domain, further enhance both objective and perceptual separation performance (Venkataramani et al., 2018).
  • Waveform Synthesis and TTS: Autoregressive and normalizing-flow-based models now generate speech waveforms directly from text, learning block-wise non-overlapping waveform segments without recourse to intermediate acoustic representations (e.g., spectrograms). These models optimize the negative log-likelihood of the raw waveform, capturing both short and long-term dependencies in a unified fashion (Weiss et al., 2020).
  • Acoustic Scene and Sound Source Analysis: Specialized convolutional front ends, such as interpretable FIR-conv layers parameterized by Sinc functions and windowing (Vu et al., 3 May 2024), have demonstrated efficacy in domains ranging from speech emotion recognition to biomedical waveform analysis, with the added benefit of filter interpretability (see the sketch after this list).
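
As one concrete instance of the last item, a simplified Sinc-parameterized FIR layer in PyTorch: each filter is a band-pass defined by a learnable low cutoff and bandwidth, so learned filters can be read off directly. This is a sketch in the spirit of such layers; initialization and windowing details differ across published implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """FIR front end with band-pass filters parameterized by learnable
    low cutoff and bandwidth (Hz), shaped by a fixed Hamming window."""
    def __init__(self, n_filters=32, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.sample_rate = sample_rate
        self.f_low = nn.Parameter(
            torch.linspace(30.0, sample_rate / 2 - 300.0, n_filters))
        self.f_band = nn.Parameter(torch.full((n_filters,), 100.0))
        n = torch.arange(kernel_size) - kernel_size // 2  # symmetric time axis
        self.register_buffer("n", n.float())
        self.register_buffer(
            "window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):
        f1 = torch.abs(self.f_low) / self.sample_rate          # normalized
        f2 = f1 + torch.abs(self.f_band) / self.sample_rate

        def lowpass(fc):  # ideal low-pass: h[n] = 2*fc*sinc(2*fc*n)
            return 2 * fc.unsqueeze(1) * torch.sinc(2 * fc.unsqueeze(1) * self.n)

        kernels = (lowpass(f2) - lowpass(f1)) * self.window    # band-pass
        return F.conv1d(x.unsqueeze(1), kernels.unsqueeze(1), stride=10)

out = SincConv()(torch.randn(1, 16000))
print(out.shape)  # torch.Size([1, 32, 1575])
```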

3. Performance Advantages and Trade-Offs

Across multiple application domains, end-to-end waveform-based models have set new baselines and, in certain cases, delivered superior performance compared to traditional hand-crafted front ends:

| Domain | Baseline (Traditional) | End-to-End Waveform Gain |
| --- | --- | --- |
| Speech Rec. | Mel-spectrogram, MFCC | WER improved by 7.9–28.1% rel. (Lam et al., 2021; Zeghidour et al., 2018) |
| Source Sep. | STFT-domain with masking | Subjective/objective separation ↑ (Venkataramani et al., 2018; Lluís et al., 2018) |
| Music Tagging | Log-mel, domain-kernel CNNs | Outperforms with large data (Pons et al., 2017) |
| TTS | Tacotron+vocoder cascade | MOS near state of the art, faster synthesis (Weiss et al., 2020) |
| SER/Health | MFCC, mel-spectrogram + classifier | Unweighted accuracy up to 7% ↑, F1 > 92% (Vu et al., 3 May 2024) |

Superior performance is typically realized when architectures are allowed to exploit large datasets that support the increased parameterization and reduced inductive bias inherent in waveform-based models. For music tagging, waveform models outperform domain-kernel spectrogram models when trained on >1 million examples, but perform suboptimally in small-data scenarios, where the absence of domain knowledge hampers generalization (Pons et al., 2017). Conversely, in ASR and waveform source separation, end-to-end designs enable joint learning of time-frequency decomposition, masking, and denoising, yielding measurable gains in both objective metrics (e.g., WER, SI-SNR, F1) and listener preference.

This flexibility does come at computational cost. Decoupling stride and filter size for improved temporal or frequency resolution increases both the memory footprint and FLOP count of the front-end convolutional layers. Mitigations include aggressive max-pooling, feature bottleneck layers, or hybrid architectures (e.g., merging learned features with analytic transforms) (Zhu et al., 2016).

4. Adaptive and Interpretable Feature Learning

A defining property of end-to-end waveform models is their capacity to discover task-optimal features. Unlike spectrogram-based pipelines, which impose fixed time-frequency trade-offs, convolutional front ends can learn spectral transformations tuned to the task and dataset, evidenced by:

  • Specialization of multiscale filter banks for specific frequency bands (Zhu et al., 2016, Lam et al., 2021).
  • Adaptive bandpass filtering tightly linked to classification or regression outcomes, with interpretable learned window functions and spectral regions (Vu et al., 3 May 2024).
  • Data-driven frequency analysis, where entirely learnable kernels supplant fixed auditory-inspired filterbanks (Vecchiotti et al., 2019).

The learned filters are often visualizable and physically interpretable, facilitating diagnosis and trust in critical domains such as heart sound detection and biomedical monitoring.
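
A small illustration of this, assuming NumPy: the dominant frequency of each learned kernel can be read off its FFT magnitude, turning a trained front end (such as the Conv1d weights sketched above) into an inspectable filterbank.

```python
import numpy as np

def filter_peak_frequencies(kernels, sample_rate=16000, n_fft=1024):
    """Dominant frequency (Hz) of each learned filter.

    kernels: (n_filters, kernel_size), e.g. a trained Conv1d's
    weight[:, 0] detached to NumPy.
    """
    spectra = np.abs(np.fft.rfft(kernels, n=n_fft, axis=-1))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return freqs[spectra.argmax(axis=-1)]

# With random (untrained) kernels the peaks are scattered; trained
# front ends typically concentrate filters in task-relevant bands.
peaks = filter_peak_frequencies(np.random.randn(80, 400))
print(np.sort(peaks)[:5])
```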

5. Methodological Innovations: Training, Losses, and Constraints

End-to-end waveform processing has spurred novel loss functions, constraints, and optimization methods suited to high-dimensional, physically grounded waveforms:

  • Composite Cost Functions: In source separation, waveform-centric losses composed of SDR, SIR, SAR, and STOI terms directly reflect the desired perceptual and interference-suppression qualities (Venkataramani et al., 2018); a representative waveform-domain term is sketched below.
  • Augmented Lagrangian/Constraint Handling: In RF communications, joint optimization of pulse shaping and constellation geometry (subject to energy, spectral leakage, and power envelope constraints) is achieved via an augmented Lagrangian setup, enabling order-of-magnitude improvements in out-of-band emission (Aoudia et al., 2021).
  • Reinforcement Learning and Alternating Optimization: Joint transmitter-receiver design in radar or wireless power transfer systems is performed via alternating supervised (detector) and reinforcement learning (waveform), adapting waveforms to channel statistics and environmental constraints (Jiang et al., 2019, Jiang et al., 2021, Khattak et al., 9 May 2024).
  • Multi-step and Parallel-in-Time Training: For numerical simulation, end-to-end models integrating coarse solvers with upsampling networks employ multi-step temporal loss schedules and parallel-in-time correction (Parareal) for stable and efficient high-fidelity wave propagation (Kaiser et al., 4 Feb 2024).

These approaches couple domain-specific priors, optimization strategies, and end-to-end training to achieve robust and efficient inference directly on raw waveforms.
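
As a representative waveform-centric term of the kind used in such composite costs, here is a negative SI-SNR loss in PyTorch, a standard separation objective; it stands in for, rather than reproduces, the exact SDR/SIR/SAR/STOI composites in the cited work.

```python
import torch

def neg_si_snr(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR over raw waveforms.

    estimate, target: (batch, samples). Both are mean-centered, the
    estimate is projected onto the target, and the residual counts as noise.
    """
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()  # minimizing this maximizes SI-SNR

loss = neg_si_snr(torch.randn(4, 16000), torch.randn(4, 16000))
```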

6. Application Domains and Broader Implications

End-to-end waveform processing has demonstrated impact across:

  • Speech: ASR, TTS, speaker identification, source separation, speech enhancement.
  • Music: Tagging, instrument/vocal source separation, high-fidelity singing synthesis (Zhang et al., 2022).
  • Biomedical: Heart sound classification, respiratory waveform estimation via diffusion models (Miao et al., 6 Oct 2024).
  • Communications/Radar: Wireless power transfer waveform and beamforming, joint waveform/detector design, robust operation under hardware/infrastructure and spectral constraints (Jiang et al., 2019, Khattak et al., 9 May 2024).
  • Scientific Computing: Accelerated (in-time and in-space) numerical wave propagation with neural upsampling correctors (Kaiser et al., 4 Feb 2024).

The broader methodological shift toward raw waveform input not only enables direct exploitation of all available information (including phase, fine-scale timing, and higher-order signal statistics), but also paves the way for architectures capable of scale-adaptive, task-specific, and interpretable feature extraction. In domains with sufficient data and computation, this suggests a trajectory in which explicit feature engineering is supplanted by integrated, learnable, and domain-adaptive front ends. Nevertheless, domain-specific inductive biases remain useful in low-resource settings, or where interpretability and regulatory constraints require explicit control, as in hybrid approaches and constrained kernel architectures.

7. Limitations, Challenges, and Future Prospects

Despite strong empirical results, several challenges persist:

  • Data and Computation Demands: Performance gains are most pronounced in large-data regimes; under low data or resource constraints, domain knowledge (e.g., spectrogram priors) can outperform fully learnable front ends (Pons et al., 2017).
  • Real-Time and Resource Limits: Supporting high-throughput or latency-critical applications requires architectural pruning, bottleneck layers, or hardware co-design.
  • Preventing Artifacts: Direct waveform models, particularly in source separation or synthesis (TTS, SVS), can introduce artifacts or fail to impose consistent phase or energy statistics. Hybrid architectures that introduce analytic priors (e.g., DSP synthesizers in singing synthesis (Zhang et al., 2022)) or loss regularization (spectral loss terms, phase constraints) have proven effective countermeasures; a representative spectral regularizer is sketched after this list.
  • Interpretability and Trust: With the increasing deployment of end-to-end models in critical applications, avenues for visualizing and constraining learned features become essential (Vu et al., 3 May 2024).
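
As an example of the spectral-loss regularization mentioned above, a multi-resolution STFT magnitude loss in PyTorch; this is a common recipe in waveform synthesis, sketched here with illustrative FFT sizes rather than settings from any particular paper.

```python
import torch

def multi_res_stft_loss(estimate, target, fft_sizes=(512, 1024, 2048)):
    """Average spectral-convergence + log-magnitude loss over several
    STFT resolutions, penalizing artifacts across time-frequency scales."""
    total = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=estimate.device)

        def mag(x):  # magnitude spectrogram at this resolution
            return torch.stft(x, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True).abs()

        s_est, s_tgt = mag(estimate), mag(target)
        total = total + (s_est - s_tgt).norm() / (s_tgt.norm() + 1e-8)
        total = total + (s_est.clamp(min=1e-5).log()
                         - s_tgt.clamp(min=1e-5).log()).abs().mean()
    return total / len(fft_sizes)

reg = multi_res_stft_loss(torch.randn(2, 16000), torch.randn(2, 16000))
```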

Continued research is likely to expand upon multi-scale, interpretable, and hybrid architectures, as well as the incorporation of domain knowledge through differentiable signal processing layers and physics-inspired constraints. The shifting boundary between hand-crafted signal analysis and fully learnable front ends remains central to the next phase of audio, speech, and waveform modeling research.
