Complex LSTM Recurrent Block
- A complex LSTM recurrent block is a module that processes complex-valued time-frequency data by jointly modeling amplitude and phase information.
- It is applied in binaural speech enhancement to preserve spatial cues, such as interaural time and level differences, while producing clearer audio signals.
- The block generalizes LSTM equations to the complex domain using adapted activations, ensuring coherent phase evolution and robust performance in dynamic acoustic scenes.
A complex LSTM recurrent block is an architectural variant of the conventional long short-term memory (LSTM) module in recurrent neural networks (RNNs), specifically designed to operate on complex-valued data. In the context of advanced audio and speech processing tasks—such as binaural speech enhancement—complex LSTM blocks are structured to model and propagate both magnitude and phase information jointly within the time-frequency (TF) domain. This joint modeling is critical for tasks that depend on fine-grained signal phase characteristics, including preservation and enhancement of spatial auditory cues such as interaural time differences (ITD) and interaural level differences (ILD) (Tokala et al., 26 Jul 2025).
1. Motivation for Complex-Valued Recurrent Blocks
In standard speech enhancement pipelines, RNN architectures—such as LSTM and GRU—typically process the magnitude spectrogram of audio signals while often ignoring or simply recycling the (noisy) phase. For monaural enhancement, this can be partially sufficient, but for applications where spatial hearing is paramount, such as binaural audio processing, the interaural phase and level differences in the TF domain are essential for accurate localization and for maintaining a natural spatial audio scene (Tokala et al., 26 Jul 2025). Real-valued recurrent blocks are fundamentally limited in their capacity to preserve these cues. The introduction of complex LSTM blocks addresses this limitation by operating natively on complex-valued TF representations, learning transformations for both amplitude and phase.
2. Mathematical Structure of the Complex LSTM Block
The canonical LSTM equations are generalized to the complex domain. For each time step $t$:
- Input: $x_t \in \mathbb{C}^{d}$ (complex TF features)
- Hidden state: $h_t \in \mathbb{C}^{n}$
- Cell state: $c_t \in \mathbb{C}^{n}$
The gating operations and candidate update are computed as follows:
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where all weights $W_*$ and $U_*$, biases $b_*$, cell states $c_t$, and activations are complex-valued, and $\odot$ denotes element-wise multiplication (Tokala et al., 26 Jul 2025). The activation functions $\sigma$ (sigmoid) and $\tanh$ must be adapted to operate over complex numbers. Typical approaches apply the function separately to the real and imaginary parts, or leverage alternative formulations such as magnitude–phase parameterizations.
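A single recurrence step of this kind can be sketched in NumPy using the split-activation convention (real sigmoid and tanh applied to real and imaginary parts independently). The function and parameter names, as well as the dimensions, are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def split_sigmoid(z):
    """Split activation: real-valued sigmoid applied to the real and
    imaginary parts of z independently (a common complex-gate choice)."""
    s = lambda x: 1.0 / (1.0 + np.exp(-x))
    return s(z.real) + 1j * s(z.imag)

def split_tanh(z):
    """Split tanh applied separately to real and imaginary parts."""
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def complex_lstm_step(x_t, h_prev, c_prev, params):
    """One step of a complex LSTM cell: the gate equations mirror the
    real-valued LSTM, with complex matrix products and states."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wc, Uc, bc = params
    i_t = split_sigmoid(Wi @ x_t + Ui @ h_prev + bi)   # input gate
    f_t = split_sigmoid(Wf @ x_t + Uf @ h_prev + bf)   # forget gate
    o_t = split_sigmoid(Wo @ x_t + Uo @ h_prev + bo)   # output gate
    c_tilde = split_tanh(Wc @ x_t + Uc @ h_prev + bc)  # candidate update
    c_t = f_t * c_prev + i_t * c_tilde                 # element-wise complex products
    h_t = o_t * split_tanh(c_t)
    return h_t, c_t
```

All products here are genuinely complex, so the hidden state carries both amplitude and phase information forward in time.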
3. Signal Domain and Recurrent Learning
The block is integrated into architectures that process short-time Fourier transform (STFT) representations of multichannel audio, ensuring that both input and all internal states remain complex throughout the encoder–LSTM–decoder pipeline. This allows the LSTM recurrence to model signal transformations that involve coherent phase evolution as well as amplitude changes across time frames.
The block is trained to output complex ratio masks (CRM) for each output channel (e.g., left and right in binaural settings) that are applied directly in the TF domain to reconstruct enhanced signals. This approach enables the recovery of speech with improved intelligibility and, crucially, with preserved or enhanced interaural cues.
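Applying a complex ratio mask is a single complex multiplication per TF bin, which rescales the magnitude and rotates the phase simultaneously. A minimal sketch for the binaural case, with illustrative array shapes not taken from the paper:

```python
import numpy as np

def enhance_binaural(noisy_stft, crm):
    """Apply per-channel complex ratio masks in the TF domain.

    noisy_stft, crm: complex arrays of shape (2, freq, time) for the
    left/right channels (shapes are an assumption for illustration).
    Each complex product rescales magnitude and rotates phase jointly.
    """
    assert noisy_stft.shape == crm.shape
    return crm * noisy_stft
```

For example, a mask value of $0.5\,e^{j\pi/4}$ halves a bin's magnitude and shifts its phase by 45 degrees, which is exactly the kind of joint correction a magnitude-only mask cannot express.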
4. Preservation of Spatial Auditory Cues
In binaural audio, the perceptual localization of sound sources relies on the maintenance of subtle spectral and phase relationships between channels. By employing a complex-valued recurrent block, the network captures temporal dependencies in both amplitude and phase. Empirically, this design leads to more consistent preservation of interaural differences across time, resulting in enhanced spatial audio scenes after denoising (Tokala et al., 26 Jul 2025).
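The interaural cues in question can be computed per TF bin from the binaural STFT pair using the standard definitions (level ratio in dB, phase of the cross-channel product); this sketch is a generic illustration, not the paper's evaluation code:

```python
import numpy as np

def interaural_cues(stft_left, stft_right, eps=1e-12):
    """Per-bin interaural level difference (dB) and interaural phase
    difference (rad) from a binaural STFT pair. ITD can be estimated
    from the IPD slope across frequency; here we return raw per-bin cues."""
    ild_db = 20.0 * np.log10((np.abs(stft_left) + eps) / (np.abs(stft_right) + eps))
    ipd = np.angle(stft_left * np.conj(stft_right))
    return ild_db, ipd
```

Comparing these quantities before and after enhancement gives a direct picture of how well a system retains the spatial scene.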
Such preservation is unattainable for real-valued LSTMs that estimate only magnitude masks while possibly introducing phase inconsistencies or artifacts if the noisy phase is reused. The effectiveness of this joint modeling is demonstrated through significant improvements in both objective spatial metrics and speech intelligibility scores when compared to real-valued or magnitude-only baselines, such as BSOBM and BiTasNet (Tokala et al., 26 Jul 2025).
5. Training and Implementation Considerations
The operational complexity of complex LSTM blocks arises from the need to manage complex-valued weights and activations. Specialized initialization, normalization, and activation function designs are required to mitigate potential issues such as exploding or vanishing gradients in the complex domain. For example, implementation may utilize split activations (applying the function to the real and imaginary components separately) or magnitude-phase decompositions for both gates and candidate cell updates.
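The two activation designs mentioned above differ in how they treat phase. A split activation squashes real and imaginary parts independently (and so can distort phase), whereas a magnitude-phase design squashes only the modulus and passes the phase through unchanged. A minimal sketch of the contrast, with function names chosen for illustration:

```python
import numpy as np

def split_tanh(z):
    """Split design: tanh applied to real and imaginary parts independently."""
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def modulus_tanh(z, eps=1e-12):
    """Magnitude-phase design: bound the modulus with tanh while leaving
    the phase angle untouched, so phase information is preserved exactly."""
    mag = np.abs(z)
    return np.tanh(mag) * z / (mag + eps)
```

Bounding the modulus (rather than each component) also helps keep recurrent dynamics stable, since the gate outputs cannot grow without limit in either magnitude direction.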
Despite the increased computational and memory demands compared to real-valued variants, practical deployment is justified by robust performance: enhanced system outputs with reduced noise and better spatial fidelity. The architecture incorporates an encoder-decoder (often convolutional) backbone, with the complex LSTM block situated as a bottleneck module bridging the encoded feature maps back to the reconstructed output.
6. Comparison to Alternative Architectures
Baseline methods for binaural enhancement often estimate real masks and treat phase separately, leading to suboptimal spatial reconstruction. The complex LSTM block offers a unified framework that simultaneously handles amplitude and phase, yielding superior results both in terms of speech clarity and preservation of localization cues (Tokala et al., 26 Jul 2025).
In contrast, approaches that ignore phase (or reconstruct it in post-processing) cannot match the preservation of interaural spectral differences provided by complex-valued recurrence. A further distinction is that complex LSTM accommodates the temporal evolution of spatial features, which is critical in dynamic acoustic scenes.
7. Summary of Empirical Findings
Experimental analysis reveals that complex LSTM recurrent blocks in binaural speech enhancement systems achieve:
- Improved estimated speech intelligibility
- Superior noise suppression relative to real-valued baseline models
- Preserved spatial information, reflected in better retention of ITD and ILD cues
- Robustness across a range of noise conditions, including isotropic noise and single-target scenarios (Tokala et al., 26 Jul 2025)
While challenges remain in the efficient training and deployment of complex LSTM architectures—particularly in designing stable complex-valued activations—the empirical benefits for spatial hearing applications are substantial. This suggests that complex recurrent blocks are a critical enabling technology in advanced binaural audio and speech enhancement systems.