Lightweight Attentive Beamforming Network (LABNet)
- The paper introduces LABNet, a lightweight framework that leverages three-stage processing and cross-channel attention to achieve microphone invariance in real-time speech enhancement.
- LABNet efficiently fuses intra- and inter-channel features through a modular design that adapts to arbitrary microphone arrays while keeping computational complexity low.
- Experimental results demonstrate strong performance, with a PESQ of 2.92 and STOI of 0.93, making LABNet well suited to edge devices and dynamic audio environments.
A Lightweight Attentive Beamforming Network (LABNet) is an efficient deep learning framework for real-time multichannel speech enhancement that is specifically engineered for ad-hoc microphone arrays and stringent computational environments (Yan et al., 22 Jul 2025). LABNet is characterized by a novel three-stage processing architecture built around a cross-channel attention (CCA) module, providing inherent microphone invariance (MI). This architecture allows the system to aggregate acoustic information across arbitrary numbers and configurations of microphones, making it suitable for edge devices and scenarios where array geometry and channel count are not fixed. Its key contribution is the integration of low-complexity attentive mechanisms and a data-efficient fusion of intra- and inter-channel features, establishing a new standard for multichannel speech enhancement under ad-hoc array conditions.
1. Motivation and Problem Formulation
LABNet directly addresses the challenges unique to real-time multichannel speech enhancement in variable and often ad-hoc array topologies. The primary objectives are:
- Microphone invariance: Enable processing with arbitrary microphone numbers and configurations without requiring model retraining or architecture changes.
- Lightweight operation: Ensure suitability for edge-device deployments by minimizing parameters and computational complexity, thereby facilitating real-time operation under limited resource constraints.
- Robust speech enhancement: Exploit spatial and temporal correlations present in multichannel audio for improved speech quality and intelligibility.
Formally, the observed signal at microphone $m$ is modeled as

$$
y_m(t) = h_m(t) * s(t) + n_m(t), \qquad m = 1, \dots, M,
$$

where $s(t)$ is the target speech, $h_m(t)$ the room impulse response for channel $m$, $n_m(t)$ additive noise, $*$ convolution, and $M$ the (potentially variable) number of microphones.
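As a concrete illustration of this signal model, the following minimal NumPy sketch synthesizes a multichannel observation from a clean utterance; the impulse responses, noise, and signal here are random placeholders rather than simulated room acoustics.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, M = 16000, 4                              # sample rate; M microphones (variable in ad-hoc arrays)
s = rng.standard_normal(fs)                   # placeholder clean speech, 1 s
h = rng.standard_normal((M, 256)) * 0.05      # placeholder room impulse responses, one per channel
n = rng.standard_normal((M, fs)) * 0.01       # additive noise, one realization per channel

# y_m(t) = h_m(t) * s(t) + n_m(t)
y = np.stack([np.convolve(h[m], s)[:fs] + n[m] for m in range(M)])
print(y.shape)  # (M, fs): one noisy, reverberant observation per microphone
```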
2. Three-Stage Processing Architecture
LABNet’s technical core consists of a three-stage processing pipeline explicitly designed to disentangle and fuse local and global features for robust enhancement:
Stage 1: Channel-wise Processing
- Each channel’s spectro-temporal representation is independently encoded via shared convolutional layers.
- Channel features are processed by a dual-path recurrent (DPR) module to capture both time- and frequency-axis context (a generic DPR sketch follows this stage's bullets).
- The cross-channel attention module aggregates these into a unified latent “reference” feature vector by selectively pooling the most informative intra-channel cues.
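The paper's exact DPR layout is not reproduced here; the block below is a generic dual-path recurrent sketch (frequency-wise recurrence within each frame, then causal time-wise recurrence per frequency bin, each with a residual connection), with all sizes chosen arbitrarily for illustration.

```python
import torch
import torch.nn as nn

class DualPathRNN(nn.Module):
    """Generic dual-path recurrent block: intra-frame (frequency) then inter-frame (time) modeling."""
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.freq_rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, dim)
        self.time_rnn = nn.GRU(dim, hidden, batch_first=True)   # unidirectional to stay causal
        self.time_proj = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        y = x.reshape(b * t, f, d)                               # frequency path within each frame
        y = self.freq_proj(self.freq_rnn(y)[0]).reshape(b, t, f, d) + x
        z = y.permute(0, 2, 1, 3).reshape(b * f, t, d)           # time path per frequency bin
        z = self.time_proj(self.time_rnn(z)[0]).reshape(b, f, t, d).permute(0, 2, 1, 3)
        return z + y
```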
Stage 2: Pair-wise Alignment
- The reference vector is concatenated to each channel’s original processed feature.
- A shared linear layer and additional DPR processing are applied, aligning timing and phase differences between microphones.
- A second CCA module aggregates this set into an enhanced reference representation capturing pair-wise relationships.
Stage 3: Post-Refinement
- The final DPR module refines the enhanced reference representation prior to decoding, emphasizing informative spatial and temporal dependencies.
- The decoder estimates a magnitude spectrum mask for the reference channel, used for signal reconstruction.
The design ensures all three stages preserve information necessary for effective inter-channel fusion while minimizing redundancy.
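To make the stage ordering concrete, the following self-contained PyTorch skeleton traces the data flow end to end; the layer choices (a linear encoder, plain GRUs in place of the DPR modules, torch's built-in multi-head attention for the CCA blocks) and all dimensions are simplifications assumed for illustration, not LABNet's actual configuration.

```python
import torch
import torch.nn as nn

class ThreeStageSketch(nn.Module):
    """Illustrative skeleton of the three-stage flow; layers and sizes are placeholders."""
    def __init__(self, dim: int = 16, heads: int = 2):
        super().__init__()
        self.encoder = nn.Linear(1, dim)                  # stand-in for the shared conv encoder
        self.intra = nn.GRU(dim, dim, batch_first=True)   # stand-in for the per-channel DPR
        self.norm1 = nn.LayerNorm(dim)
        self.cca1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pair = nn.Linear(2 * dim, dim)               # fuses the reference with each channel
        self.norm2 = nn.LayerNorm(dim)
        self.cca2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.post = nn.GRU(dim, dim, batch_first=True)    # stand-in for the post-refinement DPR
        self.decoder = nn.Linear(dim, 1)                  # magnitude mask for the reference channel

    def _pool(self, mha, norm, h):
        # attention-pool (batch*, channels, dim) features into the reference (channel 0) feature
        x = norm(h)
        out, _ = mha(x[:, :1], x, x)
        return out.squeeze(1) + h[:, 0]

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, channels, time, freq) noisy magnitudes; channel 0 is the reference
        b, c, t, f = mag.shape
        h = self.encoder(mag.unsqueeze(-1))                                    # (b, c, t, f, dim)
        h = self.intra(h.reshape(b * c, t * f, -1))[0].reshape(b, c, t, f, -1)
        flat = h.permute(0, 2, 3, 1, 4).reshape(b * t * f, c, -1)              # channels as a set
        ref = self._pool(self.cca1, self.norm1, flat)                          # stage 1
        fused = self.pair(torch.cat([flat, ref.unsqueeze(1).expand_as(flat)], -1))
        ref2 = self._pool(self.cca2, self.norm2, fused)                        # stage 2
        ref2 = self.post(ref2.reshape(b, t * f, -1))[0]                        # stage 3 refinement
        mask = torch.sigmoid(self.decoder(ref2)).reshape(b, t, f)
        return mask * mag[:, 0]                                                # enhanced reference magnitude
```

For example, `ThreeStageSketch()(torch.randn(2, 5, 40, 65))` returns an enhanced (2, 40, 65) reference-channel magnitude, and the same module accepts any other channel count without modification.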
3. Cross-Channel Attention Module
The cross-channel attention (CCA) module is central to microphone invariance and adaptive feature aggregation:
- Inputs: channel features arranged as a tensor of shape (batch, freq/time, channels, features).
- Mechanism:
  - Query (Q): derived from the normalized reference-channel feature.
  - Keys (K), Values (V): formed from all channel features following normalization and a linear transformation.
- Output: multi-head attention (MHA) over Q, K, and V, added back to the reference-channel feature via a residual connection (see the pseudocode below).
- This MHA adaptively and selectively aggregates relevant channel contributions, naturally scaling to a variable channel count thanks to attention pooling's permutation and dimension invariance.
Pseudocode (for illustration):

```
Q = Linear(LN(h_1_in))           # query from the reference-channel feature
K = Linear(LN(h_in))             # keys from all channel features
V = Linear(LN(h_in))             # values from all channel features
h_out = MHA(Q, K, V) + h_1_in    # attention pooling with a residual to the reference
```
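A runnable counterpart to this pseudocode, using PyTorch's built-in multi-head attention (whose internal projections play the role of the Linear layers above) with illustrative dimensions, might look as follows.

```python
import torch
import torch.nn as nn

class CCA(nn.Module):
    """Minimal cross-channel attention sketch; dimensions are illustrative."""
    def __init__(self, dim: int = 16, heads: int = 2):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_in: torch.Tensor) -> torch.Tensor:
        # h_in: (batch, channels, dim); channel 0 is the reference channel
        x = self.ln(h_in)
        q = x[:, :1]                          # query from the normalized reference feature
        h_out, _ = self.mha(q, x, x)          # keys/values from all channels
        return h_out + h_in[:, :1]            # residual connection to the reference feature
```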
4. Microphone Invariance and Ad-Hoc Array Adaptability
LABNet’s architectural invariance to microphone count and geometry arises from:
- Parameter tying across channel encoders and processing blocks.
- CCA’s use of attention pooling, which does not require a fixed channel order or count.
- Absence of explicit assumptions about array geometry, in contrast to classical beamformers that rely on geometry-specific steering vectors.
As a result, LABNet may be deployed in scenarios with:
- Variable numbers of microphones at runtime.
- Non-uniform or dynamically changing array topologies.
- Real-world ad-hoc environments, such as wearable arrays or distributed smart home microphones.
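To make the invariance concrete, the toy check below (using torch's generic multi-head attention as the pooling operator, not LABNet's trained weights) feeds feature sets from arrays of different sizes through the same pooling and obtains a reference feature of identical shape each time.

```python
import torch
import torch.nn as nn

pool = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

def reference_feature(h: torch.Tensor) -> torch.Tensor:
    # h: (batch, channels, dim) with channel 0 as the reference; works for any channel count
    out, _ = pool(h[:, :1], h, h)
    return out.squeeze(1)

for m in (2, 5, 9):                        # three different ad-hoc array sizes
    h = torch.randn(4, m, 16)              # same module, no retraining or reshaping
    print(m, reference_feature(h).shape)   # always torch.Size([4, 16])
```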
5. Computational Complexity and Real-Time Suitability
LABNet achieves significant resource efficiency:
- Model size: ~52k parameters.
- Compute: 0.316G multiply-accumulate operations (MACs) per forward pass.
- Latency: 64 ms (including a single Griffin-Lim phase refinement iteration).
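These budgets are straightforward to verify for any candidate model; a minimal parameter-count check in PyTorch (MAC counts additionally require a profiler such as ptflops or thop) is sketched below with a throwaway module standing in for the network.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Throwaway stand-in module; LABNet itself reports roughly 52k parameters.
print(count_parameters(nn.GRU(input_size=16, hidden_size=32)))
```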
These efficiency characteristics position LABNet for:
- Real-time enhancement on edge devices (e.g., smart home hubs, mobile hardware).
- Embedded applications with low memory and compute budgets.
- Latency-constrained interactive systems.
6. Experimental Results and Comparative Performance
In evaluation with six microphones:
- PESQ: 2.92
- STOI: 0.93
Compared to classical and DNN-based methods such as Oracle MVDR, EaBNet, and McNet, LABNet demonstrates superior enhancement performance at a fraction of the computational cost. Ablation studies confirm the necessity of each stage of the pipeline, particularly the cross-channel attention modules, for optimal results.
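For reference, PESQ and STOI can be computed for any enhanced output with the widely used `pesq` and `pystoi` packages; the file paths below are placeholders and the audio is assumed to be 16 kHz mono.

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

fs = 16000
clean, _ = sf.read("clean.wav")          # placeholder reference recording
enhanced, _ = sf.read("enhanced.wav")    # placeholder enhanced output

print("PESQ (wideband):", pesq(fs, clean, enhanced, "wb"))
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```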
7. Applications and Significance
The flexibility, efficiency, and robustness of LABNet make it suitable for:
- Wearable and consumer devices: Adaptive speech enhancement for dynamic personal audio.
- Ad-hoc conferencing: Microphone-invariant enhancement in variable meeting spaces.
- Robot audition and multi-agent sensing: Real-time speech processing with movable or distributed sensor arrays.
- General audio processing infrastructure where array characteristics cannot be standardized a priori.
LABNet’s design strategy—modular multi-stage processing and attention-driven inter-channel fusion—constitutes a foundation for future development of lightweight, invariant deep beamforming networks adaptable to evolving requirements in real-world, resource-limited environments (Yan et al., 22 Jul 2025).