Unified Convolutional Beamforming

Updated 24 March 2026

Unified convolutional beamforming is a spatial filtering technique that consolidates interference suppression, dereverberation, and source separation into one convolutional FIR framework.
It employs convolutional preprocessing and dimensionality reduction to lower computational complexity while maintaining robust performance for multi-user MIMO and speech enhancement applications.
The method seamlessly integrates classical linear beamforming with deep neural modules, offering adaptable solutions for both far-field and near-field scenarios.

Unified convolutional beamforming encompasses a class of spatial filtering strategies that consolidate multiple spatial and/or temporal filtering operations—such as interference suppression, dereverberation, and source separation—into a single convolutional filtering architecture. This paradigm generalizes classical (instantaneous, linear) beamforming by leveraging finite impulse response (FIR) or convolutive filtering, often integrating optimization, machine learning, and neural network methods. Unified convolutional beamforming is central to advanced multi-user MIMO communications, robust speech enhancement, and modern neural source separation pipelines in both time and frequency domains. Distinctive for their ability to jointly address interference mitigation and spatial selectivity at dramatically reduced computational cost, these frameworks unify treatment of far-field and near-field propagation, and readily interface with neural frontends for end-to-end trainability.

1. System Models and Unified Convolutional Beamforming Architectures

Unified convolutional beamforming architectures are grounded in models where array or multi-antenna observations are filtered via spatio-temporal FIR convolutions. In the context of uplink MU-MIMO, the received signal at a base station with an $N$ -element ULA serving $K$ users is: $\mathbf y = \mathbf H\, \mathbf x + \mathbf n$ with $\mathbf x \in \mathbb C^K$ the user symbols, $\mathbf n$ the complex Gaussian noise, and $\mathbf H \in \mathbb C^{N \times K}$ the channel matrix. Two physically distinct models arise:

Far-field (UPW): Channel columns have Vandermonde structure, supporting standard FIR beamspace techniques.
Near-field (USW): Propagation is dominated by spherical waves, resulting in columns lacking Vandermonde structure and requiring more general filter optimization (Feng et al., 2024).

In speech enhancement and array processing, the STFT-domain microphone signals are written: $\mathbf y_t = \mathbf d_t + \mathbf r_t + \mathbf n_t,$ where $\mathbf d_t$ is the desired direct path and early reflections, $\mathbf r_t$ is late reverberation, and $\mathbf n_t$ is additive noise. The convolutional beamformer applies an FIR filter across stacked microphone and delayed observations: $z_{m,t} = \bar{\mathbf h}_m^H\, \bar{\mathbf y}_t$ subject to a distortionless constraint, as in convolutional WPD (Gode et al., 2021) and online variants (Braun et al., 2021).

Neural and hybrid-neural frameworks extract patches or temporal blocks from data, transform them through CNN/U-Net modules, and aggregate outputs either through conventional apodization or additional neural modules, as in ultrasound image beamforming (Bodepudi et al., 2021), deep universal RF beamforming (Nguyen et al., 2022), and end-to-end speech separation (Gu et al., 2022).

2. Convolutional Preprocessing and Dimensionality Reduction

A defining strategy in unified convolutional beamforming is dimensionality reduction and pre-filtering via convolutional beamspace transforms. For MU-MIMO, an FIR-induced Toeplitz matrix $\mathbf W_{\rm cbs}$ compresses the received dimension from $N$ to $N-L+1$ while performing spatial filtering: $\mathbf y_c = \mathbf W_{\rm cbs}\, \mathbf y = \mathbf H_{\rm eff}\, \mathbf x + \mathbf W_{\rm cbs}\, \mathbf n,$ with $\mathbf H_{\rm eff} = \mathbf W_{\rm cbs}\, \mathbf H$ (Feng et al., 2024). The design of the filter $h_c$ is classically achieved via FIR passband/stopband criteria in far-field (Vandermonde) settings; for near-field (arbitrary geometry), filter taps are optimized via convex QCQP formulations subject to passband and stopband constraints.

In STFT-domain dereverberation and denoising, stacked data blocks $\bar{\mathbf y}_t$ allow a single unified FIR filter to approximate both dereverberation (Weighted Prediction Error, WPE) and spatial beamforming (MPDR/MMSE) (Gode et al., 2021). This factorization also underpins online Kalman-APA methods (Braun et al., 2021) and switching spatial clustering extensions (Nakatani et al., 2021).

Convolutional neural approaches similarly operate on local, patch-wise regions or temporal blocks, learning nonlinearly optimized filters that directly absorb statistical or signal priors (Bodepudi et al., 2021, Nguyen et al., 2022, Gu et al., 2022).

3. Unified Linear and Nonlinear Beamforming Solutions

Linear Beamforming After Convolutional Preprocessing

Once signal dimension is reduced and interference rejected via CBS or unified convolutional frontends, classical linear beamforming (e.g., ZF, MMSE) is applied in the low-dimensional effective channel space:

CBS-ZF: $\mathbf W_{\rm zf} = (\mathbf H_{\rm eff}^H\mathbf H_{\rm eff})^{-1}\mathbf H_{\rm eff}^H$
CBS-MMSE: $\mathbf W_{\rm mmse} = (\mathbf H_{\rm eff}^H\mathbf H_{\rm eff}+\bar{\mathbf P}^{-1})^{-1}\mathbf H_{\rm eff}^H$ (Feng et al., 2024)

In unified dereverberation/beamforming, the MPDR/WPD filter is solved via constrained least squares (or its IRLS generalizations for sparse priors): $\bar{\mathbf h}_m^{(i)} = \frac{[\bar{\mathbf R}_y^{\,W^{(i)}}]^{-1} \bar{\mathbf v}_m}{\bar{\mathbf v}_m^H[\bar{\mathbf R}_y^{\,W^{(i)}}]^{-1}\bar{\mathbf v}_m}$ (Gode et al., 2021). Online updates leverage affine projection, reducing computational load to linear in filter dimension (Braun et al., 2021).

Nonlinear and Deep Learning Approaches

Patch-based U-Net transformations (Bodepudi et al., 2021) substitute lighter-weight CNN modules for explicit large-scale MMSE or MVDR apodization. Deep beamformers like DEFORM (Nguyen et al., 2022) use CNNs to estimate relative antenna phase and amplitude ratios, with their outputs injected into MRC combiners, eliminating the need for explicit channel estimation.

Unified all-neural architectures for speech separation couple mask estimation networks (frequency or time-domain) with parametric, time-varying neural beamforming modules that are optimized end-to-end via loss functions such as SI-SDR (Gu et al., 2022).

Switching convolutional beamformer frameworks (swCIVA (Nakatani et al., 2021)) extend classical CBF with adaptive clustering, enabling joint optimization across reverberation/denoising states.

4. Filter Design and Optimization in Far-Field and Near-Field Settings

In classical MU-MIMO far-field regimes, the Vandermonde condition supports FIR filter design via conventional passband/stopband strategies, yielding strong selectivity and sharp interference rejection. In near-field (where uniform spherical wave (USW) effects dominate and the channel matrix loses structure), filter taps must be optimized more generally: $\min_{\mathbf h, t}\; t \;\;\text{s.t.} \begin{cases} \|\mathbf s_i\| \le t & (r_i, \omega_i) \in \mathcal Q_b \ \|\mathbf s_i\| \ge \epsilon_1 & (r_i, \omega_i) \in \mathcal Q_p \ \|\mathbf s_i\| \le \epsilon_2 & (r_i, \omega_i) \in \mathcal Q_t \end{cases}$ with $\mathbf s_i$ a function of the tap vector (Feng et al., 2024). This QCQP formulation can be solved efficiently (e.g., CVX), yielding spatial filters compatible with arbitrary array and propagation geometry.

5. Computational Complexity, Performance, and Practical Gains

A central motivation for unified convolutional beamforming is complexity reduction without significant performance loss:

Complexity Scaling: Classical ZF/MMSE scales as $\mathcal O(K^2 N + K^3)$ ; with well-designed CBS preprocessing and $K_p \ll K$ , complexity reduces to $\mathcal O(K_p^2 (N-L+1) + K_p^3)$ (Feng et al., 2024).
Empirical Performance: CBS-MMSE attains near-optimal SINR with only a slight sum-rate degradation (few percent) and order-of-magnitude CPU savings. Patch-based U-Net approaches for imaging achieve contrast and edge fidelity on par with full MVDR but at $\sim$ 1.2 $\times$ the speed of DAS (Bodepudi et al., 2021). APA/Kalman solutions for STFT-domain dereverberation/denoising reduce cost by 20–100 $\times$ versus RLS-based approaches (Braun et al., 2021).
Robustness and Universality: DEFORM demonstrates 2–4 dB SNR improvements in both cable and over-the-air RF settings, even in NLOS, and massive PLR/PDR gains for LoRa and ZigBee relaying (Nguyen et al., 2022). All-neural time-domain beamformers exceed or closely match theoretical bounds of analytic MVDR/MCWF and generalize across domains and source types (Gu et al., 2022).

6. Integration with Machine Learning and End-to-End Neural Approaches

Unified convolutional frameworks naturally absorb neural modules in beamforming and spatial filtering:

Patch-based U-Net modules learn nonlinear local RF-patch mappings, integrated with standard DAS apodization (Bodepudi et al., 2021).
Deep beamforming networks output parametric, time-varying weights, supporting direct optimization from waveform (or STFT) input to SI-SDR criterion (Gu et al., 2022).
Convolutional neural frontends in RF tasks directly infer relative channel parameters for MRC, eschewing pilots and explicit model fitting (Nguyen et al., 2022).
Switching convolutional frameworks can leverage neural-derived masks for spatial guidance and initialization (Nakatani et al., 2021).

7. Applications, Limitations, and Extensions

Unified convolutional beamforming finds application across MIMO comms (massive MIMO, IUI suppression), robust speech enhancement (dereverberation, denoising), neural source separation, and medical imaging (ultrasound, MRI RF pipelines). It generalizes to handle arbitrary array geometries, far/near-field propagation, and time-varying, nonstationary scenes. Limitations may arise in extremely high reverberation or severe model mismatch (e.g., rapidly time-varying environments), where filter adaptation speed and spatial resolution constrain performance. Ongoing research includes the joint design of spatial and spectral neural features, adaptive and data-driven filter length selection, and model-free beamspace architectures (Feng et al., 2024, Bodepudi et al., 2021, Gu et al., 2022).