Parallel-EfficientNet-CBAM-LSTM (PECL) Network
- The paper introduces PECL, a radar-based activity recognition network that utilizes parallel EfficientNet backbones with CBAM attention and LSTM modules to capture temporal dynamics.
- It processes Range-Time, Doppler-Time, and Range-Doppler spectrograms in separate branches, ensuring complementary feature extraction and improved action discrimination.
- Experimental validation on FMCW radar data yields 96.16% accuracy, outperforming CNN and transformer models while maintaining a moderate computational footprint.
The Parallel-EfficientNet-CBAM-LSTM (PECL) network is a heterogeneous parallel multi-domain architecture developed for high-precision, radar-based human activity recognition. By integrating EfficientNet-B0 backbones augmented with Convolutional Block Attention Modules (CBAM) and sequence modeling via LSTM units, PECL exploits the complementary dynamics of Range-Time, Doppler-Time, and Range-Doppler spectrogram representations. The design is motivated by the need to enhance discriminability among visually similar actions and to robustly capture temporal dependencies in radar micro-Doppler data, all while maintaining moderate parameter and computation budgets (Yan et al., 7 Nov 2025).
1. Network Architecture
PECL features three parallel branches, each employing an EfficientNet-B0 backbone with its original Squeeze-and-Excitation (SE) modules replaced by CBAM blocks (Woo et al., 2018). Each branch processes a distinct spectrogram domain:
- Range–Time (RT) branch: Handles time-ordered range profiles, mapping (slow time, fast time) radar data to sequences representing target movement over several chirps.
- Doppler–Time (DT) branch: Processes time-varying Doppler profiles, capturing velocity fluctuations within a selected range bin window.
- Range–Doppler (RD) branch: Models the spatial structure of motions by encoding 2D range-Doppler images, typically lacking a direct time sequence.
Each branch starts with a 3×3 convolutional stem, followed by a stack of CBAM-enhanced MBConv blocks, culminating in a 1,280-channel pointwise convolution.
For the RT and DT streams, the resulting feature map $X \in \mathbb{R}^{C \times H \times W}$ (with $C = 1280$) is reshaped to a temporal sequence $\{x_t\}_{t=1}^{T}$, where $T$ (temporal steps) follows the time axis of the map and $F = C \cdot H$ (features per step). This is fed into a single-layer LSTM (hidden size 128), producing stream descriptors $h_{RT}$ and $h_{DT} \in \mathbb{R}^{128}$.
The RD branch, lacking an explicit sequential axis, flattens its spatial feature map and projects it through a fully connected layer to 8,960 dimensions ($8{,}960 = 70 \times 128$), followed by max-pooling across the pseudo-time dimension, yielding $h_{RD} \in \mathbb{R}^{128}$.
The outputs (each 128-dim) are concatenated and regularized with dropout (p=0.2) to form a 384-dim feature for classification into six human actions via a final linear-softmax layer. The overall architecture is visualized in Figure 1 of (Yan et al., 7 Nov 2025).
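The branch heads and fusion stage above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' code: the class names (`RDHead`, `PECLFusion`) and the backbone output shape (1280×7×7) are assumptions.

```python
# Hypothetical sketch of the RD projection path and fusion head described
# above. Class names and the 1280x7x7 backbone output are assumptions.
import torch
import torch.nn as nn

class RDHead(nn.Module):
    """Range-Doppler head: flatten -> FC to 8,960 -> reshape to a
    (70, 128) pseudo-time sequence -> max-pool over pseudo-time."""
    def __init__(self, in_features: int):
        super().__init__()
        self.fc = nn.Linear(in_features, 8960)   # 8960 = 70 * 128

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.fc(feat.flatten(1))             # (B, 8960)
        x = x.view(-1, 70, 128)                  # (B, T_pseudo=70, 128)
        return x.max(dim=1).values               # (B, 128)

class PECLFusion(nn.Module):
    """Concatenate the three 128-dim stream descriptors, apply dropout,
    and classify into six activities."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.dropout = nn.Dropout(p=0.2)
        self.classifier = nn.Linear(3 * 128, num_classes)

    def forward(self, h_rt, h_dt, h_rd):
        z = torch.cat([h_rt, h_dt, h_rd], dim=1)   # (B, 384)
        return self.classifier(self.dropout(z))    # (B, num_classes) logits

# Shape check with dummy stream outputs
head = RDHead(in_features=1280 * 7 * 7)
h_rd = head(torch.randn(2, 1280, 7, 7))
logits = PECLFusion()(torch.randn(2, 128), torch.randn(2, 128), h_rd)
```

The max over the pseudo-time axis mirrors the paper's finding (Section 5) that max-pooling, not average-pooling, is the appropriate reduction for this branch.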
2. Core Mathematical Components
CBAM Attention Integration
CBAM is a lightweight two-stage attention block augmenting each MBConv unit:
- Channel Attention: For a feature map $X \in \mathbb{R}^{C \times H \times W}$, channel weights are computed as
$$M_c(X) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X))\big),$$
with a shared two-layer MLP bottleneck (hidden size $C/r$ for reduction ratio $r$) and $\sigma$ the sigmoid.
- Spatial Attention: Channel-refined features $X' = M_c(X) \otimes X$ undergo spatial gating:
$$M_s(X') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}_c(X');\, \mathrm{MaxPool}_c(X')])\big),$$
where pooling is taken along the channel axis and $f^{7 \times 7}$ is a convolution.
Attention is applied sequentially as $X'' = M_s(X') \otimes X'$, with $\otimes$ denoting element-wise multiplication.
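The two-stage gating above can be written as a compact PyTorch module, following the CBAM formulation of Woo et al. (2018); the reduction ratio $r = 16$ is the CBAM default, assumed here rather than taken from the PECL paper.

```python
# A minimal CBAM block: channel attention (shared MLP over avg- and
# max-pooled descriptors), then spatial attention (7x7 conv over
# channel-wise avg/max maps). r=16 is the CBAM-paper default, assumed here.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: M_c = sigmoid(MLP(avgpool) + MLP(maxpool))
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: M_s = sigmoid(conv7x7([avg_c; max_c]))
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

out = CBAM(32)(torch.randn(2, 32, 16, 16))
```

In PECL this block replaces the SE module inside each MBConv unit of the EfficientNet-B0 backbones.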
Spectrogram Generation
- Range-Time Map: Derived by a windowed FFT along fast time, followed by high-pass (MTI) filtering along slow time to remove stationary clutter, with the magnitude mapped logarithmically:
$$RT_{\mathrm{dB}}(r, t) = 20 \log_{10}\!\big(|RT_{\mathrm{MTI}}(r, t)| + \varepsilon\big).$$
- Doppler-Time Map: Computed using ASTFT over slow time for selected range bins.
- Range-Doppler Map: An FFT along the slow-time axis of the MTI-filtered range-time data generates $RD(r, f_d)$.
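The map-generation pipeline can be sketched in NumPy. This is a hedged approximation: the first-difference MTI filter is one common high-pass choice and is assumed here, as are the window type and array sizes.

```python
# Hedged NumPy sketch of the spectrogram pipeline: FFT along fast time
# gives range profiles; a first-difference MTI filter along slow time
# (an assumed, common high-pass choice) removes stationary clutter; an
# FFT along slow time then yields the range-Doppler map.
import numpy as np

def radar_maps(chirps: np.ndarray, eps: float = 1e-6):
    """chirps: (slow_time, fast_time) raw beat-signal matrix."""
    # Range-time: windowed FFT along fast time
    win = np.hanning(chirps.shape[1])
    rt = np.fft.fft(chirps * win, axis=1)            # (slow, range)
    # MTI high-pass: first difference along slow time removes static clutter
    rt_mti = np.diff(rt, axis=0)                     # (slow-1, range)
    rt_db = 20 * np.log10(np.abs(rt_mti) + eps)      # log-magnitude RT map
    # Range-Doppler: FFT along slow time of the MTI output, zero-centered
    rd = np.fft.fftshift(np.fft.fft(rt_mti, axis=0), axes=0)
    rd_db = 20 * np.log10(np.abs(rd) + eps)
    return rt_db, rd_db

rt_db, rd_db = radar_maps(np.random.randn(128, 256))
```

The Doppler-time map would follow analogously by sliding an STFT window over slow time within selected range bins.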
LSTM Modeling
RT and DT features are modeled by an LSTM with the standard cell equations:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t).$$
Fusion is via concatenation, $z = [h_{RT}; h_{DT}; h_{RD}] \in \mathbb{R}^{384}$, and classification is
$$\hat{y} = \mathrm{softmax}(W z + b),$$
where $W \in \mathbb{R}^{6 \times 384}$.
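A minimal PyTorch sketch of the temporal head for an RT or DT stream, under the assumptions above: the backbone's $(C, H, W)$ feature map is unrolled into $W$ time steps of $C \cdot H$ features, and the LSTM's final hidden state is the 128-dim stream descriptor. The 1280×7×7 input shape is illustrative.

```python
# Sketch of RT/DT temporal modeling: reshape the (C, H, W) backbone map
# into W time steps of C*H features, run a single-layer LSTM, and take
# the final hidden state. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_per_step: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_per_step, hidden, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        seq = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (B, T=W, C*H)
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                                       # (B, hidden)

h_rt = TemporalHead(1280 * 7)(torch.randn(2, 1280, 7, 7))
```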
3. Feature Fusion and Temporal Processing
The tri-domain fusion strategy is central to PECL’s efficacy. RT and DT streams explicitly model temporal structure inherent in radar signatures using LSTM units. For those, feature flattening and temporal sequencing ($T$ frames per sample) ensure that subtle action dynamics (e.g., arm swings or micro-motions) are preserved.
The RD branch eschews LSTM modeling, as its 2D composition lacks a native time axis. Instead, a max-pooling operator across frequency bins captures the most salient frequency-spatial cues.
Fusion is implemented by simple concatenation of the three 128-dim vectors from each branch, preserving independent domain-specific information before joint prediction. Dropout is applied to the fused vector to mitigate overfitting before the final fully connected classifier.
4. Model Complexity and Implementation
PECL achieves a parameter count of 23.42M and a computational load of 1,324.82M FLOPs. This budget breaks down as:
- EfficientNet-B0 backbone per branch: ≈5.29M parameters
- CBAM overhead: ≈0.1M parameters per branch, with negligible added computation of ~0.03 GFLOPs per block (Woo et al., 2018)
- LSTM layers: ≈0.07M per branch (RT, DT only)
- RD max-pool path: fully connected to 8,960 dims, then pool and reduce to 128-dim
- Final fusion and linear classifier
Hyperparameters align with backbone conventions: a channel-attention bottleneck with reduction ratio $r$, a CBAM spatial-convolution kernel ($7 \times 7$ in the original CBAM design), He initialization, dropout, the Adam optimizer with its learning rate decayed every 30 epochs over 300 epochs, and batch normalization. Pretrained EfficientNet weights are reused where possible. Training uses cross-entropy loss and data-augmented spectrograms with additive Gaussian noise tailored to regional power.
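The optimization recipe can be sketched as follows. The initial learning rate (1e-3), decay factor (0.1), and noise scale are placeholder assumptions, not values from the paper; the stand-in model keeps the example self-contained.

```python
# Hedged training-setup sketch: Adam with a step decay every 30 epochs
# over 300 epochs, cross-entropy loss, and power-scaled Gaussian noise
# augmentation. lr=1e-3, gamma=0.1, and snr_scale are assumed values.
import torch
import torch.nn as nn

model = nn.Linear(384, 6)                        # stand-in for PECL
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.CrossEntropyLoss()

def augment(spec: torch.Tensor, snr_scale: float = 0.05) -> torch.Tensor:
    """Add Gaussian noise scaled to the input's own power."""
    noise = torch.randn_like(spec) * spec.std() * snr_scale
    return spec + noise

for epoch in range(2):                           # 300 epochs in the paper
    x, y = augment(torch.randn(8, 384)), torch.randint(0, 6, (8,))
    loss = criterion(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                             # decay every step_size epochs
```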
5. Experimental Validation
Tests were conducted on the University of Glasgow FMCW radar dataset (5.8 GHz carrier, 400 MHz bandwidth, six classes). PECL yields 96.16% overall accuracy, outperforming competing baselines (classical CNNs, transformer-based models) by ≥4.78% and using orders-of-magnitude fewer FLOPs than transformer alternatives. Performance is uniform across classes: walking, sitting down, standing up, and falling each exceed 98.6% accuracy. Notably, PECL achieves high discrimination on the most confused pair—‘Pick Up’ vs. ‘Drink’—with a ~10–15% reduction in mutual misclassification.
Ablation analysis reveals:
- Parallel multi-domain fusion is superior to single-branch approaches (max gain +17.54% over weakest mono-domain).
- CBAM modules consistently outperform SE attention (+1% per branch), and sequential channel→spatial attention outperforms other arrangements (Woo et al., 2018).
- LSTM increases temporal discrimination in RT and DT; its inclusion in RD is detrimental.
- Max-pooling in the RD branch’s projection pathway is essential (avg-pool degrades accuracy to 78.98%).
6. Impact and Methodological Distinctions
PECL establishes that spatial-temporal attention and heterogeneous, domain-specific parallelism are critical for robust radar-based activity recognition in privacy-sensitive settings. Its architecture ingests complementary views of spatiotemporal radar signatures simultaneously, applies lightweight CBAM attention to weight both channel (“what”) and spatial (“where”) information, and exploits temporal dynamics wherever a time axis is available.
Key features distinguishing PECL include:
- Joint channel-spatial CBAM attention per branch, consistently surpassing SE-only or no-attention designs (Woo et al., 2018)
- Explicit temporal modeling using LSTM units in streams with time ordering, and explicit avoidance of LSTM in spatial-only streams, matching the ablation findings
- Efficient resource usage: moderate parameter count and FLOPs
- Universality: the design can generalize to other multi-modal, multi-domain sensor fusion problems where modality-complementary temporal dependencies must be robustly and efficiently encoded
A plausible implication is that extending such heterogeneous, attention-enhanced, parallel architectures may benefit a wider set of sequence-based sensor recognition tasks, provided that fusion, attention, and temporal modeling strategies are adapted to each domain's structural idiosyncrasies.