Parallel-EfficientNet-CBAM-LSTM (PECL) Network
- The paper introduces PECL, a radar-based activity recognition network that utilizes parallel EfficientNet backbones with CBAM attention and LSTM modules to capture temporal dynamics.
- It processes Range-Time, Doppler-Time, and Range-Doppler spectrograms in separate branches, ensuring complementary feature extraction and improved action discrimination.
- Experimental validation on FMCW radar data yields 96.16% accuracy, outperforming CNN and transformer models while maintaining a moderate computational footprint.
The Parallel-EfficientNet-CBAM-LSTM (PECL) network is a heterogeneous parallel multi-domain architecture developed for high-precision, radar-based human activity recognition. By integrating EfficientNet-B0 backbones augmented with Convolutional Block Attention Modules (CBAM) and sequence modeling via LSTM units, PECL exploits the complementary dynamics of Range-Time, Doppler-Time, and Range-Doppler spectrogram representations. The design is motivated by the need to enhance discriminability among visually similar actions and to robustly capture temporal dependencies in radar micro-Doppler data, all while maintaining moderate parameter and computation budgets (Yan et al., 7 Nov 2025).
1. Network Architecture
PECL features three parallel branches, each employing an EfficientNet-B0 backbone with its original Squeeze-and-Excitation (SE) modules replaced by CBAM blocks (Woo et al., 2018). Each branch processes a distinct spectrogram domain:
- Range–Time (RT) branch: Handles time-ordered range profiles, mapping (slow time, fast time) radar data to sequences representing target movement over several chirps.
- Doppler–Time (DT) branch: Processes time-varying Doppler profiles, capturing velocity fluctuations within a selected range bin window.
- Range–Doppler (RD) branch: Models the spatial structure of motions by encoding 2D range-Doppler images, typically lacking a direct time sequence.
Each branch starts with a 3×3 convolutional stem, followed by a stack of CBAM-enhanced MBConv blocks, culminating in a 1,280-channel pointwise convolution.
For the RT and DT streams, the resulting feature map $X \in \mathbb{R}^{C \times H \times W}$ (with $C = 1280$) is reshaped to a temporal sequence $\{x_t\}_{t=1}^{T}$, where $T$ (temporal steps) follows the time axis of the map and $F = C \cdot H$ (features per step). This is fed into a single-layer LSTM (hidden size 128), producing stream descriptors $h_{RT}$ and $h_{DT} \in \mathbb{R}^{128}$.
The RD branch, lacking an explicit sequential axis, flattens its spatial feature map and projects it through a fully connected layer to 8,960 dimensions ($8{,}960 = 70 \times 128$), followed by max-pooling across the pseudo-time dimension, yielding $h_{RD} \in \mathbb{R}^{128}$.
The outputs (each 128-dim) are concatenated and regularized with dropout (p=0.2) to form a 384-dim feature for classification into six human actions via a final linear-softmax layer. The overall architecture is visualized in Figure 1 of (Yan et al., 7 Nov 2025).
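The branch heads and fusion stage above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' code: the class names (`RDHead`, `PECLFusion`) and the backbone output shape (1280×7×7) are assumptions.

```python
# Hypothetical sketch of the RD projection path and fusion head described
# above. Class names and the 1280x7x7 backbone output are assumptions.
import torch
import torch.nn as nn

class RDHead(nn.Module):
    """Range-Doppler head: flatten -> FC to 8,960 -> reshape to a
    (70, 128) pseudo-time sequence -> max-pool over pseudo-time."""
    def __init__(self, in_features: int):
        super().__init__()
        self.fc = nn.Linear(in_features, 8960)   # 8960 = 70 * 128

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.fc(feat.flatten(1))             # (B, 8960)
        x = x.view(-1, 70, 128)                  # (B, T_pseudo=70, 128)
        return x.max(dim=1).values               # (B, 128)

class PECLFusion(nn.Module):
    """Concatenate the three 128-dim stream descriptors, apply dropout,
    and classify into six activities."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.dropout = nn.Dropout(p=0.2)
        self.classifier = nn.Linear(3 * 128, num_classes)

    def forward(self, h_rt, h_dt, h_rd):
        z = torch.cat([h_rt, h_dt, h_rd], dim=1)   # (B, 384)
        return self.classifier(self.dropout(z))    # (B, num_classes) logits

# Shape check with dummy stream outputs
head = RDHead(in_features=1280 * 7 * 7)
h_rd = head(torch.randn(2, 1280, 7, 7))
logits = PECLFusion()(torch.randn(2, 128), torch.randn(2, 128), h_rd)
```

The max over the pseudo-time axis mirrors the paper's finding (Section 5) that max-pooling, not average-pooling, is the appropriate reduction for this branch.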
2. Core Mathematical Components
CBAM Attention Integration
CBAM is a lightweight two-stage attention block augmenting each MBConv unit:
- Channel Attention: For a feature map $X \in \mathbb{R}^{C \times H \times W}$, channel weights are computed as
$$M_c(X) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X))\big),$$
with a shared two-layer MLP bottleneck (hidden size $C/r$ for reduction ratio $r$) and $\sigma$ the sigmoid.
- Spatial Attention: Channel-refined features $X' = M_c(X) \otimes X$ undergo spatial gating:
$$M_s(X') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}_c(X');\, \mathrm{MaxPool}_c(X')])\big),$$
where pooling is taken along the channel axis and $f^{7 \times 7}$ is a convolution.
Attention is applied sequentially as $X'' = M_s(X') \otimes X'$, with $\otimes$ denoting element-wise multiplication.
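The two-stage gating above can be written as a compact PyTorch module, following the CBAM formulation of Woo et al. (2018); the reduction ratio $r = 16$ is the CBAM default, assumed here rather than taken from the PECL paper.

```python
# A minimal CBAM block: channel attention (shared MLP over avg- and
# max-pooled descriptors), then spatial attention (7x7 conv over
# channel-wise avg/max maps). r=16 is the CBAM-paper default, assumed here.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: M_c = sigmoid(MLP(avgpool) + MLP(maxpool))
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: M_s = sigmoid(conv7x7([avg_c; max_c]))
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

out = CBAM(32)(torch.randn(2, 32, 16, 16))
```

In PECL this block replaces the SE module inside each MBConv unit of the EfficientNet-B0 backbones.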
Spectrogram Generation
- Range-Time Map: Derived by a windowed FFT along fast time, followed by high-pass (MTI) filtering along slow time to remove stationary clutter, with the magnitude mapped logarithmically:
$$RT_{\mathrm{dB}}(r, t) = 20 \log_{10}\!\big(|RT_{\mathrm{MTI}}(r, t)| + \varepsilon\big).$$
- Doppler-Time Map: Computed using ASTFT over slow time for selected range bins.
- Range-Doppler Map: An FFT along the slow-time axis of the MTI-filtered range-time data generates $RD(r, f_d)$.
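The map-generation pipeline can be sketched in NumPy. This is a hedged approximation: the first-difference MTI filter is one common high-pass choice and is assumed here, as are the window type and array sizes.

```python
# Hedged NumPy sketch of the spectrogram pipeline: FFT along fast time
# gives range profiles; a first-difference MTI filter along slow time
# (an assumed, common high-pass choice) removes stationary clutter; an
# FFT along slow time then yields the range-Doppler map.
import numpy as np

def radar_maps(chirps: np.ndarray, eps: float = 1e-6):
    """chirps: (slow_time, fast_time) raw beat-signal matrix."""
    # Range-time: windowed FFT along fast time
    win = np.hanning(chirps.shape[1])
    rt = np.fft.fft(chirps * win, axis=1)            # (slow, range)
    # MTI high-pass: first difference along slow time removes static clutter
    rt_mti = np.diff(rt, axis=0)                     # (slow-1, range)
    rt_db = 20 * np.log10(np.abs(rt_mti) + eps)      # log-magnitude RT map
    # Range-Doppler: FFT along slow time of the MTI output, zero-centered
    rd = np.fft.fftshift(np.fft.fft(rt_mti, axis=0), axes=0)
    rd_db = 20 * np.log10(np.abs(rd) + eps)
    return rt_db, rd_db

rt_db, rd_db = radar_maps(np.random.randn(128, 256))
```

The Doppler-time map would follow analogously by sliding an STFT window over slow time within selected range bins.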
LSTM Modeling
RT and DT features are modeled by an LSTM with the standard cell equations:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t).$$
Fusion is via concatenation, $z = [h_{RT}; h_{DT}; h_{RD}] \in \mathbb{R}^{384}$, and classification is
$$\hat{y} = \mathrm{softmax}(W z + b),$$
where $W \in \mathbb{R}^{6 \times 384}$.
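A minimal PyTorch sketch of the temporal head for an RT or DT stream, under the assumptions above: the backbone's $(C, H, W)$ feature map is unrolled into $W$ time steps of $C \cdot H$ features, and the LSTM's final hidden state is the 128-dim stream descriptor. The 1280×7×7 input shape is illustrative.

```python
# Sketch of RT/DT temporal modeling: reshape the (C, H, W) backbone map
# into W time steps of C*H features, run a single-layer LSTM, and take
# the final hidden state. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_per_step: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_per_step, hidden, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        seq = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (B, T=W, C*H)
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                                       # (B, hidden)

h_rt = TemporalHead(1280 * 7)(torch.randn(2, 1280, 7, 7))
```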
3. Feature Fusion and Temporal Processing
The tri-domain fusion strategy is central to PECL’s efficacy. RT and DT streams explicitly model temporal structure inherent in radar signatures using LSTM units. For those, feature flattening and temporal sequencing ($T$ frames per sample) ensure that subtle action dynamics (e.g., arm swings or micro-motions) are preserved.
The RD branch eschews LSTM modeling, as its 2D composition lacks a native time axis. Instead, a max-pooling operator across frequency bins captures the most salient frequency-spatial cues.
Fusion is implemented by simple concatenation of the three 128-dim vectors from each branch, preserving independent domain-specific information before joint prediction. Dropout is applied to the fused vector to mitigate overfitting before the final fully connected classifier.
4. Model Complexity and Implementation
PECL achieves a parameter count of 23.42M and a computational load of 1,324.82M FLOPs. This budget breaks down as:
- EfficientNet-B0 backbone per branch: ≈5.29M parameters
- CBAM overhead: ≈0.1M parameters per branch, with negligible added computation of ~0.03 GFLOPs per block (Woo et al., 2018)
- LSTM layers: ≈0.07M per branch (RT, DT only)
- RD max-pool path: fully connected to 8,960 dims, then pool and reduce to 128-dim
- Final fusion and linear classifier
Hyperparameters align with backbone conventions: a channel-attention bottleneck with reduction ratio $r$, a CBAM spatial-convolution kernel ($7 \times 7$ in the original CBAM design), He initialization, dropout, the Adam optimizer with its learning rate decayed every 30 epochs over 300 epochs, and batch normalization. Pretrained EfficientNet weights are reused where possible. Training uses cross-entropy loss and data-augmented spectrograms with additive Gaussian noise tailored to regional power.
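The optimization recipe can be sketched as follows. The initial learning rate (1e-3), decay factor (0.1), and noise scale are placeholder assumptions, not values from the paper; the stand-in model keeps the example self-contained.

```python
# Hedged training-setup sketch: Adam with a step decay every 30 epochs
# over 300 epochs, cross-entropy loss, and power-scaled Gaussian noise
# augmentation. lr=1e-3, gamma=0.1, and snr_scale are assumed values.
import torch
import torch.nn as nn

model = nn.Linear(384, 6)                        # stand-in for PECL
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.CrossEntropyLoss()

def augment(spec: torch.Tensor, snr_scale: float = 0.05) -> torch.Tensor:
    """Add Gaussian noise scaled to the input's own power."""
    noise = torch.randn_like(spec) * spec.std() * snr_scale
    return spec + noise

for epoch in range(2):                           # 300 epochs in the paper
    x, y = augment(torch.randn(8, 384)), torch.randint(0, 6, (8,))
    loss = criterion(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                             # decay every step_size epochs
```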
5. Experimental Validation
Tests were conducted on the University of Glasgow FMCW radar dataset (5.8 GHz carrier, 400 MHz bandwidth, six classes). PECL yields 96.16% overall accuracy, outperforming competing baselines (classical CNNs, transformer-based models) by ≥4.78% and using orders-of-magnitude fewer FLOPs than transformer alternatives. Performance is uniform across classes: walking, sitting down, standing up, and falling each exceed 98.6% accuracy. Notably, PECL achieves high discrimination on the most confused pair—‘Pick Up’ vs. ‘Drink’—with a ~10–15% reduction in mutual misclassification.
Ablation analysis reveals:
- Parallel multi-domain fusion is superior to single-branch approaches (max gain +17.54% over weakest mono-domain).
- CBAM modules consistently outperform SE attention (+1% per branch), and sequential channel→spatial attention outperforms other arrangements (Woo et al., 2018).
- LSTM increases temporal discrimination in RT and DT; its inclusion in RD is detrimental.
- Max-pooling in the RD branch’s projection pathway is essential (avg-pool degrades accuracy to 78.98%).
6. Impact and Methodological Distinctions
PECL establishes that spatial-temporal attention and heterogeneous, domain-specific parallelism are critical for robust radar-based activity recognition in privacy-sensitive settings. Its architecture ingests complementary views of spatiotemporal radar signatures simultaneously, applies lightweight CBAM attention to weight both channel (“what”) and spatial (“where”) information, and exploits temporal dynamics wherever a time axis is available.
Key features distinguishing PECL include:
- Joint channel-spatial CBAM attention per branch, consistently surpassing SE-only or no-attention designs (Woo et al., 2018)
- Explicit temporal modeling using LSTM units in streams with time ordering, and explicit avoidance of LSTM in spatial-only streams, matching the ablation findings
- Efficient resource usage: moderate parameter count and FLOPs
- Universality: the design can generalize to other multi-modal, multi-domain sensor fusion problems where modality-complementary temporal dependencies must be robustly and efficiently encoded
A plausible implication is that extending such heterogeneous, attention-enhanced, parallel architectures may benefit a wider set of sequence-based sensor recognition tasks, provided that fusion, attention, and temporal modeling strategies are adapted to each domain's structural idiosyncrasies.