DeltaGRU: Energy-Optimized Digital Predistortion
- DeltaGRU is an energy-optimized recurrent neural network architecture that exploits temporal sparsity to reduce redundant computations in RF power amplifier digital predistortion.
- It integrates a temporal convolutional residual path that ensures accurate linearization while maintaining low computational and energy requirements in GHz-rate, memory-constrained systems.
- Its design features, including two-threshold gating and quantization-aware training, achieve competitive ACPR and EVM performance with significantly fewer active parameters.
DeltaGRU is an energy-optimized recurrent neural network architecture introduced in the context of neural network digital predistortion (NN-DPD) for wideband radio-frequency (RF) power amplifiers, with key innovations centered on temporal sparsity exploitation and hardware-efficient computation. In its primary application within the TRes-DeltaGRU digital predistortion (DPD) algorithm, it achieves competitive linearization performance (up to –59.4 dBc ACPR, –42.1 dB EVM) with substantial reductions in inference energy and parameter count, making it suitable for GHz-rate, memory-limited embedded systems (Wu et al., 9 Jul 2025).
1. Motivation and Foundations
Digital predistortion for RF power amplifiers must address nonlinearities and memory effects at MHz-to-GHz bandwidths. Conventional NN-based DPDs, often built on large GRUs or LSTMs, improve signal fidelity but carry high multiply-accumulate (MAC) and memory demands, and therefore consume significant energy in high-throughput digital back-ends.
DeltaGRU exploits observed temporal stability in both input features and hidden states—the fact that, sample-to-sample, only a fraction of signals and internal activations change significantly. By updating only the “significant” deltas (determined by per-element thresholds), many redundant computations can be skipped. This mechanism is augmented by a lightweight temporal convolutional (TCN) “residual” path (TRes) to maintain linearization accuracy even as recurrent-path computations are pruned.
The resulting architecture—TRes-DeltaGRU—combines compressed parameterization (≈1k parameters), dynamic temporal sparsity (50–80%), and quantization amenability for low-power fixed-point inference (Wu et al., 9 Jul 2025).
2. Network Architecture
The TRes-DeltaGRU block has four integral components:
- Input Feature Embedding: Each baseband sample index $t$ yields a feature vector $x_t$ comprising present/next-sample I/Q and amplitude terms for implicit memory.
- DeltaGRU Recurrent Core: Hidden size $H$ (typically 15). GRU state updates are driven exclusively by the componentwise input and hidden-state deltas whose magnitudes exceed the thresholds ($\Theta_x$, $\Theta_h$). Dense matrix-vector products become dense-matrix × sparse-vector products, so the computation per step falls with the number of inactive deltas.
- Temporal Convolutional Residual (TRes) Path: A two-layer dilated TCN with per-layer kernel sizes and dilation factors, Hardswish activations, and non-causal padding. This module learns short-term dependencies directly from the input sequence for output correction.
- Output Projection: The predistorted output sequence is computed as the sum of the projected recurrent output and the TRes output, allowing the TCN residual to decouple output fidelity from recurrent sparsity.
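The TRes path can be illustrated with a minimal single-channel sketch of a two-layer, non-causal dilated convolution with a Hardswish activation in between. The function names (`hardswish`, `dilated_conv1d_same`, `tres_path`) are hypothetical. The kernel taps and dilations are left as caller-supplied parameters because the source values are not reproduced here. Applying Hardswish only after the first layer is an assumption.

```python
def hardswish(x):
    """Hardswish activation: x * ReLU6(x + 3) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def dilated_conv1d_same(seq, kernel, dilation):
    """Non-causal, zero-padded dilated 1-D convolution over a list of floats.

    The taps are centred on the current sample, so the output at index t
    depends on past and future samples alike.
    """
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel keeps the window centred"
    half = (k // 2) * dilation
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - half + j * dilation
            if 0 <= idx < len(seq):  # samples outside the sequence count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out


def tres_path(seq, kernel1, dilation1, kernel2, dilation2):
    """Two dilated layers with Hardswish in between (single channel, assumed layout)."""
    hidden = [hardswish(v) for v in dilated_conv1d_same(seq, kernel1, dilation1)]
    return dilated_conv1d_same(hidden, kernel2, dilation2)
```

A real implementation would use multi-channel convolutions over the full feature vector. The single-channel form is kept here only to make the centred, dilated indexing explicit.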
3. DeltaGRU Mechanism and Mathematical Formulation
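DeltaGRU incorporates a two-threshold gating mechanism for both input and hidden deltas:
- Delta Tracking: For each input element $x_{i,t}$ at step $t$,
$$\Delta x_{i,t} = \begin{cases} x_{i,t} - \hat{x}_{i,t-1}, & |x_{i,t} - \hat{x}_{i,t-1}| > \Theta_x \\ 0, & \text{otherwise} \end{cases}$$
and the stored reference $\hat{x}_{i,t}$ is similarly updated (set to $x_{i,t}$) only when the threshold is exceeded. The same applies to the hidden state deltas ($\Delta h_{t-1}$, with threshold $\Theta_h$).
- Accumulation Registers: Rather than recomputing full GRU gates, pre-activation accumulators $M$ are incrementally updated with only the sparse deltas:
$$M_{r,t} = W_{xr}\,\Delta x_t + W_{hr}\,\Delta h_{t-1} + M_{r,t-1}$$
and analogous updates for $M_{u,t}$, $M_{xc,t}$, $M_{hc,t}$, with initial bias terms.
- Gate Activations and State Update:
$$r_t = \sigma(M_{r,t}), \quad u_t = \sigma(M_{u,t}), \quad c_t = \tanh(M_{xc,t} + r_t \odot M_{hc,t}), \quad h_t = (1 - u_t) \odot c_t + u_t \odot h_{t-1}$$
Only the subset of MACs (“active params”) corresponding to nonzero deltas are computed per step.
This approach enables dynamic adaptation to changing signal/activity patterns, yielding observed sparsity ($\Gamma$) of 50%–80%.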
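The delta mechanism described above can be sketched in plain Python. This is a minimal illustration that assumes the standard delta-network GRU formulation (thresholded deltas, pre-activation accumulators initialised with the biases). It is not the OpenDPDv2 implementation. The names `DeltaGRUCell`, `delta_encode`, and `accumulate`, and the parameter-dictionary keys, are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def delta_encode(cur, ref, theta):
    """Return thresholded deltas and the updated reference vector."""
    delta, new_ref = [], []
    for c, r in zip(cur, ref):
        d = c - r
        if abs(d) > theta:      # significant change: emit it and refresh the reference
            delta.append(d)
            new_ref.append(c)
        else:                   # insignificant change: emit zero, keep the old reference
            delta.append(0.0)
            new_ref.append(r)
    return delta, new_ref


def accumulate(M, W, delta):
    """In-place M += W @ delta, visiting only columns with a nonzero delta.

    Returns the number of multiply-accumulates actually performed.
    """
    macs = 0
    for j, d in enumerate(delta):
        if d != 0.0:
            for i in range(len(M)):
                M[i] += W[i][j] * d
            macs += len(M)
    return macs


class DeltaGRUCell:
    def __init__(self, p):
        # p: dict of weight matrices (lists of rows) and bias vectors (assumed layout)
        self.p = p
        n_in, n_h = len(p["W_xr"][0]), len(p["b_r"])
        self.x_ref = [0.0] * n_in
        self.h_ref = [0.0] * n_h
        self.h = [0.0] * n_h
        # Accumulators start at the biases, so the biases are never re-added.
        self.M_r, self.M_u = list(p["b_r"]), list(p["b_u"])
        self.M_xc, self.M_hc = list(p["b_c"]), [0.0] * n_h

    def step(self, x, theta_x=0.0, theta_h=0.0):
        p = self.p
        dx, self.x_ref = delta_encode(x, self.x_ref, theta_x)
        dh, self.h_ref = delta_encode(self.h, self.h_ref, theta_h)
        macs = accumulate(self.M_r, p["W_xr"], dx) + accumulate(self.M_r, p["W_hr"], dh)
        macs += accumulate(self.M_u, p["W_xu"], dx) + accumulate(self.M_u, p["W_hu"], dh)
        macs += accumulate(self.M_xc, p["W_xc"], dx)
        macs += accumulate(self.M_hc, p["W_hc"], dh)
        r = [sigmoid(m) for m in self.M_r]
        u = [sigmoid(m) for m in self.M_u]
        c = [math.tanh(a + ri * b) for a, ri, b in zip(self.M_xc, r, self.M_hc)]
        self.h = [(1.0 - ui) * ci + ui * hi for ui, ci, hi in zip(u, c, self.h)]
        return self.h, macs
```

With both thresholds at zero the cell tracks a dense GRU up to floating-point rounding. Raising the thresholds trades accuracy for fewer visited weight columns, and the returned MAC count makes that trade-off measurable.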
4. Training, Quantization, and Temporal Sparsity
Training Protocol
- Data: APA_200MHz TM3.1a 5×40 MHz 256-QAM OFDM, 98,304 samples (60% train / 20% val / 20% test).
- Model Cascade: A behavioral PA model (GRU) is trained first; the DPD model (TRes-DeltaGRU) is then trained in cascade with it to minimize the MSE against a linearly amplified target.
- Optimizer/Loss: AdamW with a ReduceLROnPlateau learning-rate schedule, MSE loss, batch size 64, 240 epochs, no explicit regularization beyond weight decay.
Quantization
- Quantization-Aware Training (QAT): The forward pass runs in low precision (e.g., W16A16, W12A12), while the backward pass keeps full-precision weight copies and applies a straight-through estimator (STE) across the rounding step.
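- Quantization formula:
$$Q(x) = s \cdot \mathrm{clip}\left(\mathrm{round}(x / s),\; -2^{b-1},\; 2^{b-1} - 1\right)$$
with per-layer learned scale $s$ (power-of-two), range $-2^{b-1}$ to $2^{b-1} - 1$ for bit-width $b$.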
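As an illustration of the quantization step, the following sketch performs a quantize-dequantize ("fake quantization") round trip on a signed fixed-point grid with a power-of-two scale. The function name `fake_quantize` and its arguments are hypothetical, and the symmetric signed range is an assumption. Only the forward pass is shown: under STE, the backward pass would treat the rounding as an identity.

```python
def fake_quantize(x, bits, log2_scale):
    """Quantize-dequantize x on a signed `bits`-bit grid with scale 2**log2_scale.

    Note: Python's round() rounds halves to the nearest even integer.
    """
    scale = 2.0 ** log2_scale
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    q = round(x / scale)
    q = max(qmin, min(qmax, q))  # saturate to the representable integer range
    return q * scale


# 12-bit grid with a step of 2**-10: in-range values snap to the grid,
# out-of-range values saturate at the ends of the range.
in_range = fake_quantize(0.3, bits=12, log2_scale=-10)
saturated = fake_quantize(100.0, bits=12, log2_scale=-10)
```

A power-of-two scale keeps rescaling to a bit shift in fixed-point hardware, which is one common reason for constraining the learned scale this way.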
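- Energy scaling: Arithmetic energy drops relative to FP32 by a factor that depends on bit-width (in the Gem5 figures of Section 5, a multiply costs 1.31 pJ in FP32 versus 0.21 pJ in INT12).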
Temporal Sparsity
- Thresholds: Adjustable independently for input ($\Theta_x$) and hidden ($\Theta_h$) deltas. Sweeping both thresholds yields temporal sparsity $\Gamma$ up to 80%. Typical trade-offs include:
- 0% sparsity (zero thresholds): 996 active params (full dense)
- 56% sparsity: ≈450 active params
- 72.5% sparsity: ≈288 active params
- Computation Reduction: Only the weight columns corresponding to deltas above threshold are included in MACs, reducing the per-step recurrent workload in proportion to the active fraction $1 - \Gamma$.
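To make the threshold-to-sparsity relationship concrete, the sketch below measures the fraction of suppressed deltas for a given threshold on a toy sequence. The function name `temporal_sparsity` and the toy data are hypothetical. The reference vector is assumed to start at zero, and only input deltas are considered (hidden-state deltas would be handled the same way).

```python
def temporal_sparsity(seq, theta):
    """Fraction of per-element deltas suppressed by threshold `theta`.

    `seq` is a list of equal-length feature vectors, one per time step.
    """
    ref = [0.0] * len(seq[0])
    zeros = total = 0
    for x in seq:
        for i, v in enumerate(x):
            if abs(v - ref[i]) > theta:
                ref[i] = v      # delta fires: the reference is refreshed
            else:
                zeros += 1      # delta suppressed: its weight column is skipped
            total += 1
    return zeros / total


toy = [[0.0], [0.05], [0.1], [0.5], [0.52]]
low = temporal_sparsity(toy, 0.0)    # only exact repeats are skipped
high = temporal_sparsity(toy, 0.08)  # small sample-to-sample changes are skipped too
```

Because a suppressed delta leaves the reference untouched, small changes accumulate until they cross the threshold, so slow drifts are eventually propagated rather than lost.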
5. Computational and Energy Efficiency
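TRes-DeltaGRU’s computational model partitions inference energy into arithmetic and memory-access contributions. With quantized integer operations and temporal sparsity $\Gamma$, operation counts scale with the fraction of active deltas, and the model uses a precision factor $\alpha$ = bit-width / 32.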
Gem5 ARMv7-A simulation yields:
- FP32: add 0.38 pJ, mul 1.31 pJ
- INT16: add 0.015 pJ, mul 0.37 pJ
- INT12: add 0.011 pJ, mul 0.21 pJ
- L1 D-cache 7.5 pJ, DDR4 1.3 nJ
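The per-operation figures above allow a rough arithmetic-only estimate. The sketch below assumes that one MAC costs one multiply plus one add and that sparsity removes MACs uniformly. Both are simplifications, and the function name `arithmetic_energy_pj` is hypothetical. Memory and control energy are deliberately excluded, so the result is not expected to reproduce end-to-end savings.

```python
# Per-operation energies in pJ, taken from the Gem5 figures listed above.
ENERGY_PJ = {
    "FP32": {"add": 0.38, "mul": 1.31},
    "INT16": {"add": 0.015, "mul": 0.37},
    "INT12": {"add": 0.011, "mul": 0.21},
}


def arithmetic_energy_pj(n_macs, precision, sparsity=0.0):
    """Arithmetic energy per inference step, assuming 1 MAC = 1 mul + 1 add."""
    e = ENERGY_PJ[precision]
    active_macs = n_macs * (1.0 - sparsity)  # assumed uniform skipping of MACs
    return active_macs * (e["add"] + e["mul"])


# Dense comparison for a 996-MAC step (one MAC per parameter is an assumption).
ratio_int12 = arithmetic_energy_pj(996, "FP32") / arithmetic_energy_pj(996, "INT12")
```

Under these assumptions the dense FP32-to-INT12 arithmetic ratio comes out at roughly 7.6×, well above the end-to-end reductions reported, which is consistent with memory and control energy taking a substantial share of the total.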
Observed savings:
- 2.8× energy reduction (INT12, 0% sparsity; ACPR –54.5 dBc)
- 5.2× energy reduction (INT12, 72.5% sparsity; ACPR better than –45 dBc, –46.9 dBc in the table of Section 6)
- 4.5× energy reduction (INT12, 56% sparsity; –50.3 dBc ACPR, –35.2 dB EVM)
6. Linearization Performance and Trade-offs
| DPD Model | Sparsity | #Active Params | Precision | ACPR (dBc) | EVM (dB) |
|---|---|---|---|---|---|
| TRes-ΔGRU (dense) | 0% | 996 | FP32 | –59.4 | –42.1 |
| TRes-ΔGRU (dense) | 0% | 996 | W16A16 | –58.8 | –41.2 |
| TRes-ΔGRU (dense) | 0% | 996 | W12A12 | –54.5 | –37.3 |
| TRes-ΔGRU (sparse) | 56% | 450 | FP32 | –52.9 | –35.7 |
| TRes-ΔGRU (sparse) | 56% | 450 | W16A16 | –53.2 | –39.3 |
| TRes-ΔGRU (sparse) | 56% | 450 | W12A12 | –50.3 | –35.2 |
| TRes-ΔGRU (sparse) | 72.5% | 288 | FP32 | –52.0 | –37.0 |
| TRes-ΔGRU (sparse) | 72.5% | 288 | W16A16 | –48.2 | –34.2 |
| TRes-ΔGRU (sparse) | 72.5% | 288 | W12A12 | –46.9 | –31.0 |
Dense TRes-DeltaGRU-996 achieves the best reported ACPR and EVM (–59.4 dBc and –42.1 dB in FP32) with only ≈1000 parameters. Notably, with 56% sparsity and INT12 (W12A12), it retains –50.3 dBc ACPR and –35.2 dB EVM, clearing the 3GPP ACPR limit of –45 dBc and the EVM limit of –30 dB by about 5 dB each.
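The margins against the limits quoted above can be checked directly from the table. The sketch below transcribes the nine rows and computes, for each, how far the measured ACPR and EVM sit below the –45 dBc and –30 dB limits. The names `RESULTS`, `margins`, and `passing` are hypothetical.

```python
ACPR_LIMIT_DBC = -45.0
EVM_LIMIT_DB = -30.0

# (sparsity %, precision, ACPR dBc, EVM dB), transcribed from the table above.
RESULTS = [
    (0.0, "FP32", -59.4, -42.1),
    (0.0, "W16A16", -58.8, -41.2),
    (0.0, "W12A12", -54.5, -37.3),
    (56.0, "FP32", -52.9, -35.7),
    (56.0, "W16A16", -53.2, -39.3),
    (56.0, "W12A12", -50.3, -35.2),
    (72.5, "FP32", -52.0, -37.0),
    (72.5, "W16A16", -48.2, -34.2),
    (72.5, "W12A12", -46.9, -31.0),
]


def margins(acpr_dbc, evm_db):
    """Return (ACPR margin, EVM margin) in dB; positive means below the limit."""
    return ACPR_LIMIT_DBC - acpr_dbc, EVM_LIMIT_DB - evm_db


def passing(rows):
    """Rows whose ACPR and EVM both meet their limits."""
    return [r for r in rows if min(margins(r[2], r[3])) >= 0.0]


# The tightest configuration is the one with the smallest worst-case margin.
tightest = min(RESULTS, key=lambda r: min(margins(r[2], r[3])))
```

Every row in the table meets both limits. The tightest case is the 72.5%-sparsity W12A12 configuration, where the EVM margin shrinks to about 1 dB.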
7. Implementation Considerations and Extensions
OpenDPDv2 provides a PyTorch-based end-to-end implementation, with export options to C/C++ for embedded deployment. Gem5-based, cycle-accurate ARM simulations yield realistic workload and memory assessments; custom ASICs could further minimize control overhead by exploiting delta accumulators and sparse MAC arrays. At high sparsity, CPU platform benefits are currently limited by instruction-cache energy rather than arithmetic energy.
Prospective extensions include:
- Mixed-precision, asynchronous adaptation for online learning;
- Deeper or alternate TCN/attention residuals to further offset recurrent sparsity;
- Application to multi-antenna MIMO DPD.
TRes-DeltaGRU exemplifies a unified approach combining (i) temporal-delta gating, (ii) residual TCN correction, and (iii) quantization-aware training, enabling high-performance, low-power digital predistortion suitable for modern embedded RF systems (Wu et al., 9 Jul 2025).