DeltaGRU: Energy-Optimized Digital Predistortion

Updated 24 May 2026

DeltaGRU is an energy-optimized recurrent neural network architecture that exploits temporal sparsity to reduce redundant computations in RF power amplifier digital predistortion.
It integrates a temporal convolutional residual path that ensures accurate linearization while maintaining low computational and energy requirements in GHz-rate, memory-constrained systems.
Its design features, including two-threshold gating and quantization-aware training, achieve competitive ACPR and EVM performance with significantly fewer active parameters.

DeltaGRU is an energy-optimized recurrent neural network architecture introduced in the context of neural network digital predistortion (NN-DPD) for wideband radio-frequency (RF) power amplifiers, with key innovations centered on temporal sparsity exploitation and hardware-efficient computation. In its primary application within the TRes-DeltaGRU digital predistortion (DPD) algorithm, it achieves competitive linearization performance (up to –59.4 dBc ACPR, –42.1 dB EVM) with substantial reductions in inference energy and parameter count, making it suitable for GHz-rate, memory-limited embedded systems (Wu et al., 9 Jul 2025).

1. Motivation and Foundations

Digital predistortion for RF power amplifiers must address nonlinearities and memory effects at MHz-to-GHz bandwidths. Conventional NN-based DPDs, often using large GRUs or LSTMs, improve signal fidelity but suffer from high MAC and memory demands, incurring significant energy consumption in high-throughput digital back-ends.

DeltaGRU exploits observed temporal stability in both input features and hidden states—the fact that, sample-to-sample, only a fraction of signals and internal activations change significantly. By updating only the “significant” deltas (determined by per-element thresholds), many redundant computations can be skipped. This mechanism is augmented by a lightweight temporal convolutional (TCN) “residual” path (TRes) to maintain linearization accuracy even as recurrent-path computations are pruned.

The resulting architecture—TRes-DeltaGRU—combines compressed parameterization (≈1k parameters), dynamic temporal sparsity (50–80%), and quantization amenability for low-power fixed-point inference (Wu et al., 9 Jul 2025).

2. Network Architecture

The TRes-DeltaGRU block has four integral components:

Input Feature Embedding: Each baseband sample index $t$ yields the vector

$\boldsymbol{\phi}_t = [I_{x_t}, Q_{x_t}, I_{x_{t+1}}, Q_{x_{t+1}}, |x_t|, |x_t|^3]^\top \in \mathbb{R}^6$

comprising present/next-sample I/Q and amplitude terms for implicit memory.

DeltaGRU Recurrent Core: Hidden size $H$ (typically 15). GRU state updates are based exclusively on componentwise input and hidden deltas exceeding thresholds ( $\Theta_\phi, \Theta_h$ ). Dense matrix-vector products become dense-matrix × sparse-vector, dramatically reducing computation according to per-step activity.
Temporal Convolutional Residual (TRes) Path: A two-layer dilated TCN, with kernel sizes $K=3$ and $K=1$ , dilation factors $d=16$ and $d=0$ , Hardswish activations, and non-causal padding. This module learns short-term dependencies directly from the input sequence for output correction.
Output Projection: The predistorted output sequence is computed as

$\hat{\mathbf{u}}_t = W_{\hat{y}} h_t + b_{\hat{y}}, \qquad \mathbf{u}_t = \hat{\mathbf{u}}_t + \mathrm{TCN}(\mathbf{X})_t$

allowing the TCN residual to decouple output fidelity from recurrent sparsity.
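
As a minimal sketch of the input feature embedding defined above, the following plain-Python function builds $\boldsymbol{\phi}_t$ from a list of complex baseband samples. The function name `embed_features` and the zero-padding of the final sample (which has no successor) are assumptions made for this illustration, not details taken from the source.

```python
def embed_features(x):
    """Build phi_t = [I_t, Q_t, I_{t+1}, Q_{t+1}, |x_t|, |x_t|^3] for each t.

    `x` is a sequence of complex baseband samples. The last sample has no
    successor, so it is paired with 0j here (an assumption of this sketch).
    """
    feats = []
    for t, xt in enumerate(x):
        nxt = x[t + 1] if t + 1 < len(x) else 0j
        mag = abs(xt)
        feats.append([xt.real, xt.imag, nxt.real, nxt.imag, mag, mag ** 3])
    return feats


samples = [3 + 4j, 1 - 1j, 0.5j]
phi = embed_features(samples)
print(phi[0])  # [3.0, 4.0, 1.0, -1.0, 5.0, 125.0]
```

Each row has six real entries, matching $\boldsymbol{\phi}_t \in \mathbb{R}^6$.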

3. DeltaGRU Mechanism and Mathematical Formulation

DeltaGRU incorporates a two-threshold gating mechanism for both input and hidden deltas:

Delta Tracking: For each input element $k$ at step $t$,

$\Delta\phi_{t,k} = \begin{cases} \phi_{t,k} - \hat{\phi}_{t-1,k}, & |\phi_{t,k} - \hat{\phi}_{t-1,k}| > \Theta_\phi \\ 0, & \text{otherwise} \end{cases}$

and the stored reference $\hat{\phi}_{t,k}$ is similarly updated to $\phi_{t,k}$ only when the threshold is exceeded; otherwise it keeps the value $\hat{\phi}_{t-1,k}$. The same applies to the hidden state deltas ( $\Delta h_{t-1}$, with threshold $\Theta_h$ ).

Accumulation Registers: Rather than recomputing full GRU gates, pre-activation accumulators $M$ are incrementally updated with only the sparse deltas:

$M_{r,t} = W_{xr}\,\Delta\phi_t + W_{hr}\,\Delta h_{t-1} + M_{r,t-1}$

and analogous updates for $M_{u,t}$, $M_{xc,t}$, $M_{hc,t}$, with the accumulators initialized to the bias terms.

Gate Activations and State Update:

$r_t = \sigma(M_{r,t}), \quad u_t = \sigma(M_{u,t}), \quad c_t = \tanh(M_{xc,t} + r_t \odot M_{hc,t}), \quad h_t = (1 - u_t) \odot c_t + u_t \odot h_{t-1}$

Only the subset of MACs (“active params”) corresponding to nonzero deltas are computed per step.

This approach enables dynamic adaptation to changing signal/activity patterns, yielding observed temporal sparsity ( $\Gamma$ ) of 50%–80%.
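
The delta tracking, accumulator update, and gate computation described in this section can be sketched in plain Python as below. This is an illustrative reading of the mechanism under stated assumptions, not the OpenDPDv2 implementation: the class name `DeltaGRUCell`, the dictionary layout of the weights, the zero initial references, and the random test weights are all hypothetical. The sketch also counts the MACs actually executed, so the effect of the thresholds is visible.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def delta_encode(x, ref, theta):
    """Return (delta, new_ref). Elements whose change exceeds theta are
    propagated and refresh the reference; the rest give a zero delta."""
    delta, new_ref = [], []
    for xi, ri in zip(x, ref):
        changed = abs(xi - ri) > theta
        delta.append(xi - ri if changed else 0.0)
        new_ref.append(xi if changed else ri)
    return delta, new_ref


def accumulate(acc, W, delta):
    """In-place acc += W @ delta, skipping zero-delta columns.
    Returns the number of MACs actually executed."""
    macs = 0
    for j, dj in enumerate(delta):
        if dj == 0.0:
            continue
        for i in range(len(acc)):
            acc[i] += W[i][j] * dj
        macs += len(acc)
    return macs


class DeltaGRUCell:
    def __init__(self, W_x, W_h, b, theta_x, theta_h):
        # W_x, W_h, b: dicts keyed by "r", "u", "c" (layout chosen for this sketch).
        n_hid, n_in = len(W_x["r"]), len(W_x["r"][0])
        self.W_x, self.W_h = W_x, W_h
        self.theta_x, self.theta_h = theta_x, theta_h
        self.x_ref, self.h_ref = [0.0] * n_in, [0.0] * n_hid
        self.h = [0.0] * n_hid
        # Accumulators start from the bias terms.
        self.M = {"r": list(b["r"]), "u": list(b["u"]),
                  "xc": list(b["c"]), "hc": [0.0] * n_hid}

    def step(self, x):
        dx, self.x_ref = delta_encode(x, self.x_ref, self.theta_x)
        dh, self.h_ref = delta_encode(self.h, self.h_ref, self.theta_h)
        macs = 0
        for g in ("r", "u"):
            macs += accumulate(self.M[g], self.W_x[g], dx)
            macs += accumulate(self.M[g], self.W_h[g], dh)
        macs += accumulate(self.M["xc"], self.W_x["c"], dx)
        macs += accumulate(self.M["hc"], self.W_h["c"], dh)
        r = [sigmoid(m) for m in self.M["r"]]
        u = [sigmoid(m) for m in self.M["u"]]
        c = [math.tanh(mx + ri * mh)
             for mx, ri, mh in zip(self.M["xc"], r, self.M["hc"])]
        self.h = [(1 - ui) * ci + ui * hi for ui, ci, hi in zip(u, c, self.h)]
        return self.h, macs


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in, n_hid = 6, 4
W_x = {g: random_matrix(n_hid, n_in, rng) for g in "ruc"}
W_h = {g: random_matrix(n_hid, n_hid, rng) for g in "ruc"}
b = {g: [rng.uniform(-0.1, 0.1) for _ in range(n_hid)] for g in "ruc"}
# Slowly varying input, so many deltas fall below the threshold.
xs = [[math.sin(0.05 * t + k) for k in range(n_in)] for t in range(50)]

for theta in (0.0, 0.05):
    cell = DeltaGRUCell(W_x, W_h, b, theta, theta)
    total = sum(cell.step(x)[1] for x in xs)
    print(f"theta={theta}: {total} MACs executed")
```

With both thresholds at zero the cell reproduces a dense GRU up to floating-point rounding, because the accumulated deltas telescope to the full matrix-vector products. Raising the thresholds trades a small state error for fewer executed MACs.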

4. Training, Quantization, and Temporal Sparsity

Training Protocol

Data: APA_200MHz TM3.1a 5×40 MHz 256-QAM OFDM, 98,304 samples (60% train / 20% val / 20% test).
Model Cascade: A behavioral PA model (GRU) is trained first; the DPD model (TRes-DeltaGRU) is then trained in cascade with it to minimize the MSE against a linearly amplified target.
Optimizer/Loss: AdamW with a ReduceOnPlateau learning-rate schedule, MSE loss, batch size 64, 240 epochs, no explicit regularization beyond weight decay.
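
The cascade objective can be illustrated with a deliberately tiny, real-valued toy. Everything here is hypothetical: `pa_model` is a fixed cubic stand-in for the trained behavioral PA model, `dpd_model` has a single trainable coefficient instead of a TRes-DeltaGRU, and plain gradient descent with a finite-difference gradient replaces AdamW. The only point is the shape of the loss: the DPD output is passed through the PA model and compared with the linear target.

```python
def pa_model(u):
    """Fixed toy PA surrogate: mild cubic compression (hypothetical)."""
    return u - 0.1 * u ** 3


def dpd_model(x, a):
    """Toy predistorter with a single trainable coefficient `a`."""
    return x + a * x ** 3


def cascade_loss(a, xs, gain=1.0):
    """MSE between PA(DPD(x)) and the linear target gain * x."""
    errs = [(pa_model(dpd_model(x, a)) - gain * x) ** 2 for x in xs]
    return sum(errs) / len(errs)


xs = [i / 50 for i in range(-50, 51)]
a, lr, eps = 0.0, 0.5, 1e-6
initial_loss = cascade_loss(a, xs)
for _ in range(200):
    # Central finite difference stands in for backpropagation through the cascade.
    grad = (cascade_loss(a + eps, xs) - cascade_loss(a - eps, xs)) / (2 * eps)
    a -= lr * grad
final_loss = cascade_loss(a, xs)
print(f"a = {a:.4f}, loss {initial_loss:.2e} -> {final_loss:.2e}")
```

The learned coefficient becomes positive, meaning the predistorter expands the signal to counteract the compression of the toy PA.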

Quantization

Quantization-Aware Training (QAT): The forward pass runs in low precision (e.g., W16A16, W12A12), while the backward pass maintains full-precision weight copies and applies a straight-through estimator (STE) to the rounding operation.
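Quantization formula:

$x_q = s \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right),\, -2^{b-1},\, 2^{b-1}-1\right)$

with per-layer learned scale $s$ (power-of-two) and, for bit-width $b$, signed integer range $-2^{b-1}$ to $2^{b-1}-1$.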

Energy scaling: Lower bit-widths reduce arithmetic energy relative to FP32; in the per-operation figures of Section 5, a multiply drops from 1.31 pJ (FP32) to 0.21 pJ (INT12).
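
A minimal sketch of the forward-pass "fake quantization" step used in QAT, assuming a symmetric signed integer grid and a power-of-two scale. The function name `fake_quantize` and the example scale $2^{-10}$ are illustrative choices, not values from the source. Only the forward computation is shown; in training, the rounding would be treated as identity in the backward pass (the STE).

```python
def fake_quantize(x, bits, scale):
    """Snap x to a signed `bits`-bit grid with step `scale`, then dequantize.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale)
    q = max(q_min, min(q_max, q))  # clip to the representable integer range
    return q * scale


scale = 2 ** -10  # hypothetical power-of-two scale for a 12-bit tensor
print(fake_quantize(0.3, 12, scale))   # 0.2998046875  (307 / 1024)
print(fake_quantize(5.0, 12, scale))   # 1.9990234375  (clipped to 2047 / 1024)
print(fake_quantize(-5.0, 12, scale))  # -2.0          (clipped to -2048 / 1024)
```

A power-of-two scale lets hardware implement the multiply and divide by `scale` as bit shifts.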

Temporal Sparsity

Thresholds: Adjustable independently for input ( $\Theta_\phi$ ) and hidden ( $\Theta_h$ ) deltas. Scanning over the input and hidden thresholds yields temporal sparsity $\Gamma$ up to 80%. Typical trade-offs include:
- 0% sparsity: 996 active params (full dense)
- 56% sparsity: ≈450 active params
- 72.5% sparsity: ≈288 active params
Computation Reduction: Only the columns corresponding to deltas above threshold are included in MACs, reducing the per-step recurrent workload roughly in proportion to $1 - \Gamma$.
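
To make the threshold-to-sparsity relationship concrete, the sketch below measures the fraction of per-element deltas that a given threshold suppresses over a short sequence. The function name `temporal_sparsity`, the zero initial reference, and the toy sequence are assumptions for illustration only.

```python
def temporal_sparsity(seq, theta):
    """Fraction of per-element deltas suppressed by threshold `theta`."""
    ref = [0.0] * len(seq[0])
    skipped = total = 0
    for vec in seq:
        for k, v in enumerate(vec):
            if abs(v - ref[k]) > theta:
                ref[k] = v      # significant change: propagate, refresh reference
            else:
                skipped += 1    # below threshold: this column is skipped
            total += 1
    return skipped / total


seq = [[0.0, 1.0], [0.05, 1.0], [0.2, 1.5], [0.21, 1.5]]
print(temporal_sparsity(seq, 0.1))  # 0.625
```

Here 5 of the 8 element updates are skipped, so only 3/8 of the corresponding matrix columns would enter the MACs.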

5. Computational and Energy Efficiency

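TRes-DeltaGRU’s computational model partitions inference energy into arithmetic (add and multiply) and memory-access contributions. With quantized integer operations and temporal sparsity $\Gamma$, the number of executed operations falls with the fraction of skipped deltas, and the model additionally applies a precision factor equal to bit-width / 32.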

Gem5 ARMv7-A simulation yields:

FP32: add 0.38 pJ, mul 1.31 pJ
INT16: add 0.015 pJ, mul 0.37 pJ
INT12: add 0.011 pJ, mul 0.21 pJ
L1 D-cache 7.5 pJ, DDR4 1.3 nJ

Observed savings:

2.8× energy reduction (INT12, 0% sparsity; –54.5 dBc ACPR)
5.2× energy reduction (INT12, 72.5% sparsity; ACPR still better than –45 dBc)
4.5× energy reduction (INT12, 56% sparsity; –50.3 dBc ACPR, –35.2 dB EVM)
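
As a rough, arithmetic-only illustration of how precision and sparsity combine, the sketch below uses the per-operation add and multiply energies listed above. It assumes one multiply and one add per MAC and that every MAC is skippable at the given sparsity; the function name and the choice of 996 MACs per step are hypothetical. Memory and control energy are ignored, so the resulting ratios are not expected to match the measured savings, which include those costs.

```python
# Per-operation energies in picojoules, copied from the figures listed above.
ENERGY_PJ = {
    "FP32": {"add": 0.38, "mul": 1.31},
    "INT16": {"add": 0.015, "mul": 0.37},
    "INT12": {"add": 0.011, "mul": 0.21},
}


def arithmetic_energy_pj(n_macs, precision, sparsity=0.0):
    """Arithmetic energy per step: one mul + one add per executed MAC.

    Assumes all MACs are skippable at rate `sparsity`; memory is ignored.
    """
    e = ENERGY_PJ[precision]
    return n_macs * (1.0 - sparsity) * (e["mul"] + e["add"])


n_macs = 996  # hypothetical: one MAC per parameter per step
for precision, sparsity in [("FP32", 0.0), ("INT12", 0.0), ("INT12", 0.725)]:
    energy = arithmetic_energy_pj(n_macs, precision, sparsity)
    print(f"{precision:5s} sparsity={sparsity:.3f}: {energy:8.1f} pJ")
```

Comparing these arithmetic-only numbers with the measured reductions suggests how much of the total budget is taken by memory access rather than computation.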

6. Linearization Performance and Trade-offs

DPD Model	Sparsity	#Active Params	Precision	ACPR (dBc)	EVM (dB)
TRes-ΔGRU (dense)	0%	996	FP32	–59.4	–42.1
TRes-ΔGRU (dense)	0%	996	W16A16	–58.8	–41.2
TRes-ΔGRU (dense)	0%	996	W12A12	–54.5	–37.3
TRes-ΔGRU (sparse)	56%	450	FP32	–52.9	–35.7
TRes-ΔGRU (sparse)	56%	450	W16A16	–53.2	–39.3
TRes-ΔGRU (sparse)	56%	450	W12A12	–50.3	–35.2
TRes-ΔGRU (sparse)	72.5%	288	FP32	–52.0	–37.0
TRes-ΔGRU (sparse)	72.5%	288	W16A16	–48.2	–34.2
TRes-ΔGRU (sparse)	72.5%	288	W12A12	–46.9	–31.0

Dense TRes-DeltaGRU-996 achieves the best ACPR and EVM in the table (–59.4 dBc and –42.1 dB at FP32) with only ≈1000 parameters. Notably, with 56% sparsity and INT12 (W12A12), it retains –50.3 dBc ACPR and –35.2 dB EVM, remaining better than the 3GPP ACPR mask of –45 dBc and EVM mask of –30 dB.
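
For readers more used to EVM in percent, the short sketch below converts the dB figures using the standard relation EVM(%) = 100 · 10^(EVM_dB / 20) and computes the margin to the masks quoted above. The helper names are illustrative.

```python
def evm_db_to_percent(evm_db):
    """Convert EVM from dB (20*log10 of the RMS error ratio) to percent."""
    return 100.0 * 10.0 ** (evm_db / 20.0)


def margin_db(measured_db, limit_db):
    """Positive when the measured value is below (better than) the limit."""
    return limit_db - measured_db


# 56% sparsity, W12A12 row of the table, against the masks quoted above.
print(f"EVM: {evm_db_to_percent(-35.2):.2f}% vs mask {evm_db_to_percent(-30.0):.2f}%")
print(f"ACPR margin: {margin_db(-50.3, -45.0):.1f} dB")
print(f"EVM margin:  {margin_db(-35.2, -30.0):.1f} dB")
```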

7. Implementation Considerations and Extensions

OpenDPDv2 provides a PyTorch-based end-to-end implementation, with export options to C/C++ for embedded deployment. Gem5-based, cycle-accurate ARM simulations yield realistic workload and memory assessments; custom ASICs could further minimize control overhead by exploiting delta accumulators and sparse MAC arrays. At high sparsity, CPU platform benefits are currently limited by instruction-cache energy rather than arithmetic energy.

Prospective extensions include:

Mixed-precision, asynchronous adaptation for online learning;
Deeper or alternate TCN/attention residuals to further offset recurrent sparsity;
Application to multi-antenna MIMO DPD.

TRes-DeltaGRU exemplifies a unified approach combining (i) temporal-delta gating, (ii) residual TCN correction, and (iii) quantization-aware training, enabling high-performance, low-power digital predistortion suitable for modern embedded RF systems (Wu et al., 9 Jul 2025).


References (1)

OpenDPDv2: A Unified Learning and Optimization Framework for Neural Network Digital Predistortion (2025)
