DeltaGRU: Energy-Optimized Digital Predistortion

Updated 24 May 2026
  • DeltaGRU is an energy-optimized recurrent neural network architecture that exploits temporal sparsity to reduce redundant computations in RF power amplifier digital predistortion.
  • It integrates a temporal convolutional residual path that ensures accurate linearization while maintaining low computational and energy requirements in GHz-rate, memory-constrained systems.
  • Its design features, including two-threshold gating and quantization-aware training, achieve competitive ACPR and EVM performance with significantly fewer active parameters.

DeltaGRU is an energy-optimized recurrent neural network architecture introduced in the context of neural network digital predistortion (NN-DPD) for wideband radio-frequency (RF) power amplifiers, with key innovations centered on temporal sparsity exploitation and hardware-efficient computation. In its primary application within the TRes-DeltaGRU digital predistortion (DPD) algorithm, it achieves competitive linearization performance (up to –59.4 dBc ACPR, –42.1 dB EVM) with substantial reductions in inference energy and parameter count, making it suitable for GHz-rate, memory-limited embedded systems (Wu et al., 9 Jul 2025).

1. Motivation and Foundations

Digital predistortion for RF power amplifiers must address nonlinearities and memory effects at MHz-to-GHz bandwidths. Conventional NN-based DPDs, often using large GRUs or LSTMs, improve signal fidelity but suffer from high MAC and memory demands, incurring significant energy consumption in high-throughput digital back-ends.

DeltaGRU exploits observed temporal stability in both input features and hidden states—the fact that, sample-to-sample, only a fraction of signals and internal activations change significantly. By updating only the “significant” deltas (determined by per-element thresholds), many redundant computations can be skipped. This mechanism is augmented by a lightweight temporal convolutional (TCN) “residual” path (TRes) to maintain linearization accuracy even as recurrent-path computations are pruned.

The resulting architecture—TRes-DeltaGRU—combines compressed parameterization (≈1k parameters), dynamic temporal sparsity (50–80%), and quantization amenability for low-power fixed-point inference (Wu et al., 9 Jul 2025).

2. Network Architecture

The TRes-DeltaGRU block has four integral components:

  • Input Feature Embedding: Each baseband sample index $t$ yields the vector

$$\boldsymbol{\phi}_t = [I_{x_t}, Q_{x_t}, I_{x_{t+1}}, Q_{x_{t+1}}, |x_t|, |x_t|^3]^\top \in \mathbb{R}^6$$

comprising present/next-sample I/Q and amplitude terms for implicit memory.

  • DeltaGRU Recurrent Core: Hidden size $H$ (typically 15). GRU state updates are based exclusively on componentwise input and hidden deltas exceeding thresholds ($\Theta_\phi$, $\Theta_h$). Dense matrix-vector products become dense-matrix × sparse-vector products, reducing computation according to per-step activity.
  • Temporal Convolutional Residual (TRes) Path: A two-layer dilated TCN, with kernel sizes $K=3$ and $K=1$, dilation factors $d=16$ and $d=0$, Hardswish activations, and non-causal padding. This module learns short-term dependencies directly from the input sequence for output correction.
  • Output Projection: The predistorted output sequence is computed as

$$\hat{\mathbf{u}}_t = W_{\hat{y}} h_t + b_{\hat{y}}, \qquad \mathbf{u}_t = \hat{\mathbf{u}}_t + \mathrm{TCN}(\mathbf{X})_t$$

allowing the TCN residual to decouple output fidelity from recurrent sparsity.
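
As a rough illustration of the input feature embedding listed above, the sketch below builds the six-element vector for each sample of a complex baseband sequence. The function name `embed_features` and the handling of the final sample (it is reused as its own "next" sample) are assumptions made for this example, not details taken from the source.

```python
def embed_features(x):
    """Build [I_t, Q_t, I_{t+1}, Q_{t+1}, |x_t|, |x_t|^3] for each complex sample."""
    feats = []
    for t, xt in enumerate(x):
        # Assumption: the last sample has no successor, so it is reused as x_{t+1}.
        nxt = x[t + 1] if t + 1 < len(x) else xt
        mag = abs(xt)
        feats.append([xt.real, xt.imag, nxt.real, nxt.imag, mag, mag ** 3])
    return feats


samples = [complex(3.0, 4.0), complex(0.0, 1.0)]
phi = embed_features(samples)
print(phi[0])  # [3.0, 4.0, 0.0, 1.0, 5.0, 125.0]
```

Each row would then be fed to both the recurrent core and the TCN residual path.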

3. DeltaGRU Mechanism and Mathematical Formulation

DeltaGRU incorporates a two-threshold gating mechanism for both input and hidden deltas:

  • Delta Tracking: For each signal component $k$ at step $t$,

$$\Delta\phi_{t,k} = \begin{cases} \phi_{t,k} - \hat{\phi}_{t-1,k}, & |\phi_{t,k} - \hat{\phi}_{t-1,k}| > \Theta_\phi \\ 0, & \text{otherwise} \end{cases}$$

and the stored reference $\hat{\phi}_{t,k}$ is similarly updated only when the threshold is exceeded. The same applies to the hidden state deltas ($\Delta h_{t,k}$, with threshold $\Theta_h$).

  • Accumulation Registers: Rather than recomputing full GRU gates, pre-activation accumulators $M$ are incrementally updated with only the sparse deltas:

$$M_{r,t} = W_{\phi r}\,\Delta\boldsymbol{\phi}_t + W_{hr}\,\Delta\mathbf{h}_{t-1} + M_{r,t-1}$$

and analogous updates for $M_{u,t}$, $M_{c\phi,t}$, $M_{ch,t}$, with initial bias terms.

  • Gate Activations and State Update:

$$r_t = \sigma(M_{r,t}), \quad u_t = \sigma(M_{u,t}), \quad c_t = \tanh(M_{c\phi,t} + r_t \odot M_{ch,t}), \quad h_t = (1 - u_t) \odot c_t + u_t \odot h_{t-1}$$

Only the subset of MACs (“active params”) corresponding to nonzero deltas are computed per step.

This approach enables dynamic adaptation to changing signal/activity patterns, yielding observed temporal sparsity of 50%–80%.
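
The delta tracking, accumulator, and gate steps described above can be sketched in plain Python as follows. This is a minimal illustration that assumes the standard delta-GRU update (reset gate `r`, update gate `u`, candidate `c`); the names `delta_encode`, `matvec_sparse`, `delta_gru_step`, the dictionary layout, and the toy weights are all hypothetical and are not taken from the TRes-DeltaGRU code.

```python
import math


def delta_encode(v, v_ref, theta):
    """Return thresholded deltas and the updated reference vector."""
    delta, new_ref = [], []
    for vk, rk in zip(v, v_ref):
        if abs(vk - rk) > theta:
            delta.append(vk - rk)   # significant change: propagate it
            new_ref.append(vk)
        else:
            delta.append(0.0)       # insignificant change: treated as zero
            new_ref.append(rk)
    return delta, new_ref


def matvec_sparse(W, delta):
    """Dense-matrix x sparse-vector product that skips zero-delta columns."""
    out, macs = [0.0] * len(W), 0
    for j, d in enumerate(delta):
        if d == 0.0:
            continue                # whole column skipped, no MACs spent
        for i in range(len(W)):
            out[i] += W[i][j] * d
            macs += 1
    return out, macs


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def delta_gru_step(phi, s, p, theta_phi, theta_h):
    """One step; s holds references, accumulators and h, p holds weights."""
    d_phi, s["phi_ref"] = delta_encode(phi, s["phi_ref"], theta_phi)
    d_h, s["h_ref"] = delta_encode(s["h"], s["h_ref"], theta_h)
    macs = 0
    # Add only the contributions of nonzero deltas to each accumulator.
    for acc, w, d in (("Mr", "Wxr", d_phi), ("Mr", "Whr", d_h),
                      ("Mu", "Wxu", d_phi), ("Mu", "Whu", d_h),
                      ("Mcx", "Wxc", d_phi), ("Mch", "Whc", d_h)):
        inc, n = matvec_sparse(p[w], d)
        s[acc] = [m + i for m, i in zip(s[acc], inc)]
        macs += n
    r = [sigmoid(m) for m in s["Mr"]]
    u = [sigmoid(m) for m in s["Mu"]]
    c = [math.tanh(a + rk * b) for a, rk, b in zip(s["Mcx"], r, s["Mch"])]
    s["h"] = [(1 - uk) * ck + uk * hk for uk, ck, hk in zip(u, c, s["h"])]
    return s["h"], macs


# Toy example: 2 inputs, 2 hidden units, zero biases (accumulators start at the bias).
params = {k: [[0.1, -0.2], [0.3, 0.05]]
          for k in ("Wxr", "Whr", "Wxu", "Whu", "Wxc", "Whc")}
state = {"phi_ref": [0.0, 0.0], "h_ref": [0.0, 0.0], "h": [0.0, 0.0],
         "Mr": [0.0, 0.0], "Mu": [0.0, 0.0], "Mcx": [0.0, 0.0], "Mch": [0.0, 0.0]}

h1, macs1 = delta_gru_step([1.0, 0.5], state, params, theta_phi=0.1, theta_h=10.0)
h2, macs2 = delta_gru_step([1.0, 0.5], state, params, theta_phi=0.1, theta_h=10.0)
print(macs1, macs2)  # 12 0
```

In the second call the input is unchanged and the hidden-state change stays below the (deliberately large) hidden threshold, so no multiply-accumulates are performed and the gates are evaluated from the stored accumulators alone.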

4. Training, Quantization, and Temporal Sparsity

Training Protocol

  • Data: APA_200MHz TM3.1a 5×40 MHz 256-QAM OFDM, 98,304 samples (60% train / 20% val / 20% test).
  • Model Cascade: A behavioral PA model (GRU) is trained first; the DPD model (TRes-DeltaGRU) is then trained in cascade with it to minimize the MSE between the cascade output and a linearly amplified target.
  • Optimizer/Loss: AdamW, initial learning rate [value missing] with ReduceOnPlateau scheduling, MSE loss, batch size 64, 240 epochs, no explicit regularization beyond weight decay.

Quantization

  • Quantization-Aware Training (QAT): The forward pass runs in low precision (e.g., W16A16, W12A12), while the backward pass keeps full-precision weight copies and applies a straight-through estimator (STE) to the rounding step.
  • Quantization formula:

[equation missing]

with a per-layer learned power-of-two scale and a clipping range [bounds missing].

  • Energy scaling: Arithmetic energy is reduced relative to FP32 by up to a factor [value missing] that depends on bit-width.
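
To make the quantization step more concrete, here is a minimal sketch of a generic signed fixed-point "fake quantizer" with a power-of-two step. It is not the formula from the source; the function name `fake_quantize`, the `frac_bits` parameter, and the symmetric two's-complement clipping range are assumptions chosen for illustration. In QAT, a function like this would run in the forward pass while gradients bypass the rounding via the STE.

```python
def fake_quantize(x, bits, frac_bits):
    """Round x onto a signed fixed-point grid, then map it back to a float."""
    scale = 2.0 ** -frac_bits        # assumed power-of-two step size
    q_min = -(2 ** (bits - 1))       # assumed two's-complement range
    q_max = 2 ** (bits - 1) - 1
    q = round(x / scale)             # Python rounds exact ties to even
    q = max(q_min, min(q_max, q))    # saturate instead of wrapping
    return q * scale


print(fake_quantize(0.3, bits=12, frac_bits=10))  # 0.2998046875
print(fake_quantize(5.0, bits=12, frac_bits=10))  # 1.9990234375 (saturated)
```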

Temporal Sparsity

  • Thresholds: Adjustable independently for input ($\Theta_\phi$) and hidden ($\Theta_h$) deltas. Scanning both thresholds yields temporal sparsity up to 80%. Typical operating points, with sparsity levels as reported in Section 6:
    • 0% sparsity: 996 active params (fully dense)
    • 56% sparsity: ≈450 active params
    • 72.5% sparsity: ≈288 active params
  • Computation Reduction: Only the columns corresponding to deltas above threshold are included in MACs, reducing the per-step workload in proportion to the fraction of active deltas.
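
The relation between threshold and temporal sparsity can be illustrated with a small measurement routine. This is a sketch under simple assumptions: sparsity is taken to be the fraction of delta elements that are zeroed, references start at zero, and the toy sequence and the name `temporal_sparsity` are made up for the example.

```python
def temporal_sparsity(seq, theta):
    """Fraction of elements whose change since the last update is <= theta."""
    ref = [0.0] * len(seq[0])
    zeros = total = 0
    for v in seq:
        for k, vk in enumerate(v):
            if abs(vk - ref[k]) > theta:
                ref[k] = vk          # active delta: its weight column is computed
            else:
                zeros += 1           # zero delta: its weight column is skipped
            total += 1
    return zeros / total


# Slowly varying one-dimensional toy signal.
seq = [[0.00], [0.05], [0.08], [0.30], [0.32], [0.31], [0.60], [0.61]]
print(temporal_sparsity(seq, 0.0))  # 0.125
print(temporal_sparsity(seq, 0.1))  # 0.75
```

Raising the threshold from 0 to 0.1 turns most of the small sample-to-sample changes into skipped columns, which is the effect the threshold scan exploits.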

5. Computational and Energy Efficiency

TRes-DeltaGRU’s computational model partitions inference energy as:

[equation missing]

With quantized integer operations and temporal sparsity, the model gives three scaled energy terms [equations missing], in which a precision factor equals bit-width / 32.

Gem5 ARMv7-A simulation yields the following per-operation and per-access energies:

  • FP32: add 0.38 pJ, mul 1.31 pJ
  • INT16: add 0.015 pJ, mul 0.37 pJ
  • INT12: add 0.011 pJ, mul 0.21 pJ
  • L1 D-cache 7.5 pJ, DDR4 1.3 nJ
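
The per-operation figures above allow a quick back-of-the-envelope estimate of arithmetic energy. The sketch below assumes one MAC costs one add plus one multiply and that skipped (zero-delta) MACs cost nothing; it ignores memory access and control overhead, so its ratios are not comparable to the overall savings reported for the full model. The helper names are hypothetical.

```python
# Per-operation energies in pJ, copied from the list above.
ENERGY_PJ = {
    "FP32": {"add": 0.38, "mul": 1.31},
    "INT16": {"add": 0.015, "mul": 0.37},
    "INT12": {"add": 0.011, "mul": 0.21},
}


def mac_energy_pj(precision):
    """Energy of one multiply-accumulate, assumed to be one mul plus one add."""
    e = ENERGY_PJ[precision]
    return e["add"] + e["mul"]


def arithmetic_energy_pj(n_macs, precision, sparsity=0.0):
    """Arithmetic-only energy; assumes skipped MACs are free."""
    return n_macs * (1.0 - sparsity) * mac_energy_pj(precision)


fp32 = mac_energy_pj("FP32")
print(round(fp32 / mac_energy_pj("INT16"), 2))  # 4.39
print(round(fp32 / mac_energy_pj("INT12"), 2))  # 7.65
```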

Observed savings:

  • 2.8× energy reduction (INT12, 0% sparsity; ACPR –54.5 dBc)
  • 5.2× energy reduction (INT12, 72.5% sparsity; ACPR still better than –45 dBc)
  • 4.5× energy reduction (INT12, 56% sparsity; ACPR –50.3 dBc, EVM –35.2 dB)

6. Linearization Performance and Trade-offs

DPD Model            Sparsity   #Active Params   Precision   ACPR (dBc)   EVM (dB)
TRes-ΔGRU (dense)    0%         996              FP32        –59.4        –42.1
TRes-ΔGRU (dense)    0%         996              W16A16      –58.8        –41.2
TRes-ΔGRU (dense)    0%         996              W12A12      –54.5        –37.3
TRes-ΔGRU (sparse)   56%        450              FP32        –52.9        –35.7
TRes-ΔGRU (sparse)   56%        450              W16A16      –53.2        –39.3
TRes-ΔGRU (sparse)   56%        450              W12A12      –50.3        –35.2
TRes-ΔGRU (sparse)   72.5%      288              FP32        –52.0        –37.0
TRes-ΔGRU (sparse)   72.5%      288              W16A16      –48.2        –34.2
TRes-ΔGRU (sparse)   72.5%      288              W12A12      –46.9        –31.0

Dense TRes-DeltaGRU-996 achieves the best-reported ACPR and EVM with only ≈1000 parameters. Notably, with 56% sparsity and INT12 precision, it retains –50.3 dBc ACPR and –35.2 dB EVM, clearing the 3GPP limits of –45 dBc ACPR and –30 dB EVM.
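
The comparison against the two limits can be checked mechanically. The short sketch below copies the table rows and tests each against the –45 dBc ACPR and –30 dB EVM limits quoted above; the tuple layout and the helper name `meets_limits` are just conventions for this example.

```python
# (sparsity %, precision, ACPR dBc, EVM dB), copied from the table above.
RESULTS = [
    (0.0, "FP32", -59.4, -42.1),
    (0.0, "W16A16", -58.8, -41.2),
    (0.0, "W12A12", -54.5, -37.3),
    (56.0, "FP32", -52.9, -35.7),
    (56.0, "W16A16", -53.2, -39.3),
    (56.0, "W12A12", -50.3, -35.2),
    (72.5, "FP32", -52.0, -37.0),
    (72.5, "W16A16", -48.2, -34.2),
    (72.5, "W12A12", -46.9, -31.0),
]

ACPR_LIMIT_DBC = -45.0
EVM_LIMIT_DB = -30.0


def meets_limits(acpr, evm):
    """Lower (more negative) is better for both metrics."""
    return acpr <= ACPR_LIMIT_DBC and evm <= EVM_LIMIT_DB


passing = [row for row in RESULTS if meets_limits(row[2], row[3])]
tightest = max(RESULTS, key=lambda row: row[2])  # row closest to the ACPR limit
print(len(passing), "of", len(RESULTS))  # 9 of 9
print(tightest)  # (72.5, 'W12A12', -46.9, -31.0)
```

Every configuration in the table clears both limits; the 72.5%-sparsity W12A12 row has the smallest margin.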

7. Implementation Considerations and Extensions

OpenDPDv2 provides a PyTorch-based end-to-end implementation, with export options to C/C++ for embedded deployment. Gem5-based, cycle-accurate ARM simulations yield realistic workload and memory assessments; custom ASICs could further minimize control overhead by exploiting delta accumulators and sparse MAC arrays. At high sparsity, CPU platform benefits are currently limited by instruction-cache energy rather than arithmetic energy.

Prospective extensions include:

  • Mixed-precision, asynchronous adaptation for online learning;
  • Deeper or alternate TCN/attention residuals to further offset recurrent sparsity;
  • Application to multi-antenna MIMO DPD.

TRes-DeltaGRU exemplifies a unified approach combining (i) temporal-delta gating, (ii) residual TCN correction, and (iii) quantization-aware training, enabling high-performance, low-power digital predistortion suitable for modern embedded RF systems (Wu et al., 9 Jul 2025).
