Widely Linear Kernel Activation Functions
- Widely linear kernel activation functions extend standard KAFs by incorporating both the input and its complex conjugate, enabling richer nonlinear transformations.
- They combine kernel and pseudo-kernel responses with trainable complex coefficients to capture dependencies between real and imaginary parts efficiently.
- Empirical evaluations show WL-KAFs improve convergence speed and accuracy on complex pattern recognition tasks compared to conventional approaches.
Widely linear kernel activation functions (WL-KAFs) are a family of flexible, data-driven activation functions designed for complex-valued neural networks (CVNNs). These functions extend the kernel activation function (KAF) paradigm to the complex domain by using widely linear kernels, permitting richer nonlinear transformations that exploit both the input and its complex conjugate. WL-KAFs enable enhanced expressivity in CVNNs with minimal computational and parameter overhead, and have demonstrated improved performance on complex-valued pattern recognition tasks (Scardapane et al., 2019).
1. Background: Complex-Valued Neural Networks and KAFs
CVNNs generalize real-valued feedforward networks by allowing all weights, biases, and activations to be complex-valued. For an $L$-layer CVNN, the transformation is
$$f(\mathbf{x}) = \left(g_L \circ T_L \circ \cdots \circ g_1 \circ T_1\right)(\mathbf{x}),$$
where each layer computes a complex affine transformation followed by an activation, $\mathbf{h}_l = g_l\left(T_l(\mathbf{h}_{l-1})\right) = g_l\left(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l\right)$, with $\mathbf{W}_l$ and $\mathbf{b}_l$ complex and $g_l$ applied elementwise. Training minimizes a loss that is typically the sum of a data-dependent term (e.g., squared error, complex cross-entropy) and a regularization term.
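To make the notation concrete, the following NumPy sketch implements one such layer for a single input vector; the layer sizes, the weight scaling, and the split-tanh placeholder activation are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes (assumptions, not taken from the paper).
n_in, n_out = 100, 100

# Complex weights W in C^{n_out x n_in} and bias b in C^{n_out}.
W = 0.05 * (rng.standard_normal((n_out, n_in)) + 1j * rng.standard_normal((n_out, n_in)))
b = np.zeros(n_out, dtype=np.complex128)

def layer_forward(h, activation):
    """One CVNN layer: complex affine map followed by an elementwise activation."""
    s = W @ h + b          # complex preactivation
    return activation(s)   # elementwise nonlinearity (a KAF or WL-KAF in later sections)

# Placeholder split-tanh activation, used here only to make the sketch runnable.
split_tanh = lambda s: np.tanh(s.real) + 1j * np.tanh(s.imag)
h0 = rng.standard_normal(n_in) + 1j * rng.standard_normal(n_in)
h1 = layer_forward(h0, split_tanh)
```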
The standard KAF framework sidesteps the need to select a fixed analytic activation by learning each neuron's activation function via a one-dimensional kernel expansion. Fixing a dictionary $\{d_m\}_{m=1}^{D}$ of $D$ elements, the activation is
$$g(s) = \sum_{m=1}^{D} \alpha_m\, \kappa(s, d_m),$$
where the mixing coefficients $\alpha_m \in \mathbb{C}$ are trainable and the kernel is usually the complex Gaussian, $\kappa(s, d) = \exp\!\left(-\gamma\,(s - \bar{d})^2\right)$ with bandwidth $\gamma > 0$.
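A minimal NumPy sketch of this standard complex KAF is given below; the $4 \times 4$ dictionary grid, its range, and the bandwidth are illustrative assumptions, while the kernel is the complexified Gaussian written above.

```python
import numpy as np

rng = np.random.default_rng(0)

def complex_gaussian(s, d, gamma=1.0):
    """Complexified Gaussian kernel: kappa(s, d) = exp(-gamma * (s - conj(d))**2)."""
    return np.exp(-gamma * (s - np.conj(d)) ** 2)

# Dictionary: a fixed grid in the complex plane (the 4x4 layout and [-2, 2] range
# are illustrative choices, consistent with dictionary sizes such as D = 16).
grid = np.linspace(-2.0, 2.0, 4)
d = (grid[:, None] + 1j * grid[None, :]).ravel()                                  # D = 16
alpha = 0.1 * (rng.standard_normal(d.size) + 1j * rng.standard_normal(d.size))   # trainable

def kaf(s):
    """Standard complex KAF: trainable linear combination of kernel evaluations."""
    return np.sum(alpha * complex_gaussian(s, d))

print(kaf(0.3 - 0.7j))
```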
While KAFs allow neuronwise activations tuned to data, standard complex KAFs are limited: they cannot model arbitrary dependencies between the real and imaginary parts of the preactivation $s$, because the underlying reproducing kernel Hilbert space (RKHS) representation imposes intrinsic constraints among its subblocks.
2. Widely Linear Kernel Activation Function Formulation
Widely linear kernels extend KAFs by incorporating a dependence on both $s$ and its conjugate $\bar{s}$, thus lifting the constraint that the expansion models only analytic functions. A widely linear kernel model is defined by a pair of functions $\kappa(s, d)$ and $\tilde{\kappa}(s, d)$,
where $\kappa$ is the original kernel and $\tilde{\kappa}$ is the so-called pseudo-kernel, which typically acts on the conjugated arguments.
The widely linear KAF (WL-KAF) for a single neuron is then
$$g(s) = \sum_{m=1}^{D} \alpha_m\, \kappa(s, d_m) + \sum_{m=1}^{D} \beta_m\, \tilde{\kappa}(s, d_m),$$
where both $\alpha_m$ and $\beta_m$ are complex and trainable. In practice, $\tilde{\kappa}$ is often built directly from the base kernel acting on conjugated arguments, giving a compact trainable form that reuses the same kernel machinery as the standard expansion. The number of trainable parameters per neuron remains of order $D$ complex coefficients, as in the standard KAF.
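The sketch below evaluates a single WL-KAF neuron under the assumptions of the previous snippet; the pseudo-kernel used here, the base kernel applied to the conjugated input, is a simple stand-in rather than the paper's specific construction (the Case 1 and Case 2 choices are described next).

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(s, d, gamma=1.0):
    """Base complex Gaussian kernel."""
    return np.exp(-gamma * (s - np.conj(d)) ** 2)

def pseudo_kernel(s, d, gamma=1.0):
    """Pseudo-kernel stand-in: the base kernel applied to the conjugated input.
    This is one simple choice, not the paper's Case 1/2 constructions."""
    return np.exp(-gamma * (np.conj(s) - np.conj(d)) ** 2)

grid = np.linspace(-2.0, 2.0, 4)
d = (grid[:, None] + 1j * grid[None, :]).ravel()                       # shared dictionary
alpha = 0.1 * (rng.standard_normal(d.size) + 1j * rng.standard_normal(d.size))
beta = 0.1 * (rng.standard_normal(d.size) + 1j * rng.standard_normal(d.size))

def wl_kaf(s):
    """Widely linear KAF: kernel expansion plus pseudo-kernel (conjugate) expansion."""
    return np.sum(alpha * kernel(s, d)) + np.sum(beta * pseudo_kernel(s, d))

print(wl_kaf(0.5 + 0.2j))
```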
Different WL-KAF flavors are defined via the choice of $\kappa$ and $\tilde{\kappa}$ (a code sketch of Case 1 follows below):
- Case 1 (independent real/imaginary parts): $\kappa(s, d) = \kappa_{RR}(s, d) + \kappa_{II}(s, d)$ and $\tilde{\kappa}(s, d) = \kappa_{RR}(s, d) - \kappa_{II}(s, d)$, where $\kappa_{RR}$ and $\kappa_{II}$ are independent real-valued Gaussian kernels with bandwidths $\gamma_R$ and $\gamma_I$.
- Case 2 (mixed-effects separable kernels): cross kernels $\kappa_{RI}$ and $\kappa_{IR}$ are retained in addition to $\kappa_{RR}$ and $\kappa_{II}$, giving $\kappa = \kappa_{RR} + \kappa_{II} + j(\kappa_{IR} - \kappa_{RI})$ and $\tilde{\kappa} = \kappa_{RR} - \kappa_{II} + j(\kappa_{IR} + \kappa_{RI})$, where each component kernel is real-valued (typically Gaussian) and the cross terms couple the real and imaginary parts.
These forms allow independent or coupled modeling of real and imaginary parts, and can recover the standard KAF as a special case.
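As referenced above, here is a sketch of the Case 1 construction; treating $\kappa_{RR}$ and $\kappa_{II}$ as real Gaussian kernels on the complex plane (viewed as $\mathbb{R}^2$) is an assumption of this sketch, as are the bandwidth values.

```python
import numpy as np

def real_gaussian(s, d, gamma):
    """Real-valued Gaussian kernel on the complex plane, seen as R^2."""
    return np.exp(-gamma * np.abs(s - d) ** 2)

def case1_kernels(s, d, gamma_r=1.0, gamma_i=1.0):
    """Case 1 (independent real/imaginary parts):
    kernel        = k_RR + k_II
    pseudo-kernel = k_RR - k_II
    with real Gaussians of bandwidths gamma_r and gamma_i (illustrative values)."""
    k_rr = real_gaussian(s, d, gamma_r)
    k_ii = real_gaussian(s, d, gamma_i)
    return k_rr + k_ii, k_rr - k_ii

kappa, kappa_tilde = case1_kernels(0.3 + 0.4j, -0.1 + 0.2j)
print(kappa, kappa_tilde)
```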
3. Training, Architecture, and Implementation
WL-KAFs are integrated into CVNN layers as drop-in replacements for analytic nonlinearities. The forward pass for a neuron computes the kernel and pseudo-kernel response vectors, uses the learned mixing coefficients, and outputs the linear combination:
- Linear preactivation: $s = \mathbf{w}^{\mathsf{T}}\mathbf{h} + b$, with complex weights $\mathbf{w}$ and bias $b$.
- For the neuron's preactivation $s$, compute the kernel responses $\kappa(s, d_m)$ and the pseudo-kernel responses $\tilde{\kappa}(s, d_m)$ for $m = 1, \dots, D$.
- Output: $g(s) = \sum_{m=1}^{D} \alpha_m\, \kappa(s, d_m) + \beta_m\, \tilde{\kappa}(s, d_m)$.
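A vectorized sketch of these three steps for a full layer is shown below; the array shapes, the shared dictionary, and the conjugate-input pseudo-kernel stand-in carry over from the earlier single-neuron sketch and are assumptions rather than the authors' implementation.

```python
import numpy as np

def wl_kaf_layer(H, W, b, d, alpha, beta, gamma=1.0):
    """Vectorized forward pass of a WL-KAF layer (shapes are assumptions).
    H: (batch, n_in) complex inputs; W: (n_out, n_in); b: (n_out,)
    d: (D,) complex dictionary shared by all neurons
    alpha, beta: (n_out, D) complex mixing coefficients."""
    S = H @ W.T + b                                        # 1) preactivations (batch, n_out)
    Se = S[..., None]                                      #    add a dictionary axis
    K = np.exp(-gamma * (Se - np.conj(d)) ** 2)            # 2) kernel responses (batch, n_out, D)
    Kt = np.exp(-gamma * (np.conj(Se) - np.conj(d)) ** 2)  #    pseudo-kernel responses
    return np.sum(alpha * K + beta * Kt, axis=-1)          # 3) outputs (batch, n_out)
```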
Gradients are propagated using Wirtinger (CR) calculus, treating $s$ and $\bar{s}$ as independent variables, so the derivatives of $g$ with respect to the mixing coefficients and the preactivation follow directly from the kernel and pseudo-kernel expansions and can be handled by standard automatic differentiation.
Training employs standard optimizers (e.g., Adagrad, Adam). Hyperparameters include the dictionary size $D$ (e.g., $16$ or $64$), the dictionary grid (uniformly spaced around zero in the complex plane), the kernel bandwidths ($\gamma$, or $\gamma_R$ and $\gamma_I$ in the widely linear cases), and the regularization constant. Dictionary elements and kernel bandwidths are typically initialized using heuristics and then fine-tuned by gradient descent. Early stopping and weight decay on the kernel coefficients are recommended.
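The following sketch illustrates the two initialization heuristics mentioned above, a uniform dictionary grid around zero and a median-heuristic bandwidth; the grid range is an illustrative choice.

```python
import numpy as np

def init_dictionary(side=4, lim=2.0):
    """Uniform grid of side*side complex elements around zero
    (lim is an illustrative range; side=4 gives D=16, side=8 gives D=64)."""
    g = np.linspace(-lim, lim, side)
    return (g[:, None] + 1j * g[None, :]).ravel()

def median_heuristic(d):
    """Median heuristic for the bandwidth: gamma = 1 / (2 * median(|d_i - d_j|)^2)."""
    dists = np.abs(d[:, None] - d[None, :])
    sigma = np.median(dists[dists > 0])
    return 1.0 / (2.0 * sigma ** 2)

d = init_dictionary(4)        # D = 16
gamma0 = median_heuristic(d)  # starting bandwidth, to be fine-tuned by gradient descent
```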
In terms of complexity:
- Each neuron requires $\mathcal{O}(D)$ complex mixing coefficients, the same order as the standard KAF.
- The forward/backward pass per neuron has $\mathcal{O}(D)$ cost for both the kernel and the pseudo-kernel expansions, effectively doubling the kernel computations relative to the standard KAF, though this is a minor constant factor.
4. Empirical Evaluation and Results
Performance was investigated on image-classification benchmarks transformed to the complex domain using 2D FFT, with the top 100 coefficients per image selected as vectors. Datasets included MNIST, Fashion-MNIST, EMNIST Digits, and Latin OCR.
Each model used three hidden layers of 100 complex neurons with KAF or WL-KAF activations; the output layer mapped the complex outputs to real values (e.g., via their magnitudes) before applying a softmax. Optimization used Adagrad with batch size 40, and the regularization constant was tuned by grid search.
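A sketch of this complex-domain preprocessing is given below; selecting the coefficients by largest magnitude, per image, is an assumption, since the setup above only states that the top 100 FFT coefficients are kept.

```python
import numpy as np

def fft_features(image, n_coeffs=100):
    """Map a real-valued image to a complex feature vector via the 2D FFT,
    keeping the n_coeffs largest-magnitude coefficients (assumed selection rule)."""
    F = np.fft.fft2(image).ravel()
    idx = np.argsort(-np.abs(F))[:n_coeffs]     # indices of dominant coefficients
    return F[idx].astype(np.complex64)

x = fft_features(np.random.rand(28, 28))        # e.g., one MNIST-sized image
```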
Test accuracy (mean ± std over five runs):

| Model | MNIST (%) | F-MNIST (%) | EMNIST-D (%) | Latin OCR (%) |
|---|---|---|---|---|
| Real-valued NN | 92.39 ± 0.10 | 71.08 ± 0.45 | 92.78 ± 1.25 | 39.01 ± 3.42 |
| Complex KAF | 97.18 ± 0.27 | 81.94 ± 0.91 | 98.11 ± 2.04 | 71.79 ± 2.40 |
| WL-KAF (Case 1) | 97.50 ± 0.41 | 77.29 ± 2.43 | 98.46 ± 0.12 | 74.57 ± 0.80 |
| WL-KAF (Case 2) | 96.22 ± 0.74 | 82.89 ± 1.09 | 99.03 ± 1.01 | 72.53 ± 0.36 |
WL-KAFs achieved performance improvements over standard KAFs that were statistically significant under paired $t$-tests. Convergence with WL-KAFs was also typically faster, plateauing at roughly 4,000 iterations compared to roughly 6,000 for standard KAFs (Scardapane et al., 2019).
5. Practical Considerations and Recommendations
- Expressiveness vs. cost: WL-KAFs provide a substantial gain in nonlinear modeling power with negligible increase in parameter count or computational footprint. Their use is preferred over standard KAFs except in highly constrained deployment scenarios.
- Case selection: Case 1 is suitable when the real and imaginary parts of the nonlinearity are approximately independent, minimizing hyperparameter requirements. Case 2 is indicated where modeling cross-correlation is necessary, e.g., in signal processing.
- Hyperparameter tuning:
- Dictionary: choose elements covering the typical range of preactivations (e.g., a uniform grid around zero in the complex plane).
- Bandwidth: initialize with the median heuristic or rules from the real-valued KAF literature, then allow further tuning by gradient descent.
- Dictionary size should remain moderate (16–64) to balance capacity and overfitting risk.
- Regularization and optimization:
- Apply weight decay to the kernel mixing coefficients.
- Employ early stopping based on validation loss.
- Adaptive optimizers such as Adagrad or Adam handle the disparity in gradient scales.
- Monitor gradient norms for real and imaginary components independently to maintain training stability.
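As a small illustration of the last item, the helper below aggregates the gradient norms of the real and imaginary components separately; it assumes the gradients are available as complex NumPy arrays (e.g., produced by a Wirtinger-calculus backward pass).

```python
import numpy as np

def split_grad_norms(grads):
    """Aggregate norms of the real and imaginary gradient components separately,
    over a list of complex gradient arrays, so the two can be tracked per iteration."""
    g_re = np.sqrt(sum(np.sum(g.real ** 2) for g in grads))
    g_im = np.sqrt(sum(np.sum(g.imag ** 2) for g in grads))
    return g_re, g_im
```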
For practical deployment, implementing WL-KAFs involves fixing the complex dictionary, coding the forward and backward routines for both the kernel and the pseudo-kernel, integrating with CVNN-capable libraries (e.g., TensorFlow, PyTorch), and tuning the principal hyperparameters: the dictionary size $D$, the kernel bandwidth $\gamma$ (optionally separate bandwidths $\gamma_R$ and $\gamma_I$), and the regularization constant. These steps suffice to equip CVNNs with neuron-specific, highly expressive nonlinearities suitable for a broad class of complex-valued learning problems (Scardapane et al., 2019).
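The module below is a minimal PyTorch sketch of such a WL-KAF layer; it assumes a PyTorch version with complex-tensor autograd support, reuses the conjugate-input pseudo-kernel stand-in from the earlier sketches, and is not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class WLKAF(nn.Module):
    """Elementwise WL-KAF layer with a shared complex dictionary and per-neuron
    complex mixing coefficients (a sketch, not the authors' reference code)."""
    def __init__(self, num_neurons, dict_side=4, lim=2.0, gamma=1.0):
        super().__init__()
        g = torch.linspace(-lim, lim, dict_side)
        re, im = g.repeat_interleave(dict_side), g.repeat(dict_side)
        self.register_buffer("dictionary", torch.complex(re, im))      # (D,) fixed grid
        D = dict_side * dict_side
        self.alpha = nn.Parameter(0.1 * torch.randn(num_neurons, D, dtype=torch.cfloat))
        self.beta = nn.Parameter(0.1 * torch.randn(num_neurons, D, dtype=torch.cfloat))
        self.gamma = gamma  # fixed here; could also be made trainable

    def forward(self, s):                            # s: (batch, num_neurons), complex
        se = s.unsqueeze(-1)                         # (batch, N, 1)
        dc = self.dictionary.conj()
        k = torch.exp(-self.gamma * (se - dc) ** 2)          # kernel responses
        kt = torch.exp(-self.gamma * (se.conj() - dc) ** 2)  # pseudo-kernel responses (stand-in)
        return (self.alpha * k + self.beta * kt).sum(dim=-1) # (batch, N) complex activations
```

In use, the final complex outputs still need a mapping to real values (e.g., magnitudes feeding a softmax, as in the experiments above) before computing a real-valued loss; optimizers such as Adam in recent PyTorch releases can update complex parameters directly.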
6. Context and Significance
Widely linear kernel activation functions address a foundational limitation of standard (analytic) KAFs by enabling the modeling of arbitrary dependencies between real and imaginary parts in complex-valued transformations. This is accomplished without increasing the number of trainable parameters per neuron. The observed empirical gains—in accuracy and convergence rate—across standard complex pattern recognition benchmarks underscore their utility in both scientific and engineering contexts. Their introduction represents a principled extension of KAF theory, contributing to the expressiveness and practicality of modern CVNNs (Scardapane et al., 2019).