Kernel Activation Functions (KAFs)
- KAFs are non-parametric, trainable activations that model neuron nonlinearities via kernel expansions, allowing each neuron to learn a smooth, flexible activation shape.
- They integrate seamlessly into diverse architectures—such as feedforward, convolutional, graph, and recurrent networks—improving convergence and performance.
- Effective use of KAFs requires careful hyperparameter tuning and regularization to balance increased computational cost with enhanced approximation capabilities.
Kernel Activation Functions (KAFs) are a class of non-parametric, trainable activation functions for neural networks, in which each neuron's nonlinearity is parameterized as a kernel expansion over a fixed set of centers, with mixing coefficients learned via backpropagation. KAFs enable each neuron to learn a highly flexible, smooth, and potentially non-convex activation shape specific to the data. These properties, together with their amenability to standard regularization and hardware vectorization, have led to their integration and empirical success in diverse architectures: feedforward and convolutional networks, graph neural networks, recurrent networks, Siamese and few-shot models, and complex-valued networks.
1. Mathematical Formulation and Core Model
The canonical KAF expresses a scalar-to-scalar activation as
$$f(s) = \sum_{i=1}^{D} \alpha_i \, \kappa(s, d_i),$$
where:
- $\alpha_i$ are the learnable mixing coefficients,
- $d_i$ are fixed dictionary centers, chosen by uniform sampling over a bounded interval (e.g., $[-2, 2]$),
- $\kappa(\cdot,\cdot)$ is a positive-definite kernel, most commonly the Gaussian: $\kappa(s, d_i) = \exp\!\left(-\gamma (s - d_i)^2\right)$, with bandwidth $\gamma > 0$.
This mechanism implements a dictionary-based, neuron-wise kernel smoother in the activation space, which is parameterized linearly in $\boldsymbol{\alpha}$. All coefficients are optimized alongside the usual weight parameters via gradient descent and back-propagation, with derivatives $\partial f / \partial \alpha_i = \kappa(s, d_i)$ and, for the Gaussian kernel, $\partial f / \partial s = -2\gamma \sum_{i=1}^{D} \alpha_i (s - d_i)\, \kappa(s, d_i)$ (Scardapane et al., 2017, Scardapane et al., 2018, Jadon et al., 2019).
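As a concrete illustration, the scalar expansion and its gradients can be sketched in NumPy. This is a minimal sketch, not a reference implementation: the class name, the random initialization, and the default hyperparameters are illustrative choices.

```python
import numpy as np

def gaussian_kernel(s, d, gamma):
    """Gaussian kernel between a pre-activation s and dictionary centers d."""
    return np.exp(-gamma * (s - d) ** 2)

class KAF:
    """Scalar KAF: f(s) = sum_i alpha_i * kappa(s, d_i) (illustrative sketch)."""

    def __init__(self, D=20, boundary=2.0):
        self.d = np.linspace(-boundary, boundary, D)   # fixed, uniformly spaced centers
        delta = self.d[1] - self.d[0]                  # grid spacing
        self.gamma = 1.0 / (2.0 * delta ** 2)          # rule-of-thumb bandwidth
        # Random initial mixing coefficients (fixed seed for reproducibility here).
        self.alpha = np.random.default_rng(0).normal(scale=0.3, size=D)

    def __call__(self, s):
        return gaussian_kernel(s, self.d, self.gamma) @ self.alpha

    def grad_alpha(self, s):
        # df/dalpha_i = kappa(s, d_i): the expansion is linear in alpha.
        return gaussian_kernel(s, self.d, self.gamma)

    def grad_s(self, s):
        # df/ds = -2*gamma * sum_i alpha_i * (s - d_i) * kappa(s, d_i)
        k = gaussian_kernel(s, self.d, self.gamma)
        return -2.0 * self.gamma * np.sum(self.alpha * (s - self.d) * k)
```

In a real network, `grad_alpha` and `grad_s` would of course be supplied by automatic differentiation; they are spelled out here only to mirror the derivatives above.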
KAFs readily generalize to multi-dimensional variants (e.g., 2D-KAFs over $\mathbb{R}^2$ with $D^2$ centers), and to polynomial or rational-quadratic kernels, supporting both increased expressivity and compatibility with various architectures.
2. Hyperparameter Selection, Initialization, and Regularization
Hyperparameter choices critical to KAFs include the number of centers $D$, the dictionary interval, and the kernel bandwidth $\gamma$. In practical settings, $D \approx 10$–$20$ typically suffices for real tasks; centers are distributed uniformly, and the Gaussian bandwidth is set as $\gamma = 1/(2\Delta^2)$, $\Delta$ being the grid spacing (Scardapane et al., 2018, Jadon et al., 2019).
Initialization of $\boldsymbol{\alpha}$ can leverage kernel ridge regression to approximate a desirable starting shape such as ReLU or ELU: $\boldsymbol{\alpha} = (\mathbf{K} + \varepsilon \mathbf{I})^{-1} \mathbf{t}$, where $\mathbf{K}$ is the kernel matrix over the dictionary points, $\mathbf{t}$ collects the target activation values at those points, and $\varepsilon > 0$ is kept small for stability.
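The ridge-regression initializer can be sketched as follows. This is a minimal NumPy illustration under the formulation above; the function name and the choice of ReLU as the target shape are my own.

```python
import numpy as np

def kaf_init_krr(d, gamma, target_fn, eps=1e-6):
    """Initialize KAF coefficients so the expansion approximates target_fn.

    Solves the kernel ridge regression system (K + eps*I) alpha = t over the
    dictionary, where K_ij = kappa(d_i, d_j) and t_i = target_fn(d_i).
    """
    K = np.exp(-gamma * (d[:, None] - d[None, :]) ** 2)   # Gram matrix on centers
    t = target_fn(d)                                      # desired activation values
    return np.linalg.solve(K + eps * np.eye(len(d)), t)

# Example: initialize a 20-center KAF to mimic ReLU on [-2, 2].
d = np.linspace(-2.0, 2.0, 20)
delta = d[1] - d[0]
gamma = 1.0 / (2.0 * delta ** 2)
alpha = kaf_init_krr(d, gamma, lambda x: np.maximum(x, 0.0))

def kaf(s):
    return np.exp(-gamma * (s - d) ** 2) @ alpha
```

After this initialization the unit starts out close to ReLU, and training then deforms the shape freely via the $\alpha_i$.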
The coefficient vector $\boldsymbol{\alpha}$ can be regularized with classic penalties, e.g., $\ell_p$ norms with $p = 1$ or $2$, and kernel parameters (notably $\gamma$) may be further regularized or even learned during training (Scardapane et al., 2018, Scardapane et al., 2017, Jadon et al., 2019).
3. Theoretical Properties and Generalization
KAFs possess universal approximation properties on compact domains: with sufficiently many centers and a Gaussian kernel, any continuous univariate function can be approximated arbitrarily well (Scardapane et al., 2017). The expansion is $C^\infty$-smooth whenever the kernel is (e.g., Gaussian), conferring smoother gradients and better optimization properties than piecewise-linear activations.
Generalization theory has established that, provided the kernel bandwidth scales inversely with the square of the maximum layer width ($\gamma = \mathcal{O}(1/h^2)$, with $h$ the maximum hidden width), the empirical loss is Lipschitz and smooth, and the SGD algorithm is uniformly stable, yielding classical generalization bounds in the sense of Hardt et al. (2016) (Cirillo et al., 2019).
4. Architectural Integration and Variants
Standard Feedforward/CNNs
KAFs are used as drop-in replacements for pointwise activations in both dense and convolutional layers. In vectorized implementations, per-neuron activations become efficient batch-wise matrix–vector multiplies between the kernel matrix and the coefficient vector $\boldsymbol{\alpha}$ (Scardapane et al., 2017).
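A hedged sketch of such a vectorized layer in NumPy, assuming a dictionary shared across neurons and per-neuron coefficient rows (the function name and shape conventions are illustrative assumptions):

```python
import numpy as np

def kaf_layer(S, d, alpha, gamma):
    """Vectorized KAF over a batch of pre-activations.

    S:     (batch, units) pre-activation matrix
    d:     (D,) dictionary of centers, shared across neurons
    alpha: (units, D) per-neuron mixing coefficients
    Returns a (batch, units) activation matrix, computed as a batched
    kernel-matrix / coefficient-vector product.
    """
    # Kernel tensor: (batch, units, D), one Gaussian profile per element of S.
    K = np.exp(-gamma * (S[..., None] - d) ** 2)
    # Contract each neuron's kernel row with its own alpha row.
    return np.einsum('bud,ud->bu', K, alpha)
```

The same pattern maps directly onto GPU tensor libraries, which is what makes KAFs amenable to hardware vectorization.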
In convolutional nets, KAFs can be specialized (e.g., EvenPowLin activations based on polynomial kernels) to enhance handling of specific symmetries, such as inversion-robustness in vision tasks (Nasiri et al., 2021).
Graph Neural Networks
KAFs have been shown to improve GCNs, yielding notable accuracy gains on semi-supervised node classification benchmarks such as Cora and Citeseer, with fewer epochs to convergence and no comparable benefit from simply increasing network width/depth. The integration is achieved by substituting the standard nonlinearity in the (spectral) GCN update with the KAF (Scardapane et al., 2018).
Broader graph-adaptive activation functions also permit kernelized, neighborhood-adaptive nonlinearities, proven to preserve permutation equivariance and Lipschitz stability (Iancu et al., 2020).
Recurrent Neural Networks
Flexible gates in RNNs (e.g., GRUs) can be built by wrapping a KAF with a sigmoid and adding a residual linear path, enabling the gate to learn a wide range of shapes. This leads to improved accuracy on long-range sequence tasks (e.g., pixel-wise MNIST), faster convergence, and greater robustness to data permutations (Scardapane et al., 2018).
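One plausible reading of this gate construction is sketched below; the exact composition in the cited work may differ, so treat the function as an assumption-laden illustration rather than the paper's definition. With $\boldsymbol{\alpha} = \mathbf{0}$ it reduces to the standard logistic gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flexible_gate(s, d, alpha, gamma):
    """Sketch of a flexible RNN gate: a sigmoid wrapped around a residual
    linear path plus a KAF correction, so the gate defaults to the usual
    logistic shape when alpha ~ 0 but can learn other monotone or
    non-monotone profiles."""
    kaf = np.exp(-gamma * (s[..., None] - d) ** 2) @ alpha
    return sigmoid(s + kaf)   # residual linear path + learned kernel correction
```

Because the KAF term starts near zero under small initialization, training begins from ordinary GRU behavior and only deviates where the data rewards it.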
Siamese and Metric Learning Models
Replacing ReLU with KAF or 2D-KAF in Siamese architectures fosters tighter intra-class clustering and increased inter-class separation in the learned embedding space, consistently improving few-shot classification performance (e.g., one-shot accuracy on Omniglot improves from the ReLU baseline to KAF and further to 2D-KAF), albeit with a moderate increase in per-epoch computational cost (Jadon et al., 2019).
Complex-Valued Networks
KAFs extend to the complex domain $\mathbb{C}$ by defining neuron-wise expansions over fixed complex dictionaries and complex-valued positive-definite kernels, with coefficients learned via Wirtinger calculus.
- Fully complex KAFs and their "widely linear" extensions model the full set of noncircular, vector-valued, and non-holomorphic activation behaviors, leading to substantial accuracy gains on tasks such as complex-valued MNIST, where KAF-based and widely-linear-KAF-based complex networks outperform real-valued baselines (Scardapane et al., 2018, Scardapane et al., 2019).
Multikernel and Learnable Activation Extensions
Multi-KAFs expand flexibility by learning per-neuron convex combinations of multiple base kernels (e.g., Gaussian, polynomial), enabling each unit to adapt the form and scale of its nonlinearity (Scardapane et al., 2019).
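A minimal sketch of a multi-KAF unit follows. The softmax mixing and the particular kernel pair (Gaussian plus degree-2 polynomial) are my own illustrative assumptions; the cited work may use different base kernels or a different convex parameterization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_kaf(s, d, alpha, gamma, mix_logits):
    """Sketch of a multi-KAF unit: a learned convex combination of two base
    kernels (Gaussian and polynomial here, as example choices), followed by
    the usual mixing with the coefficient vector alpha."""
    w = softmax(mix_logits)                 # convex mixing weights per neuron
    k_gauss = np.exp(-gamma * (s - d) ** 2)
    k_poly = (1.0 + s * d) ** 2             # degree-2 polynomial kernel
    K = w[0] * k_gauss + w[1] * k_poly      # combined kernel evaluations
    return K @ alpha
```

Because the combination is convex, the mixed kernel remains positive definite, and each unit can interpolate between a localized (Gaussian) and a global (polynomial) response.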
Recent advances in random feature models (RFLAF) integrate KAF-like activations within random feature expansions, achieving universal approximation with essentially the parameter count of standard random features, and offering explicit interpretability by directly reconstructing the learned activation $\hat{\sigma}$ from the final weight vector (Ma et al., 2024).
5. Empirical Performance and Practical Impact
KAFs and their variants consistently outperform or match strong baselines (ReLU, ELU, PReLU, maxout, parametric splines, etc.) across a suite of tasks—classification, regression, RL policy learning—frequently with shallower networks, fewer parameters, or fewer training epochs:
- Feedforward MLPs: up to $0.5$–$1.3$pp accuracy gain over ReLU and PReLU (Scardapane et al., 2017, Scardapane et al., 2019).
- CNNs: PowerLinear/EvenPowLin achieves parity with ReLU on standard data but orders-of-magnitude better inversion-robustness (on inverted MNIST, EvenPowLin far exceeds ReLU accuracy) (Nasiri et al., 2021).
- RNNs: Significant improvement in sequential modeling, especially for long dependencies (Scardapane et al., 2018).
- GCNs: Empirically unique class separation and improved learning stability (Scardapane et al., 2018).
- Random feature models: Substantial test loss reductions with explicit activation shape recovery (Ma et al., 2024).
Convergence is typically faster (e.g., 50% fewer epochs in GRU+KAF), and training remains stable when regularization and sensible initialization are used.
6. Limitations, Trade-offs, and Implementation Considerations
The main limitations of KAFs are the increased per-neuron parameter count ($D$ coefficients per neuron, $D^2$ in 2D), additional computational overhead ($\mathcal{O}(D)$ per neuron, $\mathcal{O}(D^2)$ for 2D-KAF), and the need to choose the dictionary/grid size and bandwidth $\gamma$. Initialization and regularization are essential to mitigate gradient noise and overfitting, especially in the early epochs (Scardapane et al., 2017, Jadon et al., 2019).
Multi-KAFs ameliorate hypersensitivity to a single kernel choice by learning mixtures, but further increase per-neuron parameter and compute cost in proportion to the number of base kernels. Hardware vectorization and careful batching are critical for efficiency, as all kernel computations are parallelizable (Scardapane et al., 2019).
For complex-valued networks, gradient computation, initialization and kernel choice must account for non-holomorphicity and circular/noncircular statistics, typically requiring CR-calculus (Scardapane et al., 2018, Scardapane et al., 2019).
7. Outlook and Extensions
Research directions include automatic kernel-parameter adaptation, hierarchical or data-driven dictionary selection, integration in deeper and convolutional architectures (notably in the complex domain), formal sample-complexity and generalization analyses for complex and graph-adaptive variants, and interpretability of learned nonlinearities at scale (Ma et al., 2024, Scardapane et al., 2019).
KAFs exemplify a conceptually unified design space interpolating between fixed, element-wise nonlinearities and full nonparametric functional learning inside deep networks, providing a flexible toolset for adaptively shaping information flow throughout modern architectures.