Learnable Activation Functions

Updated 30 May 2026

Learnable activation functions are trainable nonlinearities whose shapes are learned alongside network weights, enabling adaptive, data-driven optimization.
They employ basis expansions such as polynomials, splines, Fourier, and RBFs to finely control properties like smoothness, locality, and spectral bias.
Integrating these activations in architectures like MLPs, CNNs, and transformers enhances model expressivity, convergence speed, and robustness.

A learnable activation function is a nonlinearity $f:\mathbb{R}\to\mathbb{R}$ whose shape is determined by trainable parameters and is learned jointly with the other weights of a neural network. Instead of statically fixing the activation at design time (e.g., ReLU, tanh, GELU), the network allocates additional parameter degrees of freedom to express a broad class of nonlinear functions, enabling adaptive, data-driven optimization of the activation profile. This paradigm is now found across multilayer perceptrons, convolutional architectures, random feature models, kernel machines, Kolmogorov–Arnold networks, and domain-specific systems such as physics-informed neural networks (PINNs). The section below synthesizes the methodologies and analysis from contemporary research.

1. Mathematical Formulation and Basis Families

A learnable activation $f(x;\theta)$ is generally written as an expansion over basis functions: $f(x;\theta) = \sum_{i=1}^m \alpha_i\,g_i(x;\phi_i)$, where the $g_i$ are fixed or parameterized basis functions (e.g., monomials, polynomials, radial basis functions (RBFs), trigonometric functions, splines, or piecewise-linear elements), the $\alpha_i$ are trainable linear weights, and the $\phi_i$ are further internal parameters if the basis is itself parameterized (e.g., RBF centers/widths, spline knots), so that $\theta = \{\alpha_i, \phi_i\}_{i=1}^m$. In the case of piecewise polynomials,

$f(x;\theta) = \sum_{k=0}^{K-1} c_k\,B_k(x)$

with $B_k$ as B-spline basis functions and $c_k$ learnable coefficients (Farea et al., 2024).
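The B-spline expansion above can be made concrete with a short sketch. The code below evaluates $\sum_k c_k B_k(x)$ with the Cox-de Boor recursion in plain Python. The function names, the uniform knot vector, and the cubic degree are illustrative assumptions, not details taken from the cited work; in practice the coefficients would be trained by an autodiff framework.

```python
def bspline_basis(k, degree, knots, x):
    """Cox-de Boor recursion: value of the k-th B-spline of the given degree at x."""
    if degree == 0:
        # Half-open interval, so x equal to the last knot evaluates to 0.
        return 1.0 if knots[k] <= x < knots[k + 1] else 0.0
    left_den = knots[k + degree] - knots[k]
    right_den = knots[k + degree + 1] - knots[k + 1]
    left = 0.0
    if left_den != 0.0:
        left = (x - knots[k]) / left_den * bspline_basis(k, degree - 1, knots, x)
    right = 0.0
    if right_den != 0.0:
        right = ((knots[k + degree + 1] - x) / right_den
                 * bspline_basis(k + 1, degree - 1, knots, x))
    return left + right


def bspline_activation(x, coeffs, knots, degree=3):
    """f(x) = sum_k c_k B_k(x); coeffs play the role of the learnable c_k."""
    return sum(c * bspline_basis(k, degree, knots, x) for k, c in enumerate(coeffs))


# Assumed setup: 11 uniform knots on [-5, 5] give K = 11 - 3 - 1 = 7 cubic bases.
knots = [float(t) for t in range(-5, 6)]
coeffs = [1.0] * 7

# With all coefficients equal to 1 the bases sum to 1 on [knots[3], knots[7]) = [-2, 2).
print(bspline_activation(0.3, coeffs, knots))
```

Because each cubic basis function is nonzero on only four knot intervals, changing one coefficient $c_k$ reshapes the activation locally, which is the locality property that the later sections attribute to spline bases.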

The following basis families are central:

Family	Example Formula	Trainable Parameters
Gaussian RBF	$\sum_{i=1}^k\alpha_i \exp(-\|x-c_i\|^2/\sigma^2)$	$\alpha_i$, $c_i$, $\sigma$
Fourier	$a_0 + \sum_{k=1}^{K}\left(a_k\cos(kx) + b_k\sin(kx)\right)$	$a_0$, $a_k$, $b_k$
Chebyshev	$\sum_{k=0}^{K} c_k\,T_k(x)$	$c_k$
Jacobi	$\sum_{k=0}^{K} c_k\,P_k^{(a,b)}(x)$	$c_k$ (optionally $a$, $b$)
B-spline	$\sum_{k=0}^{K-1} c_k\,B_k(x)$	$c_k$

These families allow precise tailoring of the activation's locality or globality, smoothness, and spectral properties (Farea et al., 2024, Khalfaoui-Hassani et al., 3 Feb 2025). The dual-parameter forms (e.g., Tangma, Dual Parametric ReLU) introduce explicit learned shifts and slopes (Golwala, 2 Jul 2025, Balaji et al., 2019).
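As a rough illustration of how these families differ in code, the sketch below evaluates a Gaussian RBF, a truncated Fourier series, and a Chebyshev expansion using their textbook definitions. The function names and argument layouts are assumptions made for this example; the cited papers may use different normalizations or input rescaling.

```python
import math


def gaussian_rbf(x, alphas, centers, sigma):
    """sum_i alpha_i * exp(-(x - c_i)^2 / sigma^2): local bumps around each center."""
    return sum(a * math.exp(-((x - c) ** 2) / sigma ** 2)
               for a, c in zip(alphas, centers))


def fourier(x, a0, cos_coeffs, sin_coeffs):
    """a0 + sum_k (a_k cos(kx) + b_k sin(kx)) for k = 1..K: global, periodic."""
    return a0 + sum(a * math.cos((k + 1) * x) + b * math.sin((k + 1) * x)
                    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs)))


def chebyshev(x, coeffs):
    """sum_k c_k T_k(x) via T_{k+1} = 2x T_k - T_{k-1}; x is assumed to lie in [-1, 1]."""
    t_prev, t_curr = 1.0, x
    total = coeffs[0] * t_prev
    if len(coeffs) > 1:
        total += coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
        total += c * t_curr
    return total


print(gaussian_rbf(0.0, [2.0], [0.0], 1.0))   # single bump evaluated at its center
print(fourier(0.0, 0.5, [1.0], [1.0]))        # 0.5 + cos(0) + sin(0)
print(chebyshev(0.5, [0.0, 0.0, 1.0]))        # T_2(0.5) = 2 * 0.25 - 1
```

Chebyshev and Jacobi polynomials are defined on $[-1,1]$, so implementations typically squash or rescale the pre-activation first; that step is omitted here.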

2. Training Protocols and Architectural Integration

Learnable activations are inserted wherever static nonlinearities would occur:

MLPs and CNNs: Each layer $\ell$ can be assigned its own activation $f_\ell(x;\theta_\ell)$. The network parameter set $\Theta = \{W_\ell, b_\ell, \theta_\ell\}_{\ell=1}^{L}$ then includes both the weight/bias matrices and the activation parameters. Standard optimization steps (SGD/Adam) propagate gradients through both sets (Farea et al., 2024, Apicella et al., 2019, Minhas et al., 2019).
Piecewise Linear / Spline Approaches: In APL, SPLASH, or TV-regularized splines, per-neuron or per-layer activation parameters are updated via gradient descent, sometimes with additional sparsity or total-variation regularizers to control complexity (Agostinelli et al., 2014, Tavakoli et al., 2020, Ducotterd et al., 2022).
Basis Expansion Methods: Polynomial, Fourier, or RBF coefficients are parameterized at a chosen degree/order. Proper variance-preserving initialization is critical to stable learning (Khalfaoui-Hassani et al., 3 Feb 2025, Farea et al., 2024).
Random Feature Models: Learnable activation functions enter as basis expansions on random projections, with weights adapted by ridge regression or similar schemes (Ma et al., 17 Oct 2025, Ma et al., 2024).
Transformers: The fixed GELU in each feed-forward block can be replaced with a rational activation function (RAF), which adds only a small parametric overhead and has been reported to improve downstream accuracy over GELU (Fang et al., 2022).
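To illustrate what training the activation jointly with the weights means, the following sketch fits a single unit $y = f(wx + b;\alpha)$ with a Gaussian RBF activation by full-batch gradient descent, updating $w$, $b$, and the coefficients $\alpha_i$ in the same step. Everything here (the toy target $\sin x$, the fixed centers, the learning rate, the hand-written gradients) is an assumption chosen to keep the example dependency-free; a real model would rely on SGD or Adam in an autodiff framework.

```python
import math

CENTERS = [-2.0, -1.0, 0.0, 1.0, 2.0]  # fixed RBF centers (assumed)
SIGMA = 1.0                             # fixed RBF width (assumed)


def basis(z):
    return [math.exp(-((z - c) ** 2) / SIGMA ** 2) for c in CENTERS]


def basis_grad(z):
    """Derivative of each Gaussian basis function with respect to z."""
    return [-2.0 * (z - c) / SIGMA ** 2 * math.exp(-((z - c) ** 2) / SIGMA ** 2)
            for c in CENTERS]


def forward(x, w, b, alphas):
    return sum(a * g for a, g in zip(alphas, basis(w * x + b)))


def mse(data, w, b, alphas):
    return sum((forward(x, w, b, alphas) - y) ** 2 for x, y in data) / len(data)


def train(data, steps=300, lr=0.02):
    w, b, alphas = 1.0, 0.0, [0.0] * len(CENTERS)
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        grad_alphas = [0.0] * len(CENTERS)
        for x, y in data:
            z = w * x + b
            phi, dphi = basis(z), basis_grad(z)
            residual = sum(a * g for a, g in zip(alphas, phi)) - y
            df_dz = sum(a * d for a, d in zip(alphas, dphi))
            # Chain rule: weight and bias gradients pass through the activation.
            grad_w += 2.0 * residual * df_dz * x / n
            grad_b += 2.0 * residual * df_dz / n
            # Activation coefficients receive gradients like any other parameter.
            for i, g in enumerate(phi):
                grad_alphas[i] += 2.0 * residual * g / n
        w -= lr * grad_w
        b -= lr * grad_b
        alphas = [a - lr * g for a, g in zip(alphas, grad_alphas)]
    return w, b, alphas


data = [(i / 10.0, math.sin(i / 10.0)) for i in range(-30, 31)]
initial_loss = mse(data, 1.0, 0.0, [0.0] * len(CENTERS))
w, b, alphas = train(data)
final_loss = mse(data, w, b, alphas)
print(initial_loss, final_loss)
```

The point of the sketch is only that the activation parameters sit in the same update loop as the weights; no claim is made about the accuracy reached.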

Kolmogorov–Arnold Networks (KANs) replace fixed post-summation nonlinearities with learnable univariate nonlinearities placed on the edges (one per input variable and output unit), which can allow accurate function approximation with smaller models than comparable fixed-activation networks (Farea et al., 2024, Reinhardt et al., 2024).
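A minimal sketch of a KAN-style layer makes the structural difference visible: each output is a sum of separate univariate functions of the inputs, $y_j = \sum_i \varphi_{j,i}(x_i)$, rather than one nonlinearity applied after a weighted sum. The sinusoidal edge parameterization below is an illustrative assumption loosely inspired by SineKAN; the original KAN formulation uses B-splines, and neither implementation is reproduced here.

```python
import math


def edge_function(x, coeffs):
    """Learnable univariate edge function: phi(x) = sum_k coeffs[k] * sin((k + 1) * x)."""
    return sum(c * math.sin((k + 1) * x) for k, c in enumerate(coeffs))


def kan_layer(inputs, edge_coeffs):
    """edge_coeffs[j][i] holds the coefficients of the edge from input i to output j."""
    return [
        sum(edge_function(x, edge_coeffs[j][i]) for i, x in enumerate(inputs))
        for j in range(len(edge_coeffs))
    ]


inputs = [math.pi / 2.0, 0.0]
edge_coeffs = [
    [[1.0, 0.0], [1.0, 1.0]],  # edges feeding output 0
    [[0.0, 1.0], [0.0, 0.0]],  # edges feeding output 1
]
outputs = kan_layer(inputs, edge_coeffs)
print(outputs)
```

With two inputs, two outputs, and two coefficients per edge, the layer carries eight trainable numbers and no separate weight matrix; the edge functions take over that role.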

3. Expressivity, Training Dynamics, and Trade-offs

Learnable activation functions offer quantifiable increases in expressivity, often characterized by comparisons of Neural Tangent Kernel (NTK) spectra, Hessian eigenvalues, and approximation-theoretic coverage:

Spectral Bias: Studies in PINNs reveal that families such as B-splines and RBFs produce flatter NTK eigenvalue spectra (reduced spectral bias), enabling rapid acquisition of high-frequency target components (Farea et al., 2024). However, excessively flat spectra can cause convergence instability due to large Hessian curvature (a large maximum Hessian eigenvalue $\lambda_{\max}$), while steeply biased bases delay the acquisition of high-frequency information but favor stable, monotonic learning.
Approximation Power: Hermite, Fourier, and tropical polynomial activations can be initialized and tuned to approximate classical activations (ReLU, GELU) to arbitrary accuracy, enabling transfer or fine-tuning of pre-trained models (Khalfaoui-Hassani et al., 3 Feb 2025, Farea et al., 2024).
Model Capacity: Layer- or channel-wise degree allocation in polynomial or spline-based activations increases the total degree or piecewise granularity of the network, enhancing the effective representational power relative to fixed activations (Goyal et al., 2019).
Stability and Scalability: Local-support bases (e.g., B-splines, RBFs) are preferred in regimes with sharp gradients or non-periodic boundary conditions (e.g., stiff classic PDEs), mitigating the Gibbs phenomenon and promoting stable curvature (Farea et al., 2024).

The table below summarizes NTK and Hessian trade-offs for representative bases in neural PDE solvers (Farea et al., 2024):

Activation Base	NTK Spectral Bias	Hessian $\lambda_{\max}$	Generalization
B-spline, RBF	Low	Moderate/Low	Stable, high
Fourier, Chebyshev	Lowest	Highest	Unstable, problem-dependent
Tanh (fixed)	Moderate	Moderate	Most robust on smooth PDEs
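The link between the basis and Hessian curvature can be checked directly in the simplest setting. If the activation is linear in its coefficients, $f(x) = \sum_i \alpha_i g_i(x)$, and is fitted with a mean-squared error, the Hessian with respect to the coefficients is $(2/N)\,\Phi^\top\Phi$ with $\Phi_{ni} = g_i(x_n)$, and plain gradient descent is stable only for step sizes below $2/\lambda_{\max}$. The sketch below estimates $\lambda_{\max}$ by power iteration for two small example bases. The bases, sample grid, and function names are assumptions, and the setting is far simpler than the PINN losses analysed in the cited study.

```python
import math


def design_matrix(xs, basis_fns):
    """Rows are samples, columns are basis functions g_i evaluated at each sample."""
    return [[g(x) for g in basis_fns] for x in xs]


def loss_hessian(phi):
    """Hessian of the mean-squared error w.r.t. the coefficients: (2/N) Phi^T Phi."""
    n, m = len(phi), len(phi[0])
    return [[2.0 / n * sum(row[i] * row[j] for row in phi) for j in range(m)]
            for i in range(m)]


def largest_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate here because the Hessian is symmetric PSD."""
    m = len(matrix)
    v = [1.0 + 0.1 * i for i in range(m)]
    scale = math.sqrt(sum(c * c for c in v))
    v = [c / scale for c in v]
    eig = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
        eig = norm  # ||H v|| for unit v converges to lambda_max
    return eig


xs = [i / 10.0 for i in range(-30, 31)]
bases = {
    "rbf": [lambda x, c=c: math.exp(-((x - c) ** 2)) for c in (-2.0, -1.0, 0.0, 1.0, 2.0)],
    "fourier": [lambda x, k=k: math.sin(k * x) for k in (1, 2, 3)]
               + [lambda x, k=k: math.cos(k * x) for k in (1, 2)],
}
lambda_max = {name: largest_eigenvalue(loss_hessian(design_matrix(xs, fns)))
              for name, fns in bases.items()}
for name, lam in lambda_max.items():
    print(name, "lambda_max:", lam, "max stable GD step:", 2.0 / lam)
```

The $2/\lambda_{\max}$ bound applies to plain gradient descent on a quadratic loss; adaptive optimizers such as Adam rescale the step per coordinate, so the figure is best read as a diagnostic rather than a hard limit.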

4. Specialized Constructions and Algorithmic Innovations

Several strategies have emerged for constructing learnable activation functions:

Parametric Combinations: Linear and affine convex combinations of a small basis set (e.g., ReLU, tanh, id) with simplex or affine constraints on coefficients, yielding monotonicity (convex-hull) or non-monotonic richness (affine-hull) (Manessi et al., 2018).
Discrete Mixture Selection: FlexAct uses Gumbel-Softmax sampling with a straight-through estimator to select stochastically among a finite dictionary of candidate functions during training (Kumar et al., 10 Jan 2026). Regularization prevents scale bias toward unbounded activations.
Series Expansions: Learnable Series Linear Units (LSLU) employ several shifted base activations summed with learnable amplitude and residual linear terms to increase per-layer nonlinearity and improve generalization in shallower networks (Feng et al., 2024).
Rational and Polynomial Parameterization: Rational function activations (as ratio of learnable polynomials) can be plugged directly into transformers, while learnable polynomial activations are now subject to global polynomial optimization (e.g., via Moment-SOS hierarchy) for certified solution recovery, especially in low-dimension/dataset settings (Zhang et al., 4 Oct 2025, Fang et al., 2022).
Functional/Optimization Theoretic: For Lipschitz-constrained nets, the theoretically optimal 1-Lipschitz activation under second-order total variation regularization is a linear spline, and optimal parameters can be recovered with a representer theorem (Ducotterd et al., 2022).
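Two of the constructions listed above, the convex combination of base activations and the Gumbel-Softmax selection among candidates, share the same softmax machinery and can be sketched together. The base dictionary (ReLU, tanh, identity) follows the example in the text; the function names, the softmax parameterization of the simplex constraint, and the temperature are assumptions, and this is not the FlexAct implementation. Without autodiff, the straight-through estimator itself cannot be shown, so only the forward pass appears.

```python
import math
import random

# Candidate dictionary: ReLU, tanh, identity.
BASE_ACTIVATIONS = [lambda x: max(0.0, x), math.tanh, lambda x: x]


def softmax(logits, temperature=1.0):
    scaled = [value / temperature for value in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def convex_combination(x, logits):
    """Weights lie on the simplex, so the result stays in the convex hull of the bases."""
    weights = softmax(logits)
    return sum(w * f(x) for w, f in zip(weights, BASE_ACTIVATIONS))


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: add Gumbel noise to the logits, then apply a tempered softmax."""
    noisy = []
    for value in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(value - math.log(-math.log(u)))
    return softmax(noisy, temperature)


def hard_select(x, soft_weights):
    """Forward pass of a hard selection: apply only the highest-weighted candidate."""
    index = max(range(len(soft_weights)), key=soft_weights.__getitem__)
    return BASE_ACTIVATIONS[index](x)


rng = random.Random(0)
sample = gumbel_softmax([0.0, 1.0, -1.0], temperature=0.5, rng=rng)
print(convex_combination(2.0, [0.0, 0.0, 0.0]), sample, hard_select(2.0, sample))
```

In an autodiff framework the hard selection would be used in the forward pass while gradients flow through the soft weights, which is what the straight-through estimator refers to. Lowering the temperature pushes the soft sample toward a one-hot vector.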

5. Applications, Empirical Results, and Task-Specific Guidelines

Empirical evaluations consistently report that well-constructed learnable activation functions improve test accuracy, convergence, and sometimes adversarial robustness across domains:

Physics-Informed Neural Networks (PINNs): Proper basis selection is task-dependent. For globally oscillatory PDEs, tanh or low-degree Chebyshev suffice; for multi-scale or sharp-boundary PDEs, local-support bases (B-splines, RBFs) are critical. Overly global bases risk instability unless stringent boundary handling is implemented (Farea et al., 2024).
Transformer LLMs: RAFs yield lower perplexity and higher accuracy (up to +5.7 points on GLUE in low-data regimes) than GELU, with each pre-trained layer specializing its solution to the target task (Fang et al., 2022).
Vision Networks: Empirical gains of 0.5–3 percentage points accuracy are typical for series-based (LSLU), spline-based, or basis-combination activations, with typically small per-layer parameter and runtime overhead (Feng et al., 2024, Tavakoli et al., 2020, Agostinelli et al., 2014).
Random Feature Models: Integrating learnable activation functions with effective weighted-sampling (e.g., via leverage scores) achieves low excess risk with an order of magnitude fewer features than plain random sampling; see RFLAF and its empirical validation (Ma et al., 17 Oct 2025, Ma et al., 2024).
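For the rational activations mentioned in the transformer results above, a small sketch shows the basic form: a ratio of two learnable polynomials. The "safe" denominator $1 + |b_1 x + b_2 x^2 + \dots|$ used below is an assumption borrowed from Padé-style activation units to avoid division by zero; it is not confirmed to be the exact parameterization of Fang et al., and the coefficient values are arbitrary.

```python
def polyval(coeffs, x):
    """Evaluate a polynomial with coefficients in increasing-degree order (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def rational_activation(x, numerator, denominator):
    """P(x) / (1 + |b_1 x + b_2 x^2 + ...|), where denominator = [b_1, b_2, ...]."""
    p = polyval(numerator, x)
    q = 1.0 + abs(x * polyval(denominator, x))  # always >= 1, so the ratio has no poles
    return p / q


# With an empty denominator the unit reduces to the numerator polynomial.
print(rational_activation(2.0, [0.0, 1.0], []))
# P(1) = 1 + 2 + 3 = 6 and Q(1) = 1 + |1| = 2.
print(rational_activation(1.0, [1.0, 2.0, 3.0], [1.0]))
```

A degree-5 numerator with a degree-4 denominator would add only ten scalars per activation under this parameterization, consistent with the small overhead described in the text.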

Guidance:

Select basis family and support according to the anticipated task structure (e.g., support sharp gradients or global oscillations).
Monitor the NTK spectrum and Hessian eigenvalues to ensure neither excessive bias nor unstable curvature, for example by checking that the largest Hessian eigenvalue $\lambda_{\max}$ stays compatible with the chosen Adam learning rate $\eta$ (Farea et al., 2024).
Use shallow models with moderate parameterized activations as a starting point, increasing model complexity only if performance and convergence remain stable.
For activation selection problems, discrete approaches (e.g., FlexAct) offer modularity and interpretability (Kumar et al., 10 Jan 2026).

6. Limitations, Open Problems, and Theoretical Foundations

While learnable activation functions have demonstrated broad utility, certain constraints and limitations are observed:

Scalability: For very high-dimensional or deep models, local minima, optimization sensitivity, and parameter blow-up can become problematic, particularly with global supports or high-degree polynomials (Farea et al., 2024, Khalfaoui-Hassani et al., 3 Feb 2025).
Initialization: Proper coefficient normalization and variance-preserving schemes are pivotal, especially for polynomial, Fourier, or spline-based parameterizations (Khalfaoui-Hassani et al., 3 Feb 2025).
Generalization and Overfitting: Overly rich activations without regularization can induce overfitting. Penalties such as norm regularization on the expansion coefficients or second-order total variation for splines are recommended (Ducotterd et al., 2022).
Implementation Overhead: Parameter and computational costs, as well as the need for careful scheduling (e.g., activation hyperparameters, temperature annealing), can increase relative to static activations (Feng et al., 2024, Tavakoli et al., 2020).
Nonconvexity: For polynomial activations, global optimization (Moment-SOS) is possible for small models, but the approach is computationally prohibitive for large-scale networks (Zhang et al., 4 Oct 2025).
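The second-order total variation penalty recommended above has a simple closed form for a linear spline: it is the sum of absolute slope changes at the knots. The sketch below evaluates a linear-spline activation, its penalty, and its Lipschitz constant. The knot grid, the linear extrapolation rule, and the function names are assumptions for illustration and do not reproduce the implementation of Ducotterd et al.

```python
def spline_slopes(knots, values):
    return [(values[k + 1] - values[k]) / (knots[k + 1] - knots[k])
            for k in range(len(knots) - 1)]


def linear_spline(x, knots, values):
    """Piecewise-linear activation; extrapolates linearly beyond the end knots."""
    k = 0
    while k < len(knots) - 2 and x >= knots[k + 1]:
        k += 1
    slope = (values[k + 1] - values[k]) / (knots[k + 1] - knots[k])
    return values[k] + slope * (x - knots[k])


def second_order_tv(knots, values):
    """TV(2) of a linear spline: sum of absolute slope changes at interior knots."""
    slopes = spline_slopes(knots, values)
    return sum(abs(slopes[k + 1] - slopes[k]) for k in range(len(slopes) - 1))


def lipschitz_constant(knots, values):
    """Largest absolute slope; a value <= 1 means the spline is 1-Lipschitz."""
    return max(abs(s) for s in spline_slopes(knots, values))


knots = [-2.0, -1.0, 0.0, 1.0, 2.0]
relu_values = [0.0, 0.0, 0.0, 1.0, 2.0]      # ReLU sampled on the knots
identity_values = [-2.0, -1.0, 0.0, 1.0, 2.0]

print(second_order_tv(knots, relu_values))      # one slope change, from 0 to 1
print(second_order_tv(knots, identity_values))  # no slope change
print(lipschitz_constant(knots, relu_values))
```

Adding `second_order_tv` to the training loss with a small weight penalizes every kink, so the optimizer tends to keep only the slope changes that reduce the data loss, which is the sparsity effect the regularizer is meant to provide.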

Theoretical results guarantee universal approximation power, explicit polynomial representability, and (in regularized spline cases) existence and finite support of optimal activation forms under broad functional-analytic constraints (Goyal et al., 2019, Ducotterd et al., 2022).

7. Representative Implementations and Practical Recipes

Several open-source implementations and recipes are available:

torchortho provides Hermite, Fourier, and tropical polynomial activations with variance-preserving initialization for PyTorch (Khalfaoui-Hassani et al., 3 Feb 2025).
Spline-based activations and projection layers for 1-Lipschitz functions are available for plug-and-play with standard autodiff frameworks (Ducotterd et al., 2022).
Random Feature Models with Learnable Activations are implementable in PyTorch by broadcasting activation coefficient vectors over random projections (Ma et al., 2024).
Transformers: RAFs, replacing static GELU in standard transformer architectures, increase parameter count negligibly but enable on-the-fly adaptation (Fang et al., 2022).
SineKAN: Learnable sinusoidal edge activations in Kolmogorov–Arnold Networks, initialized with phase grids and amplitude/frequency parameters, implemented with standard stochastic optimization (Reinhardt et al., 2024).
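The idea of broadcasting one activation coefficient vector over random projections can be sketched as a forward pass. The random projection weights stay fixed, and a single shared learnable activation $\sigma(z) = \sum_k a_k g_k(z)$ is applied to every projected feature. The Gaussian basis, the $1/M$ scaling, and all names below are assumptions for illustration; the exact RFLAF parameterization and its fitting procedure are not reproduced.

```python
import math
import random


def random_projections(num_features, dim, rng):
    """Fixed (untrained) Gaussian projection vectors, one per random feature."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def learnable_activation(z, coeffs, centers, width):
    """Shared activation: sum_k a_k * exp(-(z - c_k)^2 / (2 * width^2))."""
    return sum(a * math.exp(-((z - c) ** 2) / (2.0 * width ** 2))
               for a, c in zip(coeffs, centers))


def rf_model(x, projections, coeffs, centers, width, output_weights):
    """Average of output-weighted activations of the random projections of x."""
    total = 0.0
    for w_row, v in zip(projections, output_weights):
        z = sum(wi * xi for wi, xi in zip(w_row, x))
        # The same coefficient vector is reused (broadcast) for every feature.
        total += v * learnable_activation(z, coeffs, centers, width)
    return total / len(projections)


rng = random.Random(0)
projections = random_projections(num_features=16, dim=3, rng=rng)
centers = [-1.0, 0.0, 1.0]
coeffs = [0.5, 1.0, 0.5]          # learnable activation coefficients (assumed values)
output_weights = [1.0] * 16       # learnable output layer (assumed values)

print(rf_model([0.2, -0.4, 0.1], projections, coeffs, centers, 1.0, output_weights))
```

Only `coeffs` and `output_weights` would be trained in this setup, so the number of activation parameters is independent of the number of random features.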

In conclusion, learnable activation functions offer a flexible and theoretically grounded means of enhancing neural model expressivity. They are supported by established methodologies for basis construction and optimization and have been reported to yield measurable gains across a variety of settings, provided their parameterization and regularization are adapted to the structure of the specific learning problem.