Edge-wise Learnable Nonlinearities
- These works introduce edge-wise learnable nonlinearities, showing that parameterizing each activation via adaptive splines or exponent functions improves network expressiveness.
- Edge-wise learnable univariate nonlinearities are defined as individual, trainable activation functions that can enforce constraints such as monotonicity and Lipschitz continuity through slope constraints combined with TV² regularization.
- Empirical results reveal that these adaptive methods yield notable improvements in tasks such as inverse imaging and time-series classification, with enhanced PSNR and accuracy over fixed activations.
Edge-wise learnable univariate nonlinearities refer to the paradigm in neural networks and related layered computational architectures where the nonlinear activation function on each edge (or neuron) is parameterized separately and can be learned from data, possibly subject to functional or structural constraints. In contrast to conventional architectures with a fixed, shared nonlinearity such as ReLU or sigmoid per layer or network, this approach endows each edge with its own adaptive univariate function, enabling highly expressive, data-driven modeling of complex transformations, increased flexibility, and—when properly regularized—strong theoretical guarantees regarding behavior such as monotonicity, Lipschitz continuity, and stability under composition.
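As a concrete, deliberately simplified illustration of the edge-wise idea, the following PyTorch sketch gives every output unit of a linear layer its own one-parameter nonlinearity, here a per-unit learnable negative slope. The class name and the specific parameterization are illustrative placeholders rather than the construction of any cited work, which use the richer spline or exponent parameterizations discussed below.

```python
import torch
import torch.nn as nn

class PerEdgeActivationLinear(nn.Module):
    """Linear layer in which every output unit owns a learnable 1-D nonlinearity.

    The nonlinearity here is a simple per-unit leaky slope (PReLU-like);
    spline or exponent parameterizations can be substituted for it.
    """
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # one learnable negative-side slope per output unit (edge/neuron-wise parameter)
        self.neg_slope = nn.Parameter(torch.full((out_features,), 0.25))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)
        # each unit applies its own univariate function to its pre-activation
        return torch.where(z >= 0, z, self.neg_slope * z)

layer = PerEdgeActivationLinear(16, 32)
y = layer(torch.randn(8, 16))   # shape (8, 32): 32 distinct learned nonlinearities
```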
1. Mathematical Formulation and Optimization
Edge-wise learnable univariate nonlinearities are incorporated by associating with each edge $(n,i)$ of the network its own trainable function $\sigma_{n,i}\colon \mathbb{R}\to\mathbb{R}$. The general variational problem consists of jointly optimizing the collection of all linear parameters $\{W_n\}$ and activations $\{\sigma_{n,i}\}$ against a task loss $E$, a standard regularizer $R$ on the linear weights, and a structure-inducing regularizer on the activations. A principled formulation is given in (Unser et al., 23 Aug 2024):
$$\min_{\{W_n\},\;\{\sigma_{n,i}\}\subset \mathrm{BV}^{(2)}(\mathbb{R})}\; \sum_{m} E\big(y_m, f_{\{W_n\},\{\sigma_{n,i}\}}(x_m)\big) \;+\; \mu\, R(\{W_n\}) \;+\; \lambda \sum_{n,i} \mathrm{TV}^{(2)}(\sigma_{n,i}) \quad \text{s.t. } \sigma_{n,i}' \in [s_{\min}, s_{\max}],$$
where $\mathrm{BV}^{(2)}(\mathbb{R})$ denotes the space of functions of bounded second variation (the minimal space on which the second-order total variation is defined), and the slope-box constraint $\sigma_{n,i}'(x)\in[s_{\min},s_{\max}]$ enforces properties such as monotonicity, 1-Lipschitzness, or firm non-expansiveness on each $\sigma_{n,i}$. The regularization term is the Radon norm of the (distributional) second derivative,
$$\mathrm{TV}^{(2)}(\sigma) = \big\| \mathrm{D}^2 \sigma \big\|_{\mathcal{M}},$$
which encourages the learned activations to be piecewise-affine.
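The discretized objective can be assembled directly from these ingredients. The sketch below, assuming each activation is stored as a coefficient vector on a sorted knot grid, computes the second-order total variation of a linear spline as the $\ell_1$-norm of its slope jumps and adds it to the task loss; the helper names (`tv2_penalty`, `total_loss`) are illustrative.

```python
import torch

def tv2_penalty(coeffs: torch.Tensor, knots: torch.Tensor) -> torch.Tensor:
    """TV^(2) of the linear spline defined by (knots, coeffs).

    For a piecewise-linear function, the Radon norm of the second derivative
    reduces to the l1-norm of the jumps between consecutive interval slopes.
    """
    slopes = (coeffs[1:] - coeffs[:-1]) / (knots[1:] - knots[:-1])
    return torch.abs(slopes[1:] - slopes[:-1]).sum()

def total_loss(task_loss, weights, splines, mu=1e-4, lam=1e-3):
    """task loss + mu * weight regularizer + lam * sum of TV^(2) terms."""
    weight_reg = sum(w.pow(2).sum() for w in weights)
    tv_reg = sum(tv2_penalty(c, k) for c, k in splines)   # splines: [(coeffs, knots), ...]
    return task_loss + mu * weight_reg + lam * tv_reg
```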
For exponent-parameterized edge nonlinearities, as in the learnable power-nonlinearity approach of (Chadha et al., 2019), each edge $e$ is endowed with a learned exponent $p_e$, parameterizing operations of the form $x \mapsto x^{p_e}$ on the value carried by that edge, with suitable constraints on $p_e$ to maintain numerical stability.
2. Parameterization and Spline Representer Results
In the TV-regularized setting (Unser et al., 23 Aug 2024), a central result is that the optimizer over each activation is an adaptive linear spline whose number of knots is bounded by the number of data points. Thus, each $\sigma_{n,i}$ can be represented as
$$\sigma_{n,i}(x) = \sum_{k=1}^{K} c_k\, \varphi_k(x),$$
where the $\varphi_k$ are nonuniform "hat" basis functions (linear B-splines) centered at the knots $\tau_1 < \dots < \tau_K$, and the spline coefficients $c_k$ are trainable. The slope on each interval is a simple finite difference,
$$s_k = \frac{c_{k+1} - c_k}{\tau_{k+1} - \tau_k},$$
while the regularizer becomes the $\ell_1$-norm of consecutive slope differences,
$$\mathrm{TV}^{(2)}(\sigma_{n,i}) = \sum_{k=1}^{K-2} \big| s_{k+1} - s_k \big|.$$
Slope constraints are enforced as box constraints $s_k \in [s_{\min}, s_{\max}]$ on each interval, mapping directly onto desired activation monotonicity, non-expansiveness, or invertibility.
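A minimal sketch of such a spline activation is given below, assuming a fixed uniform knot grid and constant extrapolation outside it (the cited construction uses nonuniform knots and linear extrapolation); evaluation reduces to locating the active interval and interpolating the two neighboring coefficients.

```python
import torch
import torch.nn as nn

class LinearSplineActivation(nn.Module):
    """One learnable piecewise-linear activation on a fixed, sorted knot grid.

    Represents sigma(x) = sum_k c_k * phi_k(x) with "hat" bases; evaluating it
    amounts to linear interpolation between the two neighboring coefficients.
    """
    def __init__(self, num_knots: int = 21, x_min: float = -3.0, x_max: float = 3.0):
        super().__init__()
        self.x_min, self.x_max = x_min, x_max
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        self.coeffs = nn.Parameter(self.knots.clone())   # identity initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t, c = self.knots, self.coeffs
        xc = x.clamp(self.x_min, self.x_max)             # constant extrapolation outside
        idx = torch.searchsorted(t, xc).clamp(1, len(t) - 1)
        t0, t1, c0, c1 = t[idx - 1], t[idx], c[idx - 1], c[idx]
        w = (xc - t0) / (t1 - t0)
        return c0 + w * (c1 - c0)                        # only two hat bases are active
```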
In exponent-nonlinear parameterizations (Chadha et al., 2019), each convolutional edge learns an exponent $p_{ij}$ for the operation $x \mapsto x^{p_{ij}}$ applied to the input value before it is multiplied by the corresponding kernel weight. All exponents are stacked into a tensor $P$ that matches the convolutional kernel in shape. For training, $P$ is initialized to $1$ (recovering standard convolution at initialization) or randomly within a specified range, and then projected back onto the admissible range after each update.
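The following sketch illustrates this construction for a 2D convolution, under the assumption that each patch entry is shifted to be positive and raised to its edge-specific exponent before being weighted; the class name `PowerConv2d` and the bounds `p_min`/`p_max` are placeholders, since the exact admissible range is not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerConv2d(nn.Module):
    """Convolution with one learnable exponent per kernel weight (illustrative).

    Every value in an input patch is raised to its edge-specific power before
    being multiplied by the corresponding kernel weight and summed.
    """
    def __init__(self, in_ch, out_ch, k, eps=1e-3, p_min=0.5, p_max=3.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)
        self.exponent = nn.Parameter(torch.ones(out_ch, in_ch, k, k))  # p = 1: plain conv
        self.eps, self.p_min, self.p_max, self.k = eps, p_min, p_max, k

    def forward(self, x):
        b, _, h, w_in = x.shape
        patches = F.unfold(x, self.k).clamp_min(self.eps)        # (B, Ckk, L), positive base
        wmat = self.weight.flatten(1)                            # (O, Ckk)
        pmat = self.exponent.flatten(1)                          # (O, Ckk)
        powered = patches.unsqueeze(1) ** pmat[None, :, :, None] # (B, O, Ckk, L)
        out = (powered * wmat[None, :, :, None]).sum(dim=2)      # (B, O, L)
        return out.view(b, -1, h - self.k + 1, w_in - self.k + 1)

    @torch.no_grad()
    def project_exponents(self):
        """Clamp exponents back into a stable range after each optimizer step."""
        self.exponent.clamp_(self.p_min, self.p_max)
```

Calling `project_exponents()` after each optimizer step plays the role of the projection mentioned above; the `(B, O, Ckk, L)` intermediate makes this sketch suitable only for small kernels.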
3. Training and Algorithmic Implementation
For TV-regularized spline nonlinearities, network training interleaves SGD or Adam updates for the linear parameters $\{W_n\}$ and all spline coefficient vectors $c_{n,i}$, usually in batches. For each update (a projection sketch follows the list below):
- Forward propagation evaluates each $\sigma_{n,i}(x)$ efficiently, since only two B-splines are nonzero for any given input $x$.
- The combined loss penalizes the data error and the TV term, and incurs a barrier for slope-violating coefficients.
- Gradients are computed w.r.t. both the linear parameters and the spline coefficients.
- After each update, slope vectors are projected into the admissible box $[s_{\min}, s_{\max}]$, and the coefficients $c_{n,i}$ are recovered via discrete integration of the projected slopes, maintaining the mean value for identifiability.
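The projection-plus-integration step of the last bullet can be sketched as follows, assuming the coefficients and knots of one spline activation are available as 1-D tensors; the function name is illustrative, and the code is one reading of the description above rather than a reference implementation.

```python
import torch

def project_spline_coeffs(coeffs, knots, s_min, s_max):
    """Project a spline's coefficients so every interval slope lies in [s_min, s_max].

    Slopes are clipped to the box, the coefficients are rebuilt by discrete
    integration (cumulative sum of slope * interval length), and the original
    mean value is restored for identifiability.
    """
    with torch.no_grad():
        dx = knots[1:] - knots[:-1]
        slopes = ((coeffs[1:] - coeffs[:-1]) / dx).clamp(s_min, s_max)
        rebuilt = torch.cat([coeffs[:1], coeffs[0] + torch.cumsum(slopes * dx, dim=0)])
        rebuilt += coeffs.mean() - rebuilt.mean()      # preserve the mean value
        coeffs.copy_(rebuilt)

# typical use, once per training step:
#   optimizer.step()
#   for act in spline_activations:
#       project_spline_coeffs(act.coeffs, act.knots, s_min=0.0, s_max=1.0)
```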
Exponent-parameterized edge nonlinearities are trained by optimizing both the standard kernel weights $W$ and the exponent tensor $P$. Each patch-wise operation $x^{p}$ is differentiable; the gradients are
$$\frac{\partial\, x^{p}}{\partial x} = p\, x^{p-1}, \qquad \frac{\partial\, x^{p}}{\partial p} = x^{p} \ln x,$$
which backpropagate readily via automatic differentiation. Numerical stability requires positive inputs (or appropriate domain shifts) so that the power and its logarithmic gradient remain well defined. Regularization may be applied to keep the exponents near unity unless a more complex nonlinearity is needed.
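To make these expressions concrete, the tiny autograd check below (PyTorch) verifies that backpropagating through `x ** p` reproduces the two partial derivatives stated above:

```python
import torch

# autograd check: d(x**p)/dx = p * x**(p-1), d(x**p)/dp = x**p * ln(x)
x = torch.tensor(2.0, requires_grad=True)
p = torch.tensor(1.5, requires_grad=True)
y = x ** p
y.backward()
assert torch.isclose(x.grad, p * x ** (p - 1))        # 1.5 * 2**0.5  ~ 2.121
assert torch.isclose(p.grad, x ** p * torch.log(x))   # 2**1.5 * ln 2 ~ 1.961
```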
For edge-wise ReLU biases as in rectified wire networks (Elser et al., 2018), only the biases on edges are trained (not the fixed positive node weights); learning proceeds via convex-constrained quadratic programs or the efficient sequential deactivation algorithm, ensuring monotonic updates while retaining rich expressivity.
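A schematic forward pass consistent with this description is sketched below, assuming a fixed 0/1 wiring in place of general positive node weights and a trainable bias on every edge; it illustrates the parameterization only, not the quadratic-programming or sequential-deactivation learning procedures.

```python
import torch
import torch.nn as nn

class RectifiedWireLayer(nn.Module):
    """Edges compute relu(x + b); only the per-edge biases b are trainable.

    The node wiring is fixed, sparse, and non-negative (a 0/1 expander-style
    mask here); all learning capacity resides in the edge biases.
    """
    def __init__(self, in_features, out_features, density=0.2):
        super().__init__()
        mask = (torch.rand(out_features, in_features) < density).float()
        self.register_buffer("wires", mask)                               # fixed wiring
        self.bias = nn.Parameter(torch.zeros(out_features, in_features))  # per-edge bias

    def forward(self, x):
        # x: (B, in) -> rectify every edge signal, then sum the wired edges per node
        edge = torch.relu(x.unsqueeze(1) + self.bias)      # (B, out, in)
        return (self.wires * edge).sum(dim=-1)             # (B, out)
```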
4. Computational Considerations
The main memory overhead for spline-based nonlinearities is the storage of the $K$ spline coefficients per activation; keeping $K$ up to roughly $100$ balances flexibility and efficiency, with unused knots pruned automatically by the sparsity-inducing TV term (Unser et al., 23 Aug 2024). Forward and backward passes per scalar input per activation are cheap, since at most two basis functions are active.
Exponent-based learnable nonlinearities increase computation by requiring a single elementwise power and possibly log operation per edge per input sample; this leads to a run-time increase of 10–20% compared to standard convolution, but minimal parameter overhead (Chadha et al., 2019).
For edge-wise rectified wire networks, learning requires quadratic programming or the sequential deactivation (SDA) procedure; each SDA iteration is inexpensive, and the cost of learning a single training item is bounded (Elser et al., 2018).
5. Empirical Findings and Performance
Replacing standard activations (ReLU, PReLU) with TV-regularized, edge-wise spline nonlinearities in imaging inverse problems (denoising, deblurring, MRI reconstruction) yields PSNR gains of $0.3$–$0.5$ dB over fixed activations. Loosening slope constraints can provide an additional $0.15$ dB. Layer-wise (shared per layer) splines perform about $0.2$ dB worse than edge-wise parameterizations (Unser et al., 23 Aug 2024). This confirms the efficacy and practical feasibility of edge-wise, data-driven activation learning.
For convolutional architectures on time-series classification tasks, per-edge learnable power nonlinearities achieve significant accuracy improvements, from 78.2% (baseline CNN) to 85.7% (trained per-edge exponents), surpassing both input-augmentation and fixed random-exponent strategies. Analysis of the learned exponents shows concentration near unity, but with some exponents in the $2$–$3$ range (squared/cubic) or below $0.5$ (root-like), enabling nuanced local adaptation to data complexity (Chadha et al., 2019).
Rectified wire networks with edge-wise learned biases achieve strong test accuracy on binarized MNIST using sparse expanders and attain performance within a few percent of Bayes-optimal on sequence tasks. Notably, representational capacity is not limited: all Boolean functions are representable by bias learning alone (Elser et al., 2018).
6. Architectural Variants and Extensions
Edge-wise learnable univariate nonlinearities can be instantiated as:
- Continuous linear splines (with TV² regularization and slope constraints) (Unser et al., 23 Aug 2024).
- Learnable parametric nonlinearities such as elementwise exponents, absolute values, or Gaussians (Chadha et al., 2019).
- Per-edge ReLU biases (rectified wire approach) (Elser et al., 2018).
- Coupled activations via parametrized matrix transformations of input vectors, which generalize the univariate, edge-wise paradigm (Chadha et al., 2019).
Alternative pixelwise parameterizations, Kronecker sharing, and lookup-table activations have been proposed as extensions, but computational and statistical efficiency remains critical.
7. Theoretical and Practical Significance
The TV-regularized framework guarantees that the global minimizer is a spline with few knots, enabling precise regularization and interpretability. Slope constraints ensure critical structural properties such as Lipschitzness and invertibility, which are essential in plug-and-play, unrolled optimization, and flow-based models.
Edge-wise trainable nonlinearities dramatically increase the representation power of neural architectures, adaptively capturing local data geometry. This gain is most pronounced on heterogeneous, nonlinear data such as time series or inverse imaging tasks.
Limitations include increased computational cost for expensive nonlinearity parameterizations (especially exponent/log operations), the risk of overfitting due to large numbers of parameters, and potential numerical instabilities when operating at or near the boundary of admissible parameter ranges (Chadha et al., 2019). For edge-wise ReLU networks, depth exploitation is empirically modest; most gains accrue at the single-layer level (Elser et al., 2018).
The effectiveness of these models is contingent upon joint optimization strategies that correctly couple the nonlinearities with the network weights, rigorous regularization, and computationally aware implementation. These approaches extend the design space of neural architectures, offering pathways to richer representations, theoretical guarantees, and principled adaptation to intricate modeling scenarios.