
Edge-wise Learnable Nonlinearities

Updated 4 December 2025
  • Recent work introduces edge-wise learnable nonlinearities, showing that parameterizing each activation separately via adaptive splines or learnable exponents improves network expressiveness.
  • Edge-wise learnable univariate nonlinearities are individual, trainable activation functions that can enforce constraints such as monotonicity and Lipschitz continuity through slope constraints and TV² regularization.
  • Empirical results reveal that these adaptive methods yield notable improvements in tasks such as inverse imaging and time-series classification, with enhanced PSNR and accuracy over fixed activations.

Edge-wise learnable univariate nonlinearities refer to the paradigm in neural networks and related layered computational architectures in which the nonlinear activation on each edge (or neuron) is parameterized separately and learned from data, possibly subject to functional or structural constraints. In contrast to conventional architectures with a fixed nonlinearity such as ReLU or sigmoid shared per layer or per network, this approach endows each edge with its own adaptive univariate function. The result is highly expressive, data-driven modeling of complex transformations, increased flexibility, and, when properly regularized, strong theoretical guarantees on properties such as monotonicity, Lipschitz continuity, and stability under composition.

1. Mathematical Formulation and Optimization

Edge-wise learnable univariate nonlinearities are incorporated by associating with each edge $e \in \mathcal{E}$ of the network its own trainable function $f_e : \mathbb{R} \to \mathbb{R}$. The general variational problem consists of jointly optimizing the collection of all linear parameters $\theta$ and activations $\mathcal{F} = \{f_e\}_{e \in \mathcal{E}}$ against a task loss $L(\theta, \mathcal{F})$, a standard regularizer $R(\theta)$ on the linear parameters, and a structure-inducing regularizer on the activations. A principled formulation is given in (Unser et al., 23 Aug 2024):

$$\min_{\theta,\ \{f_e \in \mathrm{BV}^{(2)}(\mathbb{R})\}} \; L(\theta, \{f_e\}) + R(\theta) + \lambda \sum_{e \in \mathcal{E}} \|f_e''\|_{\mathcal{M}} \quad \text{s.t.} \quad s_{\min} \le f_e'(x) \le s_{\max} \ \text{a.e.},$$

where $\mathrm{BV}^{(2)}(\mathbb{R})$ denotes the space of functions of bounded second variation (the minimal space on which the second-order total variation is defined), and the slope-box constraint enforces properties such as monotonicity, 1-Lipschitzness, or firm non-expansiveness of each $f_e$. The regularization term is the Radon norm of the (distributional) second derivative,

$$\|f_e''\|_{\mathcal{M}} = \mathrm{TV}^{(2)}(f_e),$$

which encourages the learned activations to be piecewise-affine.

For exponent-parameterized edge nonlinearities, as in the learnable power-nonlinearity approach of (Chadha et al., 2019), each edge is endowed with a learned exponent $a_i$, parameterizing operations such as $x \mapsto x^{a_i}$, with suitable constraints on $a_i$ to maintain numerical stability.
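
As a concrete anchor for the formulation above, the following is a minimal Python sketch of how the composite objective could be assembled, assuming per-edge activation modules that expose a hypothetical `tv2()` method returning their second-order total variation (the method name and interface are illustrative, not from the cited papers):

```python
import torch

def composite_objective(task_loss, linear_params, edge_activations,
                        lam=1e-4, weight_decay=1e-5):
    """L(theta, {f_e}) + R(theta) + lambda * sum_e TV^(2)(f_e).

    `edge_activations`: iterable of per-edge activation modules exposing a
    (hypothetical) tv2() method that returns ||f_e''||_M for that edge.
    """
    r_theta = weight_decay * sum(p.pow(2).sum() for p in linear_params)  # R(theta), e.g. weight decay
    tv2_total = sum(f.tv2() for f in edge_activations)                   # sum_e ||f_e''||_M
    return task_loss + r_theta + lam * tv2_total
```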

2. Parameterization and Spline Representer Results

In the $\mathrm{TV}^{(2)}$-regularized setting (Unser et al., 23 Aug 2024), a central result is that the optimal $f_e$ is an adaptive linear spline with at most $M-1$ knots, given $M$ data points. Thus, $f_e$ can be represented as

$$f_e(x) = \sum_{n=1}^{N} c_{e,n} B_n(x),$$

where $B_n(x)$ are nonuniform "hat" basis functions (linear B-splines) at knots $t_n$, and the spline coefficients $c_{e,n}$ are trainable. The slope on each interval is a simple finite difference,

$$s_{e,n} = \frac{c_{e,n} - c_{e,n-1}}{t_n - t_{n-1}},$$

while the regularizer becomes the $\ell_1$-norm of consecutive slope differences,

$$\sum_{n=2}^{N-1} |s_{e,n+1} - s_{e,n}|.$$

Slope constraints are enforced as box constraints on each $s_{e,n}$, mapping directly onto desired activation monotonicity, non-expansiveness, or invertibility.
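
The finite-dimensional quantities above translate directly into a few lines of code. The following NumPy sketch is illustrative only (not the reference implementation of Unser et al.): it evaluates the hat-basis spline, computes the interval slopes, the discretized $\mathrm{TV}^{(2)}$ penalty, and a slope projection with re-integration of the coefficients.

```python
import numpy as np

def spline_activation(x, knots, coeffs):
    """f_e(x) = sum_n c_{e,n} B_n(x): piecewise-linear interpolation on the knots.
    (np.interp extends constantly outside the knot range; a full implementation
    would instead extrapolate with the boundary slopes.)"""
    return np.interp(x, knots, coeffs)

def slopes(knots, coeffs):
    """s_{e,n} = (c_{e,n} - c_{e,n-1}) / (t_n - t_{n-1})."""
    return np.diff(coeffs) / np.diff(knots)

def tv2_penalty(knots, coeffs):
    """Discretized TV^(2): l1-norm of consecutive slope differences."""
    return np.abs(np.diff(slopes(knots, coeffs))).sum()

def project_slopes(knots, coeffs, s_min, s_max):
    """Clip every slope to [s_min, s_max], rebuild the coefficients by discrete
    integration, and restore the original mean value for identifiability."""
    s = np.clip(slopes(knots, coeffs), s_min, s_max)
    rebuilt = np.concatenate(([coeffs[0]], coeffs[0] + np.cumsum(s * np.diff(knots))))
    return rebuilt - rebuilt.mean() + coeffs.mean()
```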

In exponent-nonlinear parameterizations (Chadha et al., 2019), each convolutional edge learns an exponent $a_i$ for the operation $x_i^{a_i}$. All $a_i$ are stacked into a matrix $A$ whose shape matches that of the convolutional kernel $W$. For training, $A$ is initialized to $1$ (recovering standard convolution at initialization) or randomly within a specified range, and then projected to $[v_{\min}, v_{\max}]$ after each update.
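
For illustration, a hedged PyTorch sketch of such a layer is given below; the shapes, names, and unfold-based implementation are assumptions made for readability and are not taken from (Chadha et al., 2019). The layer starts as an ordinary convolution ($A = 1$), and its exponents are clamped to $[v_{\min}, v_{\max}]$ by calling `project()` after each optimizer step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerConv2d(nn.Module):
    """Per-edge learnable exponents: y = sum_i w_i * x_i**a_i over each k x k patch.
    The exponent tensor A has the same shape as the kernel W."""

    def __init__(self, in_ch, out_ch, k, v_min=0.5, v_max=3.0, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, k, k))
        self.exponent = nn.Parameter(torch.ones(out_ch, in_ch, k, k))  # A initialized to 1
        self.k, self.v_min, self.v_max, self.eps = k, v_min, v_max, eps

    def forward(self, x):
        b, _, h, w = x.shape
        patches = F.unfold(x, self.k).clamp_min(self.eps)   # (B, I, L), I = in_ch*k*k; keep x_i > 0
        a = self.exponent.flatten(1)                         # (O, I)
        wgt = self.weight.flatten(1)                         # (O, I)
        powered = patches.unsqueeze(1) ** a.unsqueeze(0).unsqueeze(-1)   # (B, O, I, L): x_i ** a_i
        out = (wgt.unsqueeze(0).unsqueeze(-1) * powered).sum(dim=2)      # (B, O, L): weighted patch sums
        return out.view(b, -1, h - self.k + 1, w - self.k + 1)

    @torch.no_grad()
    def project(self):
        """Projection step: clamp A to [v_min, v_max] after each update."""
        self.exponent.clamp_(self.v_min, self.v_max)
```

Calling `layer.project()` immediately after `optimizer.step()` mirrors the projection described above.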

3. Training and Algorithmic Implementation

For $\mathrm{TV}^{(2)}$-regularized spline nonlinearities, network training interleaves SGD or Adam updates for the linear parameters $\theta$ and all spline coefficient vectors $c_e$, usually in batches. For each update (a minimal training-step sketch follows the list):

  • Forward propagation evaluates $f_e(x)$ efficiently, since only two B-spline basis functions are nonzero for any given $x$.
  • The combined loss comprises the data term, the $\mathrm{TV}^{(2)}$ penalty, and a barrier for slope-violating $s_{e,n}$.
  • Gradients are computed w.r.t. both $\theta$ and the $c_{e,n}$.
  • After each update, the slope vectors $s_e$ are projected into $[s_{\min}, s_{\max}]^N$, and the $c_e$ are recovered via discrete integration, with the mean value maintained for identifiability.
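
A minimal sketch of one such interleaved update is shown below, assuming a model whose spline activations expose hypothetical `tv2()` and `project_slopes()` methods (mirroring the quantities defined in Section 2); it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch, lam, s_min, s_max):
    """One interleaved update of theta and the spline coefficients c_e."""
    x, y = batch
    optimizer.zero_grad()
    pred = model(x)                                     # forward pass: O(1) per spline evaluation
    loss = F.mse_loss(pred, y)                          # data term (task-dependent)
    loss = loss + lam * sum(f.tv2() for f in model.spline_activations())   # TV^(2) penalty
    loss.backward()                                     # grads w.r.t. theta and all c_{e,n}
    optimizer.step()
    with torch.no_grad():
        for f in model.spline_activations():
            f.project_slopes(s_min, s_max)              # enforce the slope box constraint
    return loss.item()
```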

Exponent-parameterized edge nonlinearities are trained by optimizing both the standard kernel weights $W$ and the exponent tensor $A$. Each patch-wise operation $x^a$ is differentiable; the gradients are

$$\frac{\partial y}{\partial a_i} = w_i x_i^{a_i} \ln x_i, \qquad \frac{\partial y}{\partial w_i} = x_i^{a_i},$$

which backpropagate readily via automatic differentiation. Numerical stability requires $x_i > 0$ or appropriate domain shifts. Regularization may be applied to keep $A$ near unity unless a more complex nonlinearity is needed.
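
These expressions can be sanity-checked against automatic differentiation; the short PyTorch snippet below is purely illustrative:

```python
import torch

# Check dy/da_i = w_i * x_i**a_i * ln(x_i) and dy/dw_i = x_i**a_i for y = sum_i w_i * x_i**a_i.
x = torch.tensor([0.7, 1.3, 2.1])                      # x_i > 0 so that x**a and ln(x) are well defined
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
a = torch.tensor([1.0, 0.8, 2.3], requires_grad=True)

y = (w * x.pow(a)).sum()
y.backward()

assert torch.allclose(a.grad, w * x.pow(a) * x.log())
assert torch.allclose(w.grad, x.pow(a))
```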

For edge-wise ReLU biases, as in rectified wire networks (Elser et al., 2018), only the biases on the edges are trained (the positive node weights are fixed). Learning proceeds via convex-constrained quadratic programs or the efficient sequential deactivation algorithm, which ensures monotonic updates while retaining rich expressivity.
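
As a structural illustration only (the exact wiring and the QP/sequential-deactivation training of Elser et al. are not reproduced here), an edge-wise ReLU-bias layer in which only the per-edge biases are trainable might look as follows:

```python
import torch
import torch.nn as nn

class RectifiedWireLayer(nn.Module):
    """Each edge (j, i) applies relu(x_i + b_ji); the node sums its incoming
    edges with fixed positive weights. Only the per-edge biases are trained."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(out_dim, in_dim))         # one trainable bias per edge
        self.register_buffer("weight", torch.ones(out_dim, in_dim))    # fixed positive node weights

    def forward(self, x):                                 # x: (batch, in_dim)
        edges = torch.relu(x.unsqueeze(1) + self.bias)    # (batch, out_dim, in_dim)
        return (self.weight * edges).sum(dim=-1)          # (batch, out_dim)
```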

4. Computational Considerations

The main memory overhead for spline-based nonlinearities is $|\mathcal{E}| \times N$ spline coefficients; a choice of $N = 20$–$100$ balances flexibility and efficiency, with unused knots pruned automatically by the sparsity-inducing $\mathrm{TV}^{(2)}$ term (Unser et al., 23 Aug 2024). Forward and backward passes cost $O(1)$ per scalar input per activation.
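
For a rough sense of scale (illustrative, hypothetical numbers):

```python
# Back-of-the-envelope memory for the spline coefficients |E| x N.
num_edges = 1_000_000        # |E| (hypothetical network size)
knots_per_edge = 50          # N, within the 20-100 range quoted above
bytes_per_coeff = 4          # float32
print(num_edges * knots_per_edge * bytes_per_coeff / 2**20, "MiB")   # ~190.7 MiB
```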

Exponent-based learnable nonlinearities increase computation by requiring a single elementwise power and possibly log operation per edge per input sample; this leads to a run-time increase of 10–20% compared to standard convolution, but minimal parameter overhead (Chadha et al., 2019).

For edge-wise rectified wire networks, learning requires quadratic programming or the sequential deactivation algorithm (SDA), but each SDA iteration costs $O(|A|)$ and learning an item costs at most $O(|E|^2)$ (Elser et al., 2018).

5. Empirical Findings and Performance

When replacing standard activations (ReLU, PReLU) with $\mathrm{TV}^{(2)}$-regularized, edge-wise spline nonlinearities in imaging inverse problems (denoising, deblurring, MRI reconstruction), the observed PSNR gains are $0.3$–$0.5$ dB compared to fixed activations. Loosening the slope constraints can provide an additional $0.15$ dB. Layer-wise (shared per layer) splines perform about $0.2$ dB worse than edge-wise parameterizations (Unser et al., 23 Aug 2024). This confirms the efficacy and practical feasibility of edge-wise, data-driven activation learning.

For convolutional architectures on time-series classification tasks, per-edge learnable power nonlinearities achieve significant accuracy improvements: from 78.2% (baseline CNN) to 85.7% (trained per-edge exponents), surpassing both input-augmentation and fixed random-exponent strategies. Analysis of the learned exponents shows concentration in $[0.8, 1.2]$, with some exponents in the range $2$–$3$ (squared/cubic) or below $0.5$ (root-like), enabling nuanced local adaptation to data complexity (Chadha et al., 2019).

Rectified wire networks with edge-wise learned biases achieve test accuracy of up to $\sim 95\%$ on binarized MNIST using sparse expanders and attain performance within a few percent of Bayes-optimal on sequence tasks. Notably, representational capacity is not limited: all Boolean functions are representable by bias learning alone (Elser et al., 2018).

6. Architectural Variants and Extensions

Edge-wise learnable univariate nonlinearities can be instantiated as:

  • Continuous linear splines (with $\mathrm{TV}^{(2)}$ smoothing and slope constraints) (Unser et al., 23 Aug 2024).
  • Learnable parametric nonlinearities such as elementwise exponents, absolute values, or Gaussians (Chadha et al., 2019).
  • Per-edge ReLU biases (rectified wire approach) (Elser et al., 2018).
  • Generalized coupled activations via parametrized matrix transformations of input vectors, which generalize the univariate, edge-wise paradigm (Chadha et al., 2019).

Alternative pixelwise parameterizations, Kronecker sharing, and lookup-table activations have been proposed as extensions, but computational and statistical efficiency remains critical.

7. Theoretical and Practical Significance

The $\mathrm{TV}^{(2)}$-regularized framework guarantees that the global minimizer is a spline with few knots, enabling precise regularization and interpretability. Slope constraints ensure critical structural properties such as Lipschitzness and invertibility, which are essential in plug-and-play, unrolled optimization, and flow-based models.

Edge-wise trainable nonlinearities dramatically increase the representation power of neural architectures, adaptively capturing local data geometry. This gain is most pronounced on heterogeneous, nonlinear data such as time series or inverse imaging tasks.

Limitations include increased computational cost for expensive nonlinearity parameterizations (especially exponent/log operations), the risk of overfitting due to the large number of per-edge parameters, and potential numerical instabilities when operating at or near the boundary of admissible parameter ranges (Chadha et al., 2019). For edge-wise ReLU networks, depth exploitation is empirically modest; most gains accrue at the single-layer level (Elser et al., 2018).

The effectiveness of these models is contingent upon joint optimization strategies that correctly couple the nonlinearities with the network weights, rigorous regularization, and computationally aware implementation. These approaches extend the design space of neural architectures, offering pathways to richer representations, theoretical guarantees, and principled adaptation to intricate modeling scenarios.
