Padé Neurons (Paons)
- Padé neurons (Paons) are neuron models that use learned rational functions to generalize the classic McCulloch–Pitts neuron, enabling diverse nonlinear mappings.
- They integrate smoothly with architectures like ResNet, improving performance in image restoration, compression, and classification while reducing network depth and parameter count.
- Numerical stability is achieved with a smoothed variant that prevents denominator collapse, ensuring robust training and reliable network operation.
Padé neurons ("Paons") generalize the classical McCulloch–Pitts neuron by replacing the canonical linear transform plus pointwise nonlinearity with a learned rational function, drawing from principles of Padé approximation. Each Paon computes a learnable rational mapping from inputs to outputs, significantly enhancing the diversity and strength of element-wise nonlinearities per neuron. The Paon framework subsumes previously proposed nonlinear neuron models—quadratic, generative, super, and operational neurons—as special cases, and integrates robustly with modern deep learning architectures such as ResNet for applications in image restoration, compression, and classification. Empirical evidence confirms that Paon-based architectures provide enhanced layer efficiency and overall performance relative to classic designs, often requiring shallower and more computationally efficient networks to achieve or exceed baseline results (Keleş et al., 7 Jan 2026, Keleş et al., 2024, Molina et al., 2019).
1. Mathematical Formulation
A Padé neuron of order $(m, n)$, operating on an input tensor $x$, computes

$$y = P(x) \oslash Q(x),$$

where

$$P(x) = \sum_{k=0}^{m} W_k \circledast x^{\circ k}, \qquad Q(x) = 1 + \sum_{k=1}^{n} V_k \circledast x^{\circ k}.$$

Here, $x^{\circ k}$ denotes the elementwise $k$-th power, $\oslash$ is elementwise division, and the operation $\circledast$ represents either a standard fully connected weight (in dense layers) or a learned convolution (in CNNs), depending on the context. The coefficients $W_k$ and $V_k$ are trainable parameters.
To address numerical instability when $Q(x) \to 0$, a smoothed variant (smoothed Paon) is introduced:

$$\tilde{y} = P(x) \oslash \sqrt{Q(x)^{\circ 2} + \epsilon}, \qquad \epsilon > 0,$$

which bounds the magnitude of the effective denominator below by $\sqrt{\epsilon}$. This smoothed formulation is crucial in practice: direct rational forms without smoothing encounter denominator collapse at high frequency, leading to numerical failures (Keleş et al., 7 Jan 2026).
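The dense case of the formulas above can be sketched in a few lines of numpy. This is a minimal illustration, not the reference implementation; the function name, weight layout, and the $\epsilon$-smoothing form are assumptions consistent with the definitions in this section.

```python
import numpy as np

def paon_forward(x, W, V, eps=1e-6):
    """Forward pass of a dense smoothed Padé neuron (illustrative sketch).

    x   : (d,) input vector
    W   : list of (o, d) numerator weights; W[k] acts on x**k (x**0 == ones)
    V   : list of (o, d) denominator weights; V[j] acts on x**(j+1)
    eps : smoothing constant keeping the denominator away from zero
    """
    P = sum(Wk @ (x ** k) for k, Wk in enumerate(W))              # numerator P(x)
    Q = 1.0 + sum(Vj @ (x ** (j + 1)) for j, Vj in enumerate(V))  # denominator Q(x)
    return P / np.sqrt(Q ** 2 + eps)                              # smoothed division
```

With all denominator weights zero and $W_1$ set to the identity, the unit reduces to (approximately) the identity map, which is also the initialization described in Section 4.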
2. Theoretical Properties and Universality
Padé neurons constitute a universal function class: multilayer networks equipped with any non-polynomial rational activation are universal approximators on compact sets, as proven in related universal approximation theorems (Molina et al., 2019). The flexibility of rational representations enables each neuron to learn a distinct nonlinear mapping, expanding expressivity beyond polynomials, Taylor-series–based generative neurons, and fixed activations.
Padé neuron stability is managed by ensuring $Q(x)$ remains bounded away from zero, a property further enhanced by the smoothing strategy. This control over the denominator guarantees global well-posedness and local Lipschitz continuity, supporting robustness analyses and adaptation of recent robustness certifications (Molina et al., 2019).
3. Special Cases and Relationship to Existing Neuron Models
Paons subsume the following neuron models as specific parameterizations:
| Model | Padé orders $(m, n)$ | Active modules |
|---|---|---|
| McCulloch–Pitts (linear + nonlin) | $m = 1$, $n = 0$ | None |
| Quadratic neuron | $m = 2$, $n = 0$ | None |
| Generative (Taylor) neuron | $m \geq 1$, $n = 0$ | None |
| Super neuron | $m \geq 1$, $n = 0$ | Shifter |
Super neurons are realized by supplementing the generative form with a learnable spatial "Shifter" module, which can take the form of either kernel-wise or element-wise receptive field shifts, realized by learnable offsets or lightweight deformable convolutions (Keleş et al., 7 Jan 2026, Keleş et al., 2024).
For any choice of orders with $n \geq 1$, the rational mapping encompasses higher-order functional nonlinearities exceeding the expressiveness of classic Taylor or power-series units.
4. Network Integration and Implementation
Integration of Paons into standard architectures requires replacing the convolution-plus-activation block with a single rational Paon (or smoothed Paon) layer. For instance, in ResNet-style architectures, the conventional "conv–BN–ReLU" sequence becomes "Paon–BN" or even omits normalization altogether. The rational neuron parameters $W_k$, $V_k$ are trainable via standard stochastic gradient descent or adaptive optimizers (e.g., AdamW, Adan), with initialization schemes designed to approximate the identity at the start of training (e.g., the first-order numerator weight $W_1$ initialized to the identity and all other parameters near zero) (Keleş et al., 7 Jan 2026, Keleş et al., 2024). Shift convolutional modules are zero-initialized, ensuring no shift is learned unless it is beneficial for the training loss.
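The identity-approximating initialization can be sketched as follows; this is a hypothetical helper (names `init_identity_paon`, `dense_paon` are illustrative) assuming the dense formulation from Section 1, shown here without smoothing for brevity.

```python
import numpy as np

def init_identity_paon(d, m=2, n=1):
    """Identity-approximating init (a sketch of the described scheme):
    first-order numerator weight = identity, everything else zero,
    so the layer starts out computing y = x / 1 = x."""
    W = [np.zeros((d, d)) for _ in range(m + 1)]
    W[1] = np.eye(d)                           # numerator P(x) = x at init
    V = [np.zeros((d, d)) for _ in range(n)]   # denominator Q(x) = 1 at init
    return W, V

def dense_paon(x, W, V):
    """Dense (fully connected) Paon forward pass, unsmoothed for brevity."""
    P = sum(Wk @ (x ** k) for k, Wk in enumerate(W))
    Q = 1.0 + sum(Vj @ (x ** (j + 1)) for j, Vj in enumerate(V))
    return P / Q
```

Starting at the identity means a "Paon–BN" block behaves like a plain residual pass-through early in training, which is what makes the drop-in replacement of "conv–BN–ReLU" stable.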
Forward and backward computations follow from the quotient rule, with the derivative with respect to the input (and, analogously, each parameter) given by

$$\frac{\partial y}{\partial x} = \frac{P'(x)\, Q(x) - P(x)\, Q'(x)}{Q(x)^{\circ 2}},$$

where $P'$ and $Q'$ are the derivatives of the numerator and denominator polynomials, respectively (Keleş et al., 2024, Molina et al., 2019).
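The quotient-rule backward pass is easy to sanity-check against finite differences. The sketch below uses a scalar rational unit with denominator $1 + \sum_j b_j x^{j+1}$ (coefficient names are illustrative, not from the papers):

```python
def rational(x, a, b):
    """Pointwise rational unit y = P(x)/Q(x), P = sum_k a_k x^k, Q = 1 + sum_j b_j x^(j+1)."""
    P = sum(ak * x ** k for k, ak in enumerate(a))
    Q = 1.0 + sum(bj * x ** (j + 1) for j, bj in enumerate(b))
    return P / Q

def rational_grad_x(x, a, b):
    """dy/dx via the quotient rule: (P'Q - PQ') / Q**2."""
    P = sum(ak * x ** k for k, ak in enumerate(a))
    Q = 1.0 + sum(bj * x ** (j + 1) for j, bj in enumerate(b))
    dP = sum(k * ak * x ** (k - 1) for k, ak in enumerate(a) if k > 0)
    dQ = sum((j + 1) * bj * x ** j for j, bj in enumerate(b))
    return (dP * Q - P * dQ) / Q ** 2
```

In a framework with automatic differentiation these derivatives come for free; the explicit form matters mainly for custom fused kernels.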
Paon layers incur computational complexity and memory scaling proportional to $m + n$ relative to standard convolutions, since one convolution is needed per power in both the numerator and the denominator. Smoothed Paons additionally require several extra multiply-adds per output element and a moderate increase in memory consumption (Keleş et al., 7 Jan 2026).
5. Empirical Performance and Benchmarks
Empirical evaluation of Paons has spanned a range of image processing and recognition tasks.
- Image super-resolution: Experiments on DIV2K, Set5/14, BSD100, Manga109, Urban100 demonstrate that shallow Paon-based architectures (e.g., residual blocks, 445K parameters) outperform classic ResNet, PAU-Net, SelfONN, SuperONN, and DCN baselines on PSNR, SSIM, and LPIPS metrics. Ablation shows that Paon achieves comparable quality with 25% fewer parameters, and that smoothing is essential to prevent numerical instability (Keleş et al., 7 Jan 2026, Keleş et al., 2024).
- Image compression: Integration of Paon units in anchor architectures (MBT, ELIC) eliminates the need for GDN nonlinearities and reduces BD-rate by up to 6%, while also achieving higher performance with fewer residual blocks in the ELIC architecture (Keleş et al., 7 Jan 2026).
- Classification: On CIFAR-10, 14-layer PadéResNet models with Paon units achieve equal or better accuracy than 20-layer vanilla ResNet20. Training in float16/bfloat16 confirms numerical stability is maintained (Keleş et al., 7 Jan 2026).
- Flexible activation learning: Earlier work with Padé Activation Units (PAUs) showed that even when used purely as parametric activations shared per layer, they can match or exceed performance of fixed activations (ReLU, Swish, etc.) with negligible parameter overhead (Molina et al., 2019).
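The activation-only usage from the last bullet can be illustrated with a "safe" PAU and its per-layer parameter overhead. This is a sketch under assumptions: the function names are illustrative, and the safe-denominator form $1 + |\sum_j b_j x^{j+1}|$ follows the formulation in this article.

```python
import numpy as np

def pau(x, a, b):
    """'Safe' Padé Activation Unit: y = P(x) / (1 + |sum_j b_j x^(j+1)|).
    The absolute value keeps the denominator >= 1, so the unit is defined everywhere."""
    num = sum(ak * x ** k for k, ak in enumerate(a))
    den = 1.0 + np.abs(sum(bj * x ** (j + 1) for j, bj in enumerate(b)))
    return num / den

def pau_params_per_layer(m, n):
    """One shared PAU of order (m, n) adds m + 1 numerator and n denominator
    coefficients per layer -- negligible next to millions of conv weights."""
    return (m + 1) + n
```

Because the coefficients are shared across an entire layer, the overhead stays in the tens of parameters even for deep networks, consistent with the parameter-efficiency claims cited above.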
6. Numerical Stability, Initialization, and Practical Considerations
Stability is a central concern due to the possibility of denominator collapse ($Q(x) \to 0$). Smoothed denominator variants, or enforcing $Q(x) \geq 1$ via absolute values, are critical for preventing instability and NaNs; unmitigated, such failures can occur thousands of times per layer. Shifter modules, when employed, are initialized to impose no offset unless one proves useful for the task loss (Keleş et al., 7 Jan 2026, Keleş et al., 2024). Parameter initialization strategies set all coefficients so that the neuron output approximates an identity mapping at the start of training, supporting smooth convergence.
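The collapse scenario and the absolute-value guard can be seen in a two-line numpy sketch (function names are illustrative):

```python
import numpy as np

def raw_denominator(x, b):
    """Unguarded Q(x) = 1 + sum_j b_j x^(j+1): can cross zero during training."""
    return 1.0 + sum(bj * x ** (j + 1) for j, bj in enumerate(b))

def guarded_denominator(x, b):
    """Guarded variant: Q(x) = 1 + |sum_j b_j x^(j+1)| >= 1, ruling out collapse."""
    return 1.0 + np.abs(sum(bj * x ** (j + 1) for j, bj in enumerate(b)))
```

At $x = 1$ with a single coefficient $b_0 = -1$, the raw denominator is exactly zero (a division-by-zero NaN in training), while the guarded one equals 2.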
Parameter overhead is modest. For activation-only usage, shared per-layer PAUs with small orders add less than 50 parameters to networks as large as VGG-8 (9M params), making the method highly parameter-efficient (Molina et al., 2019). For full Paon layers, the cost grows with $m + n$, but in practice small values of $m$ and $n$ are sufficient, balancing flexibility and compute.
Recommended best practices include low initial learning rates for rational parameters, variance-bounded initializations, regularization on denominator coefficients to maintain conditioning, and optional randomized noise injection (RPAU) to regularize and prevent overfitting when used as pointwise activations (Molina et al., 2019).
7. Comparative Perspective and Future Directions
Paons represent a unifying framework that strictly generalizes prior enhanced neuron constructs—quadratic, generative, super, and operational neurons—by subsuming them within the rational function parameterization. This broadens the functional repertoire available to each unit, allowing for greater expressivity per layer or per neuron, and yields shallower, more computationally efficient network designs with demonstrable empirical advantages across diverse vision tasks (Keleş et al., 7 Jan 2026).
Remaining avenues for development include further optimizing compute for hardware-accelerated deployment (especially for rational functions), advancing quantization support, and broadening empirical assessment to other domains such as sequence modeling and reinforcement learning. A plausible implication is that explicit rational units like Paons may accelerate the development of more compact and robust architectures when task requirements favor high local nonlinearity or when model size/depth is a bottleneck.
Key references:
- "Padé Neurons for Efficient Neural Models" (Keleş et al., 7 Jan 2026)
- "PAON: A New Neuron Model using Padé Approximants" (Keleş et al., 2024)
- "Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks" (Molina et al., 2019)