- The paper introduces a novel power transform unifying diverse mathematical concepts like loss functions, kernels, probability distributions, and neural network activations.
- The paper presents a fast and stable algorithm with a reference implementation for the power transform, addressing previous numerical issues.
- Specific cases of this transform generate various robust loss functions, common probability distributions, stationary kernels, and neural network activations.
The paper introduces a novel power transform, serving as a unifying framework for loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions. The author identifies numerical stability and speed issues with previous iterations of this power transform and presents a fast and stable algorithm alongside a reference implementation.
The "root" form of the power transform is defined as:
$$f(x, \lambda) \triangleq \frac{2|\lambda|}{2 - |\lambda| + \lambda} \cdot \left( \left( 1 + \frac{2 - |\lambda| - \lambda}{2|\lambda|} \, x \right)^{\left(\frac{1}{1 - |\lambda|}\right)^{\operatorname{sgn}(\lambda)}} - 1 \right)$$
where:
- f: the power transform function
- x: the input value
- λ: the transform parameter
- sgn: the sign function
This equation exhibits removable singularities, with specific cases defined for λ = +∞, 1, 0, −1, −∞, and general forms for 0 < λ < +∞ and −∞ < λ < 0. Several special cases of f(x,λ) correspond to curves and transforms commonly used in various contexts. For instance, λ = 0 corresponds to the identity, λ = −1/2 to a "padded" square root, and λ = 1/2 to a shifted and scaled quadratic function. When λ ∈ (−∞, 1), the transform is a normalized version of the Box-Cox transform.
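To make the definition concrete, here is a minimal NumPy sketch that evaluates this root form directly. The name `power_transform` is mine, the special-case branches are my derivation of the limits at the removable singularities (consistent with the family members discussed later, e.g. exp(x) − 1 and log(1 + x)), and this is not the paper's reference implementation:

```python
import numpy as np

def power_transform(x, lam):
    """Direct evaluation of the root form f(x, lam) as reconstructed above.

    The branches for lam in {-inf, -1, 0, 1, +inf} are the limits of the
    general expression at its removable singularities.
    """
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return x                        # identity
    if lam == 1:
        return np.expm1(x)              # exp(x) - 1
    if lam == -1:
        return np.log1p(x)              # log(1 + x)
    if lam == np.inf:
        return -np.log1p(-x)            # -log(1 - x)
    if lam == -np.inf:
        return -np.expm1(-x)            # 1 - exp(-x)
    a = abs(lam)
    scale = 2.0 * a / (2.0 - a + lam)                # outer multiplier
    slope = (2.0 - a - lam) / (2.0 * a)              # multiplier on x inside the power
    power = 1.0 / (1.0 - a) if lam > 0 else 1.0 - a  # (1 / (1 - |lam|)) ** sgn(lam)
    return scale * ((1.0 + slope * x) ** power - 1.0)
```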
The transform is monotonic with respect to both x and λ, and it is self-inverting: f⁻¹(x, λ) = f(x, −λ).
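A quick numerical check of both properties, reusing the `power_transform` sketch above (the particular λ values and inputs are arbitrary):

```python
# Self-inversion: applying f(., lam) and then f(., -lam) should return the input.
xs = np.linspace(0.0, 1.5, 7)
for lam in (-2.0, -0.5, 0.5, 2.0):
    assert np.allclose(power_transform(power_transform(xs, lam), -lam), xs)

# Monotonicity in x (for a fixed lam) and in lam (for a fixed x).
assert np.all(np.diff(power_transform(xs, 0.5)) > 0)
assert power_transform(1.0, -0.5) < power_transform(1.0, 0.0) < power_transform(1.0, 0.5)
```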
Applying f(x,λ) to a quadratic (L2) loss function ½(x/c)² with a scale parameter c > 0 yields a family of robust losses:
$$\rho(x, \lambda, c) \triangleq f\!\left( \tfrac{1}{2} \left( \tfrac{x}{c} \right)^{2},\ \lambda \right)$$
where:
- ρ: the robust loss function
- x: the input value
- λ: the transform parameter
- c: the scale parameter
This family includes L2 loss, Cauchy loss, and Welsch loss. Charbonnier and Geman-McClure losses are also special cases. As λ decreases from 0 to −∞, the loss function becomes more robust. This family of loss functions is a superset of a previous family of robust losses the author explored, which corresponds to the subfamily with λ ≤ 1, and is also a superset of the "Smooth Exponential Family" of loss functions.
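As an illustration, a sketch of the loss built on top of `power_transform`; `robust_loss` is a name I've chosen, and the checks encode special cases (L2 at λ = 0, Cauchy at λ = −1, Welsch at λ = −∞) that follow from the limits assumed in the earlier sketch rather than being quoted from the paper:

```python
def robust_loss(x, lam, c):
    """rho(x, lam, c) = f((x / c)**2 / 2, lam)."""
    return power_transform(0.5 * (np.asarray(x, dtype=float) / c) ** 2, lam)

r = np.array([-3.0, -0.5, 0.0, 2.0])
assert np.allclose(robust_loss(r, 0.0, 1.0), 0.5 * r**2)                  # L2
assert np.allclose(robust_loss(r, -1.0, 1.0), np.log1p(0.5 * r**2))       # Cauchy
assert np.allclose(robust_loss(r, -np.inf, 1.0), -np.expm1(-0.5 * r**2))  # Welsch
```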
Taking the derivative of the loss function ρ(x,λ,c) and scaling it by c²/x yields a family of stationary kernel functions:
$$k(x, \lambda, c) \triangleq \frac{c^{2}}{x} \cdot \frac{\partial \rho(x, \lambda, c)}{\partial x}$$
where:
- k: the stationary kernel function
- x: the Euclidean distance between two points
- λ: the transform parameter
- c: the scale parameter
Several cases are commonly used kernels. When λ < −1/2, this kernel is equivalent to a rescaled Student's t-distribution. The kernel, ignoring the c² scaling, is equivalent to the weights used by iteratively reweighted least-squares when minimizing ρ(x,λ,c).
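One way to evaluate the kernel follows from the chain rule: ∂ρ/∂x = f′(u, λ) · x/c² with u = ½(x/c)², so k(x, λ, c) reduces to f′(u, λ). The sketch below uses the closed-form derivative of my reconstruction of f, restricted to λ < 0 (plus the λ = −∞ limit) for brevity; `stationary_kernel` is my name:

```python
def stationary_kernel(x, lam, c):
    """k(x, lam, c) = (c**2 / x) * d/dx rho(x, lam, c) = f'(u, lam), u = (x/c)**2 / 2.

    For lam < 0 the derivative of the reconstructed root form is (1 + u / -lam) ** lam.
    """
    u = 0.5 * (np.asarray(x, dtype=float) / c) ** 2
    if lam == -np.inf:
        return np.exp(-u)              # squared-exponential (RBF) kernel
    assert lam < 0
    return (1.0 + u / -lam) ** lam     # lam = -1: Cauchy / rational kernel

d = np.linspace(0.0, 4.0, 9)           # Euclidean distances between point pairs
assert np.allclose(stationary_kernel(d, -1.0, 1.0), 1.0 / (1.0 + 0.5 * d**2))
```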
Exponentiating the loss function ρ(x,λ,c) and normalizing by the integral yields a family of probability distributions:
$$P(x, \lambda, c) = \frac{1}{c \cdot Z(\lambda)} \exp(-\rho(x, \lambda, c)), \qquad Z(\lambda) = \int_{-\infty}^{\infty} \exp(-\rho(x, \lambda, 1)) \, dx$$
where:
- P: the probability distribution
- x: the input value
- λ: the transform parameter
- c: the scale parameter
- Z: the normalization factor
This family includes the Epanechnikov, Normal, "smooth Laplace", and Cauchy distributions. The distribution is undefined when λ < −1, and it has bounded support when λ > 1.
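A sketch of evaluating the normalizer and the density numerically, reusing `robust_loss` from above; the quadrature is my stand-in for however the paper computes Z(λ), and the closing check (λ = 0 giving the standard Normal with Z(0) = √(2π)) follows from the reconstruction rather than the paper's text:

```python
from scipy.integrate import quad

def partition(lam):
    """Z(lam), computed by quadrature.  Sketch for -1 <= lam <= 1; for lam > 1
    the support is bounded, so the integration limits would need clipping."""
    value, _ = quad(lambda t: np.exp(-robust_loss(t, lam, 1.0)), -np.inf, np.inf)
    return value

def density(x, lam, c):
    return np.exp(-robust_loss(x, lam, c)) / (c * partition(lam))

# Under the reconstruction, lam = 0 recovers the standard Normal: Z(0) = sqrt(2*pi).
assert np.isclose(partition(0.0), np.sqrt(2.0 * np.pi))
```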
When λ∈(1,∞), the non-normalized probability distributions form a family of bump functions:
$$b(x, \lambda) = \exp\!\left( -f\!\left( \frac{\lambda x^{2}}{\lambda - 1},\ \lambda \right) \right)$$
where:
- b: the bump function
- x: the input value
- λ: the transform parameter
The input to f(x,λ) is rescaled such that all bumps have support [−1, 1]. As λ approaches 1 from above, b(x,λ) approaches the Dirac delta.
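A sketch of the bump built on `power_transform`; under the reconstruction above, the rescaled input reaches the pole of f(·, λ) exactly at |x| = 1, which is why the support is bounded:

```python
def bump(x, lam):
    """b(x, lam) for lam > 1: the value is exactly zero for |x| >= 1, since the
    rescaled input hits the pole of f(., lam) at the boundary."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    u = lam * x[inside] ** 2 / (lam - 1.0)   # rescale so the pole sits at |x| = 1
    out[inside] = np.exp(-power_transform(u, lam))
    return out

xs = np.linspace(-1.2, 1.2, 9)
vals = bump(xs, 2.0)                          # smooth bump, zero for |x| >= 1
```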
f(x,λ) can be generalized from non-negative inputs to all real numbers by introducing a free parameter λ− for negative inputs (and renaming λ to λ+), yielding a two-parameter family of functions:
$$f_{\pm}(x, \lambda_{+}, \lambda_{-}) = \begin{cases} f(x, \lambda_{+}) & \text{if } x \geq 0 \\ -f(-x, \lambda_{-}) & \text{if } x < 0 \end{cases}$$
where:
- f±: the generalized power transform function
- x: the input value
- λ+: the transform parameter for positive inputs
- λ−: the transform parameter for negative inputs
exp(x)−1, log(1+x), and the ELU activation are members of this family. Many sigmoid-shaped functions can be expressed as the composition of the Box-Cox transform and its inverse, and this approach can similarly be used to ground the softplus, sigmoid, tanh, and ReLU activations in terms of f(x,λ).
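A sketch of the two-parameter extension; the ELU check uses parameter values (λ+ = 0, λ− = −∞) that I inferred from the limits assumed in the earlier sketches, not values quoted from the paper:

```python
def power_transform_pm(x, lam_pos, lam_neg):
    """f_pm: f(., lam_pos) on x >= 0, the reflection -f(-x, lam_neg) on x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0,
                    power_transform(np.maximum(x, 0.0), lam_pos),
                    -power_transform(np.maximum(-x, 0.0), lam_neg))

z = np.linspace(-3.0, 3.0, 13)
elu = np.where(z >= 0.0, z, np.expm1(z))      # ELU with unit scale
assert np.allclose(power_transform_pm(z, 0.0, -np.inf), elu)
```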
The power transform presented here is related to the Box-Cox transform: f(x,λ) is a variant of ĥ(x,λ) in which the top half of the transform (λ > 1) is replaced with the inverse of the transform for λ < 1. There exists a bijection between the two transforms.
f(x,λ) can and should be implemented using the expm1(x) = exp(x) − 1 and log1p(x) = log(1 + x) functions. The stable implementation yields significantly more accurate results, especially when λ is near ±1 and when |λ| ≫ 1.
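One standard way to realize this recommendation is to rewrite the general-case power as an exponential of a logarithm, so that expm1 and log1p absorb the leading "1 +" and trailing "− 1". A hedged sketch, again based on my reconstruction of the root form and omitting the special-case branches:

```python
def power_transform_stable(x, lam):
    """General-case f(x, lam) via expm1/log1p.

    (1 + a*x)**p - 1 == expm1(p * log1p(a*x)), which avoids catastrophic
    cancellation when a*x is tiny or p is large (lam near +/-1 or |lam| >> 1).
    The cases lam in {-inf, -1, 0, 1, +inf} would still use dedicated branches,
    as in the earlier sketch.
    """
    x = np.asarray(x, dtype=float)
    a = abs(lam)
    scale = 2.0 * a / (2.0 - a + lam)
    slope = (2.0 - a - lam) / (2.0 * a)
    power = 1.0 / (1.0 - a) if lam > 0 else 1.0 - a
    return scale * np.expm1(power * np.log1p(slope * x))
```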