- The paper introduces a novel power transform unifying diverse mathematical concepts like loss functions, kernels, probability distributions, and neural network activations.
- The paper presents a fast and stable algorithm with a reference implementation for the power transform, addressing previous numerical issues.
- Specific cases of this transform generate various robust loss functions, common probability distributions, stationary kernels, and neural network activations.
The paper introduces a novel power transform, serving as a unifying framework for loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions. The author identifies numerical stability and speed issues with previous iterations of this power transform and presents a fast and stable algorithm alongside a reference implementation.
The "root" form of the power transform is defined as:
$$f(x,\lambda) \triangleq \frac{2|\lambda|}{2-|\lambda|+\lambda}\left(\left(1+\frac{2-|\lambda|-\lambda}{2|\lambda|}\,x\right)^{(1-|\lambda|)^{-\operatorname{sgn}(\lambda)}}-1\right)$$
where:
- f: the power transform function
- x: the input value
- λ: the transform parameter
- sgn: the sign function
This equation exhibits removable singularities, with specific cases defined for λ = +∞, 1, 0, −1, −∞, and general forms for 0 < λ < +∞ and −∞ < λ < 0. Several special cases of f(x, λ) correspond to curves and transforms commonly used in various contexts. For instance, λ = 0 corresponds to the identity, λ = −1/2 to a "padded" square root, and λ = 1/2 to a shifted and scaled quadratic function. When λ < 0, the transform is a normalized version of the Box-Cox transform.
The transform is monotonic with respect to both x and λ, and it is self-inverting: f(f(x, λ), −λ) = x.
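As a concreteness check, the definition above can be sketched in a few lines of NumPy. This is an illustration of the reconstructed formula, not the author's reference implementation; the branches handle the removable singularities, and the general case is evaluated via expm1 and log1p, as the paper recommends:

```python
import numpy as np

def power_transform(x, lam):
    """Sketch of f(x, lam): special cases for the removable singularities,
    then the general form a * ((1 + b*x)**E - 1), evaluated stably."""
    if lam == 0:
        return x                   # identity
    if lam == 1:
        return np.expm1(x)         # exp(x) - 1
    if lam == -1:
        return np.log1p(x)         # log(1 + x)
    if lam == np.inf:
        return -np.log1p(-x)       # -log(1 - x), requires x < 1
    if lam == -np.inf:
        return -np.expm1(-x)       # 1 - exp(-x)
    a = 2 * abs(lam) / (2 - abs(lam) + lam)
    b = (2 - abs(lam) - lam) / (2 * abs(lam))
    E = (1 - abs(lam)) ** -np.sign(lam)
    # (1 + b*x)**E - 1 == expm1(E * log1p(b*x)), which stays accurate
    # when b*x (or the whole expression) is close to zero.
    return a * np.expm1(E * np.log1p(b * x))

x = np.linspace(0.0, 4.0, 5)
assert np.allclose(power_transform(power_transform(x, 0.3), -0.3), x)  # self-inverting
```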
Applying f to the quadratic (L2) loss function ½(x/c)² with a scale parameter c yields a family of robust losses:
$$\rho(x,\lambda,c) \triangleq f\!\left(\frac{1}{2}\left(\frac{x}{c}\right)^{2},\,\lambda\right)$$
where:
- ρ: the robust loss function
- x: the input value
- λ: the transform parameter
- c: the scale parameter
This family includes L2 loss, Cauchy loss, and Welsch loss. Charbonnier and Geman-McClure losses are also special cases. As λ decreases from 0 to −∞, the loss function becomes more robust. This family of loss functions is a superset of a previous family of robust losses the author explored, which corresponds to this family where λ ≤ 0, and is also a superset of the "Smooth Exponential Family" of loss functions.
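Under this reading, the loss family is just f applied to a scaled quadratic. A hypothetical sketch, with the λ values for the named losses inferred from the special cases of f rather than quoted from the paper:

```python
def robust_loss(x, lam, c):
    # rho(x, lam, c) = f((x/c)**2 / 2, lam)
    return power_transform(0.5 * (x / c) ** 2, lam)

x = np.linspace(-4.0, 4.0, 9)
l2     = robust_loss(x, 0.0,     1.0)  # x**2 / 2
cauchy = robust_loss(x, -1.0,    1.0)  # log(1 + x**2 / 2)
welsch = robust_loss(x, -np.inf, 1.0)  # 1 - exp(-x**2 / 2)
```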
Taking the derivative of the loss function ρ(x, λ, c) and scaling it by c²/x yields a family of stationary kernel functions:
$$k(r,\lambda,c) \triangleq \frac{c^{2}}{r}\,\frac{\partial \rho(r,\lambda,c)}{\partial r}$$
where:
- k: the stationary kernel function
- r: the Euclidean distance between two points
- λ: the transform parameter
- c: the scale parameter
Several cases are commonly used kernels. When λ < 0, this kernel is equivalent to a rescaled Student's t-distribution. The kernel, ignoring the c² scaling, is equivalent to the weights used by iteratively reweighted least squares when minimizing ρ.
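Reading the kernel as the (rescaled) IRLS weight profile gives a compact sketch; again, the λ values attached to the named kernels are inferences from the formula above, not quotations:

```python
def stationary_kernel(r, lam, c=1.0):
    # k(r, lam, c) = f'((r/c)**2 / 2, lam), i.e. rho'(r)/r up to c**2.
    # (lam == 1 and lam == +inf would need their own limits.)
    z = 0.5 * (r / c) ** 2
    if lam == 0:
        return np.ones_like(z)  # L2: constant IRLS weights
    if lam == -np.inf:
        return np.exp(-z)       # Gaussian (RBF) kernel
    b = (2 - abs(lam) - lam) / (2 * abs(lam))
    E = (1 - abs(lam)) ** -np.sign(lam)
    # f'(z, lam) = (1 + b*z)**(E - 1); for lam < 0 this is the
    # Student's-t-like profile (1 + z/|lam|)**lam.
    return np.exp((E - 1) * np.log1p(b * z))
```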
Exponentiating the negated loss function −ρ(x, λ, c) and normalizing by the integral yields a family of probability distributions:
$$p(x \mid \lambda, c) \triangleq \frac{e^{-\rho(x,\lambda,c)}}{c\,Z(\lambda)}, \qquad Z(\lambda) \triangleq \int e^{-\rho(x,\lambda,1)}\,dx$$
where:
- p: the probability distribution
- x: the input value
- λ: the transform parameter
- c: the scale parameter
- Z(λ): the normalization factor
This family includes the Epanechnikov, Normal, "smooth Laplace", and Cauchy distributions. The distribution is undefined when λ < −1, and it has a bounded support when λ > 1.
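Z(λ) generally lacks a simple closed form, so a sketch can fall back on numerical integration (assuming SciPy; valid here for −1 ≤ λ ≤ 1, where the support is all of ℝ and the integral converges):

```python
from scipy.integrate import quad

def pdf(x, lam, c):
    # p(x | lam, c) = exp(-rho(x, lam, c)) / (c * Z(lam)).
    # Z(lam) = integral of exp(-rho(u, lam, 1)) du over the reals;
    # it diverges for lam < -1, matching the text above.
    Z, _ = quad(lambda u: np.exp(-robust_loss(u, lam, 1.0)), -np.inf, np.inf)
    return np.exp(-robust_loss(x, lam, c)) / (c * Z)
```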
When λ > 1, the non-normalized probability distributions form a family of bump functions:
$$b(x,\lambda) \triangleq e^{-\rho(s_{\lambda}\,x,\;\lambda,\;1)}$$
where:
- b: the bump function
- x: the input value
- λ: the transform parameter
The input to b is rescaled by s_λ such that all bumps have a support of [−1, 1]. As λ approaches 1 from above, b approaches the Dirac delta.
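A sketch of the rescaling for finite λ > 1, with the half-width s_λ derived from where the base of the power in f reaches zero (an inference from the reconstructed formula, not the paper's own expression):

```python
def bump(x, lam):
    # With c = 1, exp(-rho(u, lam, 1)) has support |u| < sqrt(2*lam/(lam-1))
    # when lam > 1; rescaling by that half-width puts the support at [-1, 1].
    s = np.sqrt(2.0 * lam / (lam - 1.0))
    inside = np.abs(x) < 1.0
    u = np.where(inside, x, 0.0) * s  # clamp to the domain to avoid NaNs
    return np.where(inside, np.exp(-robust_loss(u, lam, 1.0)), 0.0)
```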
f can be generalized from non-negative inputs to all real numbers by introducing a free parameter λ⁻ for negative inputs (and renaming λ to λ⁺), yielding a two-parameter family of functions:
$$g(x,\lambda^{+},\lambda^{-}) \triangleq \begin{cases} f(x,\lambda^{+}) & x \geq 0 \\ f(x,\lambda^{-}) & x < 0 \end{cases}$$
where:
- g: the generalized power transform function
- x: the input value
- λ⁺: the transform parameter for positive inputs
- λ⁻: the transform parameter for negative inputs
The identity, f(x, 1) = eˣ − 1, and the ELU activation are members of this family. Many sigmoid-shaped functions can be expressed as the composition of the Box-Cox transform and its inverse, and this approach can similarly be used to ground the softplus, sigmoid, tanh, and ReLU activations in terms of g.
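A guess at the construction, with the split at zero inferred from the ELU example (λ⁺ = 0, λ⁻ = 1 reproduces ELU exactly):

```python
def power_activation(x, lam_pos, lam_neg):
    # g(x, lam+, lam-): apply f with lam+ for x >= 0 and lam- for x < 0.
    pos = power_transform(np.maximum(x, 0.0), lam_pos)
    neg = power_transform(np.minimum(x, 0.0), lam_neg)
    return np.where(x >= 0.0, pos, neg)

elu = lambda x: power_activation(x, 0.0, 1.0)  # x for x >= 0, exp(x) - 1 below
```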
The power transform presented here is related to the Box-Cox transform. f is a variant of the Box-Cox transform in which the top half of the transform (λ > 0) is replaced with the inverse of the transform for −λ. There exists a bijection between the two transforms.
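For the bottom half, the reconstructed f reduces to a shifted and scaled Box-Cox transform, which can be spot-checked against SciPy's boxcox1p:

```python
from scipy.special import boxcox1p

# For lam < 0: f(x, lam) == |lam| * boxcox1p(x / |lam|, 1 + lam).
x, lam = 3.0, -0.5
lhs = power_transform(x, lam)                    # sqrt(1 + 2*3) - 1
rhs = abs(lam) * boxcox1p(x / abs(lam), 1 + lam)
assert np.isclose(lhs, rhs)
```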
f can and should be implemented using the expm1 and log1p functions. The stable implementation yields significantly more accurate results, especially when λ is near 0 and when |x| is small.
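A quick illustration of the payoff, using the λ = −1/2 case f(x, −1/2) = √(1+2x) − 1 from the reconstruction above:

```python
x = 1e-12
naive  = np.sqrt(1.0 + 2.0 * x) - 1.0       # catastrophic cancellation
stable = np.expm1(0.5 * np.log1p(2.0 * x))  # accurate to full precision
print(naive, stable)  # naive is correct to only ~4 digits here
```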