- The paper introduces a novel power transform unifying diverse mathematical concepts like loss functions, kernels, probability distributions, and neural network activations.
- The paper presents a fast and stable algorithm with a reference implementation for the power transform, addressing previous numerical issues.
- Specific cases of this transform generate various robust loss functions, common probability distributions, stationary kernels, and neural network activations.
The paper introduces a novel power transform, serving as a unifying framework for loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions. The author identifies numerical stability and speed issues with previous iterations of this power transform and presents a fast and stable algorithm alongside a reference implementation.
The "root" form of the power transform is defined as:
$$f(x, \lambda) \triangleq \frac{2|\lambda|}{2 - |\lambda| + \lambda} \cdot \left( \left( 1 + \frac{2 - |\lambda| - \lambda}{2|\lambda|} \, x \right)^{\left(\frac{1}{1 - |\lambda|}\right)^{\operatorname{sgn}(\lambda)}} - 1 \right)$$
where:
- f: the power transform function
- x: the input value
- λ: the transform parameter
- sgn: the sign function
This equation exhibits removable singularities, with specific cases defined for λ = +∞, 1, 0, −1, −∞, and general forms for 0 < λ < +∞ and −∞ < λ < 0. Several special cases of f(x,λ) correspond to curves and transforms commonly used in various contexts. For instance, λ = 0 corresponds to the identity, λ = −1/2 to a "padded" square root, and λ = 1/2 to a shifted and scaled quadratic function. When λ ∈ (−∞, 1), the transform is a normalized version of the Box-Cox transform.
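To make the definition concrete, here is a minimal NumPy sketch that evaluates this root form directly. The name `power_transform` is mine, the special-case branches are my derivation of the limits at the removable singularities (consistent with the family members discussed later, e.g. exp(x) − 1 and log(1 + x)), and this is not the paper's reference implementation:

```python
import numpy as np

def power_transform(x, lam):
    """Direct evaluation of the root form f(x, lam) as reconstructed above.

    The branches for lam in {-inf, -1, 0, 1, +inf} are the limits of the
    general expression at its removable singularities.
    """
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return x                        # identity
    if lam == 1:
        return np.expm1(x)              # exp(x) - 1
    if lam == -1:
        return np.log1p(x)              # log(1 + x)
    if lam == np.inf:
        return -np.log1p(-x)            # -log(1 - x)
    if lam == -np.inf:
        return -np.expm1(-x)            # 1 - exp(-x)
    a = abs(lam)
    scale = 2.0 * a / (2.0 - a + lam)                # outer multiplier
    slope = (2.0 - a - lam) / (2.0 * a)              # multiplier on x inside the power
    power = 1.0 / (1.0 - a) if lam > 0 else 1.0 - a  # (1 / (1 - |lam|)) ** sgn(lam)
    return scale * ((1.0 + slope * x) ** power - 1.0)
```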
The transform is monotonic with respect to both x and λ, and it is self-inverting: f⁻¹(x, λ) = f(x, −λ).
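A quick numerical check of both properties, reusing the `power_transform` sketch above (the particular λ values and inputs are arbitrary):

```python
# Self-inversion: applying f(., lam) and then f(., -lam) should return the input.
xs = np.linspace(0.0, 1.5, 7)
for lam in (-2.0, -0.5, 0.5, 2.0):
    assert np.allclose(power_transform(power_transform(xs, lam), -lam), xs)

# Monotonicity in x (for a fixed lam) and in lam (for a fixed x).
assert np.all(np.diff(power_transform(xs, 0.5)) > 0)
assert power_transform(1.0, -0.5) < power_transform(1.0, 0.0) < power_transform(1.0, 0.5)
```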
Applying f(x,λ) to a quadratic (L2) loss function ½(x/c)² with a scale parameter c > 0 yields a family of robust losses:
$$\rho(x, \lambda, c) \triangleq f\!\left( \tfrac{1}{2} \left( \tfrac{x}{c} \right)^{2},\ \lambda \right)$$
where:
- ρ: the robust loss function
- x: the input value
- λ: the transform parameter
- c: the scale parameter
This family includes L2 loss, Cauchy loss, and Welsch loss. Charbonnier and Geman-McClure losses are also special cases. As λ decreases from 0 to −∞, the loss function becomes more robust. This family of loss functions is a superset of a previous family of robust losses the author explored, which corresponds to the subfamily with λ ≤ 1, and is also a superset of the "Smooth Exponential Family" of loss functions.
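As an illustration, a sketch of the loss built on top of `power_transform`; `robust_loss` is a name I've chosen, and the checks encode special cases (L2 at λ = 0, Cauchy at λ = −1, Welsch at λ = −∞) that follow from the limits assumed in the earlier sketch rather than being quoted from the paper:

```python
def robust_loss(x, lam, c):
    """rho(x, lam, c) = f((x / c)**2 / 2, lam)."""
    return power_transform(0.5 * (np.asarray(x, dtype=float) / c) ** 2, lam)

r = np.array([-3.0, -0.5, 0.0, 2.0])
assert np.allclose(robust_loss(r, 0.0, 1.0), 0.5 * r**2)                  # L2
assert np.allclose(robust_loss(r, -1.0, 1.0), np.log1p(0.5 * r**2))       # Cauchy
assert np.allclose(robust_loss(r, -np.inf, 1.0), -np.expm1(-0.5 * r**2))  # Welsch
```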
Taking the derivative of the loss function ρ(x,λ,c) and scaling it by c²/x yields a family of stationary kernel functions:
$$k(x, \lambda, c) \triangleq \frac{c^{2}}{x} \cdot \frac{\partial \rho(x, \lambda, c)}{\partial x}$$
where:
- k: the stationary kernel function
- x: the Euclidean distance between two points
- λ: the transform parameter
- c: the scale parameter
Several cases are commonly used kernels. When λ < −1/2, this kernel is equivalent to a rescaled Student's t-distribution. The kernel, ignoring the c² scaling, is equivalent to the weights used by iteratively reweighted least-squares when minimizing ρ(x,λ,c).
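One way to evaluate the kernel follows from the chain rule: ∂ρ/∂x = f′(u, λ) · x/c² with u = ½(x/c)², so k(x, λ, c) reduces to f′(u, λ). The sketch below uses the closed-form derivative of my reconstruction of f, restricted to λ < 0 (plus the λ = −∞ limit) for brevity; `stationary_kernel` is my name:

```python
def stationary_kernel(x, lam, c):
    """k(x, lam, c) = (c**2 / x) * d/dx rho(x, lam, c) = f'(u, lam), u = (x/c)**2 / 2.

    For lam < 0 the derivative of the reconstructed root form is (1 + u / -lam) ** lam.
    """
    u = 0.5 * (np.asarray(x, dtype=float) / c) ** 2
    if lam == -np.inf:
        return np.exp(-u)              # squared-exponential (RBF) kernel
    assert lam < 0
    return (1.0 + u / -lam) ** lam     # lam = -1: Cauchy / rational kernel

d = np.linspace(0.0, 4.0, 9)           # Euclidean distances between point pairs
assert np.allclose(stationary_kernel(d, -1.0, 1.0), 1.0 / (1.0 + 0.5 * d**2))
```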
Exponentiating the loss function ρ(x,λ,c) and normalizing by the integral yields a family of probability distributions:
$$P(x, \lambda, c) = \frac{1}{c \cdot Z(\lambda)} \exp(-\rho(x, \lambda, c)), \qquad Z(\lambda) = \int_{-\infty}^{\infty} \exp(-\rho(x, \lambda, 1)) \, dx$$
where:
- P: the probability distribution
- x: the input value
- λ: the transform parameter
- c: the scale parameter
- Z: the normalization factor
This family includes the Epanechnikov, Normal, "smooth Laplace", and Cauchy distributions. The distribution is undefined when λ < −1, and it has bounded support when λ > 1.
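A sketch of evaluating the normalizer and the density numerically, reusing `robust_loss` from above; the quadrature is my stand-in for however the paper computes Z(λ), and the closing check (λ = 0 giving the standard Normal with Z(0) = √(2π)) follows from the reconstruction rather than the paper's text:

```python
from scipy.integrate import quad

def partition(lam):
    """Z(lam), computed by quadrature.  Sketch for -1 <= lam <= 1; for lam > 1
    the support is bounded, so the integration limits would need clipping."""
    value, _ = quad(lambda t: np.exp(-robust_loss(t, lam, 1.0)), -np.inf, np.inf)
    return value

def density(x, lam, c):
    return np.exp(-robust_loss(x, lam, c)) / (c * partition(lam))

# Under the reconstruction, lam = 0 recovers the standard Normal: Z(0) = sqrt(2*pi).
assert np.isclose(partition(0.0), np.sqrt(2.0 * np.pi))
```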
When λ∈(1,∞), the non-normalized probability distributions form a family of bump functions:
$$b(x, \lambda) = \exp\!\left( -f\!\left( \frac{\lambda x^{2}}{\lambda - 1},\ \lambda \right) \right)$$
where:
- b: the bump function
- x: the input value
- λ: the transform parameter
The input to f(x,λ) is rescaled such that all bumps have support [−1, 1]. As λ approaches 1 from above, b(x,λ) approaches the Dirac delta.
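A sketch of the bump built on `power_transform`; under the reconstruction above, the rescaled input reaches the pole of f(·, λ) exactly at |x| = 1, which is why the support is bounded:

```python
def bump(x, lam):
    """b(x, lam) for lam > 1: the value is exactly zero for |x| >= 1, since the
    rescaled input hits the pole of f(., lam) at the boundary."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    u = lam * x[inside] ** 2 / (lam - 1.0)   # rescale so the pole sits at |x| = 1
    out[inside] = np.exp(-power_transform(u, lam))
    return out

xs = np.linspace(-1.2, 1.2, 9)
vals = bump(xs, 2.0)                          # smooth bump, zero for |x| >= 1
```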
f(x,λ) can be generalized from non-negative inputs to all real numbers by introducing a free parameter λ− for negative inputs (and renaming λ to λ+), yielding a two-parameter family of functions:
$$f_{\pm}(x, \lambda_{+}, \lambda_{-}) = \begin{cases} f(x, \lambda_{+}) & \text{if } x \geq 0 \\ -f(-x, \lambda_{-}) & \text{if } x < 0 \end{cases}$$
where:
- f±: the generalized power transform function
- x: the input value
- λ+: the transform parameter for positive inputs
- λ−: the transform parameter for negative inputs
exp(x)−1, log(1+x), and the ELU activation are members of this family. Many sigmoid-shaped functions can be expressed as the composition of the Box-Cox transform and its inverse, and this approach can similarly be used to ground the softplus, sigmoid, tanh, and ReLU activations in terms of f(x,λ).
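A sketch of the two-parameter extension; the ELU check uses parameter values (λ+ = 0, λ− = −∞) that I inferred from the limits assumed in the earlier sketches, not values quoted from the paper:

```python
def power_transform_pm(x, lam_pos, lam_neg):
    """f_pm: f(., lam_pos) on x >= 0, the reflection -f(-x, lam_neg) on x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0,
                    power_transform(np.maximum(x, 0.0), lam_pos),
                    -power_transform(np.maximum(-x, 0.0), lam_neg))

z = np.linspace(-3.0, 3.0, 13)
elu = np.where(z >= 0.0, z, np.expm1(z))      # ELU with unit scale
assert np.allclose(power_transform_pm(z, 0.0, -np.inf), elu)
```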
The power transform presented here is related to the Box-Cox transform: f(x,λ) is a variant of ĥ(x,λ) in which the top half of the transform (λ > 1) is replaced with the inverse of the transform for λ < 1. There exists a bijection between the two transforms.
f(x,λ) can and should be implemented using the expm1(x) = exp(x) − 1 and log1p(x) = log(1 + x) functions. The stable implementation yields significantly more accurate results, especially when λ is near ±1 and when |λ| ≫ 1.
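One standard way to realize this recommendation is to rewrite the general-case power as an exponential of a logarithm, so that expm1 and log1p absorb the leading "1 +" and trailing "− 1". A hedged sketch, again based on my reconstruction of the root form and omitting the special-case branches:

```python
def power_transform_stable(x, lam):
    """General-case f(x, lam) via expm1/log1p.

    (1 + a*x)**p - 1 == expm1(p * log1p(a*x)), which avoids catastrophic
    cancellation when a*x is tiny or p is large (lam near +/-1 or |lam| >> 1).
    The cases lam in {-inf, -1, 0, 1, +inf} would still use dedicated branches,
    as in the earlier sketch.
    """
    x = np.asarray(x, dtype=float)
    a = abs(lam)
    scale = 2.0 * a / (2.0 - a + lam)
    slope = (2.0 - a - lam) / (2.0 * a)
    power = 1.0 / (1.0 - a) if lam > 0 else 1.0 - a
    return scale * np.expm1(power * np.log1p(slope * x))
```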