
A Power Transform (2502.10647v1)

Published 15 Feb 2025 in cs.LG, stat.ML, and stat.TH

Abstract: Power transforms, such as the Box-Cox transform and Tukey's ladder of powers, are a fundamental tool in mathematics and statistics. These transforms are primarily used for normalizing and standardizing datasets, effectively by raising values to a power. In this work I present a novel power transform, and I show that it serves as a unifying framework for a wide family of loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions.

Summary

  • The paper introduces a novel power transform unifying diverse mathematical concepts like loss functions, kernels, probability distributions, and neural network activations.
  • The paper presents a fast and stable algorithm with a reference implementation for the power transform, addressing previous numerical issues.
  • Specific cases of this transform generate various robust loss functions, common probability distributions, stationary kernels, and neural network activations.

The paper introduces a novel power transform, serving as a unifying framework for loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions. The author identifies numerical stability and speed issues with previous iterations of this power transform and presents a fast and stable algorithm alongside a reference implementation.

The "root" form of the power transform is defined as:

f(x, \lambda) \triangleq \frac{2\,|\lambda|}{2 - |\lambda| + \lambda} \cdot \left( \left( 1 + \frac{2 - |\lambda| - \lambda}{2\,|\lambda|}\, x \right)^{\left(\frac{1}{1 - |\lambda|}\right)^{\operatorname{sgn}(\lambda)}} - 1 \right)

where:

  • f(x, λ): the power transform function
  • x: the input value
  • λ: the transform parameter
  • sgn: the sign function

This equation exhibits removable singularities, with specific cases defined for λ = +∞, 1, 0, -1, -∞, and general forms for 0 < λ < +∞ and -∞ < λ < 0. Several special cases of f(x, λ) correspond to curves and transforms commonly used in various contexts. For instance, λ = 0 corresponds to the identity, λ = -1/2 to a "padded" square root, and λ = 1/2 to a shifted and scaled quadratic function. When λ ∈ (-∞, 1), the transform is a normalized version of the Box-Cox transform.

The transform is monotonic with respect to both x and λ, and it is self-inverting in λ: f⁻¹(x, λ) = f(x, -λ).
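
To make the branch structure concrete, here is a minimal NumPy sketch of f(x, λ) reconstructed from the definition above; the function name is illustrative and this is not the paper's reference implementation (the λ = ±∞ limits are omitted):

```python
import numpy as np

def power_transform(x, lam):
    """Sketch of f(x, lam) for x >= 0. The removable singularities at
    lam = 0 and lam = +/-1 are replaced by their limits; the lam = +/-inf
    limits are omitted for brevity."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return x                  # identity
    if lam == 1:
        return np.expm1(x)        # limit: exp(x) - 1
    if lam == -1:
        return np.log1p(x)        # limit: log(1 + x)
    a = abs(lam)
    scale = 2 * a / (2 - a + lam)
    inner = (2 - a - lam) / (2 * a)
    power = (1 / (1 - a)) ** np.sign(lam)
    return scale * ((1 + inner * x) ** power - 1)
```

Self-inversion can then be spot-checked numerically: power_transform(power_transform(2.0, 0.5), -0.5) returns 2.0.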

Applying f(x, λ) to a quadratic (L2) loss function ½(x/c)² with a scale parameter c > 0 yields a family of robust losses:

\rho(x, \lambda, c) \triangleq f\!\left( \tfrac{1}{2} \left( \tfrac{x}{c} \right)^{2}, \lambda \right)

where:

  • ρ: the robust loss function
  • x: the input value
  • λ: the transform parameter
  • c: the scale parameter

This family includes L2 loss, Cauchy loss, and Welsch loss. Charbonnier and Geman-McClure losses are also special cases. As λ decreases from 0 to -∞, the loss function becomes more robust. This family of loss functions is a superset of a previous family of robust losses the author explored, which corresponds to this family restricted to λ ≤ 1, and is also a superset of the "Smooth Exponential Family" of loss functions.
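
As a sketch, the loss family is a one-liner on top of the hypothetical power_transform above:

```python
def robust_loss(x, lam, c):
    """rho(x, lam, c) = f(0.5 * (x / c)**2, lam).
    Under the sketch's conventions: lam = 0 gives L2 loss, lam = -1
    gives Cauchy loss, and lam -> -inf approaches Welsch loss."""
    return power_transform(0.5 * (x / c) ** 2, lam)
```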

Taking the derivative of the loss function ρ(x, λ, c) with respect to x and scaling it by c²/x yields a family of stationary kernel functions:

k(x, \lambda, c) \triangleq \frac{c^2}{x} \cdot \frac{\partial}{\partial x} \rho(x, \lambda, c)

where:

  • k: the stationary kernel function
  • x: the Euclidean distance between two points
  • λ: the transform parameter
  • c: the scale parameter

Several special cases are commonly used kernels. When λ < -1/2, this kernel is equivalent to a rescaled Student's t-distribution. Ignoring the c² scaling, the kernel is equivalent to the weights used by iteratively reweighted least squares when minimizing ρ(x, λ, c).
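
A numerical sketch of the kernel, substituting a central finite difference for the analytic derivative (h is an arbitrary step size, and x is assumed to be positive):

```python
def kernel(x, lam, c, h=1e-6):
    """k(x, lam, c) = (c**2 / x) * d/dx rho(x, lam, c), approximated
    here with a central finite difference of the loss sketch above."""
    drho = (robust_loss(x + h, lam, c) - robust_loss(x - h, lam, c)) / (2 * h)
    return c ** 2 / x * drho
```

For example, k(x, 0, c) = 1 for all x, matching the constant IRLS weight of L2 loss.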

Exponentiating the negated loss function ρ(x, λ, c) and normalizing by the integral yields a family of probability distributions:

P(x, \lambda, c) = \frac{1}{c \cdot Z(\lambda)} \exp(-\rho(x, \lambda, c)), \quad Z(\lambda) = \int_{-\infty}^{\infty} \exp(-\rho(x, \lambda, 1)) \, dx

where:

  • P: the probability distribution
  • x: the input value
  • λ: the transform parameter
  • c: the scale parameter
  • Z: the normalization factor

This family includes the Epanechnikov, Normal, "smooth Laplace", and Cauchy distributions. The distribution is undefined when λ < -1, and it has bounded support when λ > 1.
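
A sketch of how the density could be evaluated, building on the sketches above and computing Z(λ) with SciPy's numerical quadrature; this covers the unbounded-support cases, while λ > 1 would need the integration limits clipped to the bounded support:

```python
from scipy.integrate import quad

def density(x, lam, c):
    """P(x, lam, c) = exp(-rho(x, lam, c)) / (c * Z(lam))."""
    z, _ = quad(lambda u: np.exp(-robust_loss(u, lam, 1.0)), -np.inf, np.inf)
    return np.exp(-robust_loss(x, lam, c)) / (c * z)
```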

When λ ∈ (1, ∞), the non-normalized probability distributions form a family of bump functions:

b(x, \lambda) = \exp\!\left( -f\!\left( \frac{\lambda}{\lambda - 1}\, x^{2}, \lambda \right) \right)

where:

  • b: the bump function
  • x: the input value
  • λ: the transform parameter

The input to f(x, λ) is rescaled such that all bumps have a support of [-1, 1]. As λ approaches 1 from above, b(x, λ) approaches the Dirac delta.
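
A sketch of the bump family in terms of the earlier power_transform; the input rescaling matches the equation above, and values outside [-1, 1] are set to zero:

```python
def bump(x, lam):
    """b(x, lam) for lam > 1, supported on [-1, 1]."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < 1
    u = lam / (lam - 1) * x[inside] ** 2
    out[inside] = np.exp(-power_transform(u, lam))
    return out
```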

f(x, λ) can be generalized from non-negative inputs to all real numbers by introducing a free parameter λ₋ for negative inputs (and renaming λ to λ₊), yielding a two-parameter family of functions:

f_\pm(x, \lambda_+, \lambda_-) = \begin{cases} f(x, \lambda_+) & \text{if } x \ge 0 \\ -f(-x, \lambda_-) & \text{if } x < 0 \end{cases}

where:

  • f±: the generalized power transform function
  • x: the input value
  • λ₊: the transform parameter for positive inputs
  • λ₋: the transform parameter for negative inputs

exp(x) - 1, log(1 + x), and the ELU activation are members of this family. Many sigmoid-shaped functions can be expressed as the composition of a Box-Cox transform and its inverse, and this approach can similarly be used to ground the softplus, sigmoid, tanh, and ReLU activations in terms of f(x, λ).
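
A sketch of f± in terms of the hypothetical power_transform above (both branches are evaluated on clamped inputs, then merged; with λ₊ = 0 and a large negative λ₋ this approximates ELU):

```python
def power_transform_pm(x, lam_pos, lam_neg):
    """Two-parameter extension f_pm of f to all real inputs."""
    x = np.asarray(x, dtype=float)
    pos = power_transform(np.maximum(x, 0.0), lam_pos)
    neg = -power_transform(np.maximum(-x, 0.0), lam_neg)
    return np.where(x >= 0, pos, neg)
```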

The power transform presented here is related to the Box-Cox transform: f(x, λ) is a variant of a normalized Box-Cox transform ĥ(x, λ) in which the top half of the transform (λ > 1) is replaced with the inverse of the transform for λ < 1. There exists a bijection between the two transforms.

f(x, λ) can and should be implemented using the expm1(x) = exp(x) - 1 and log1p(x) = log(1 + x) functions. The stable implementation yields significantly more accurate results, especially when λ is near ±1 and when |λ| ≫ 1.
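
The core identity behind that recommendation is (1 + u)^p - 1 = expm1(p · log1p(u)). A sketch of how the general branch of the earlier power_transform sketch would use it (again, not the paper's reference implementation):

```python
def stable_pow_minus_one(u, p):
    """(1 + u)**p - 1, avoiding the catastrophic cancellation of the
    naive expression when the result is near zero."""
    return np.expm1(p * np.log1p(u))

# In power_transform's general branch, replace
#   scale * ((1 + inner * x) ** power - 1)
# with
#   scale * stable_pow_minus_one(inner * x, power)
```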
