Yeo–Johnson Transformation
- The Yeo–Johnson transformation is a parametric, monotonic method that handles both negative and positive values, making it a general-purpose tool for Gaussianizing real-valued data.
- It stabilizes variance and mitigates skewness, making it well suited to meeting the normality assumptions of classical statistical and machine learning models.
- Robust estimation and Bayesian techniques further enhance its performance, enabling applications in federated learning, signal processing, and high-dimensional inference.
The Yeo–Johnson transformation is a family of parametric, monotonic transformations designed to “Gaussianize” data distributed on the real line. Unlike the classical Box–Cox transformation, which is restricted to strictly positive variables, the Yeo–Johnson transformation handles both positive and negative values via a continuous piecewise definition. Its flexibility and robust statistical properties have established it as a standard tool for stabilizing variance, mitigating skewness, and enabling the assumptions of normality required by various statistical methods. The technique has seen broad application across Bayesian transformation selection, robust estimation, signal processing (e.g., SAR despeckling), semiparametric regression, high-dimensional variational inference, and privacy-critical federated learning scenarios.
1. Mathematical Definition and Properties
For a real-valued variable $x$ and transformation parameter $\lambda$, the Yeo–Johnson transformation is defined as:
- For $x \ge 0$: $\psi(x; \lambda) = \dfrac{(x+1)^{\lambda} - 1}{\lambda}$ if $\lambda \ne 0$, and $\psi(x; \lambda) = \log(x+1)$ if $\lambda = 0$.
- For $x < 0$: $\psi(x; \lambda) = -\dfrac{(1-x)^{2-\lambda} - 1}{2-\lambda}$ if $\lambda \ne 2$, and $\psi(x; \lambda) = -\log(1-x)$ if $\lambda = 2$.
This definition guarantees continuity and differentiability everywhere on $\mathbb{R}$. The transformation reduces to the identity map for $\lambda = 1$, to the log transform at $\lambda = 0$ (for $x \ge 0$), and to the negative log at $\lambda = 2$ (for $x < 0$). In all common usages, an accompanying Jacobian correction is employed when the transformation is used in a probabilistic or likelihood framework:
- For $x \ge 0$: $\dfrac{\partial \psi(x; \lambda)}{\partial x} = (x+1)^{\lambda - 1}$.
- For $x < 0$: $\dfrac{\partial \psi(x; \lambda)}{\partial x} = (1-x)^{1-\lambda}$.
The log-Jacobian of a sample $x_1, \dots, x_n$ therefore takes the compact form $(\lambda - 1)\sum_{i=1}^{n} \operatorname{sign}(x_i)\log(|x_i| + 1)$.
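The following is a minimal NumPy sketch of the piecewise map and its log-Jacobian, matching the formulas above (the function names are illustrative; `scipy.stats.yeojohnson` provides an equivalent reference implementation):

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a real-valued array x for parameter lam."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if abs(lam) > 1e-8:                          # lambda != 0 branch
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:                                        # lambda == 0: log(x + 1)
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2.0) > 1e-8:                    # lambda != 2 branch
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:                                        # lambda == 2: -log(1 - x)
        out[~pos] = -np.log1p(-x[~pos])
    return out

def log_jacobian(x, lam):
    """Log-Jacobian of a sample: (lam - 1) * sum(sign(x) * log(|x| + 1))."""
    x = np.asarray(x, dtype=float)
    return (lam - 1.0) * np.sum(np.sign(x) * np.log1p(np.abs(x)))
```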
2. Statistical Role and Rationale
The core utility of the Yeo–Johnson transformation is to render data distributions more closely aligned to the Gaussian paradigm, thereby facilitating classical modeling assumptions. Developed as a generalization of the Box–Cox family, it preserves essential features such as monotonicity and variance stabilization, while eliminating the positive-support constraint. This feature is particularly advantageous in fields such as climatology, genomics, and remote sensing where response or feature variables may be negative or have mixed signs.
Transformation parameter selection is crucial, as the value of $\lambda$ controls the normalization and Gaussianization effect. Method-of-moments, maximum likelihood, Bayesian, robustified, and minimum-distance strategies have been proposed for selecting $\lambda$ under differing objectives and contamination regimes (Charitidou et al., 2013, Raymaekers et al., 2020).
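As an illustration of the maximum-likelihood strategy, scipy profiles the Gaussian log-likelihood (with Jacobian correction) over $\lambda$ and returns the transformed sample; the data here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=1000) - 1.0   # skewed, mixed-sign sample

# Maximum-likelihood selection of lambda, plus the transformed data.
x_t, lam_hat = stats.yeojohnson(x)
print(f"lambda_hat = {lam_hat:.3f}; "
      f"skewness {stats.skew(x):.2f} -> {stats.skew(x_t):.2f}")
```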
3. Bayesian Transformation Selection and Inference
In model-based inference, the Yeo–Johnson transformation is a competing candidate for achieving approximate normality among several transformation families—including Box–Cox, Modulus, Dual, Log, and the identity map (Charitidou et al., 2013). The likelihood framework for observation vector $\mathbf{y} = (y_1, \dots, y_n)$ and transformation $T$ (with parameter $\lambda$) assumes normality after transformation:

$$f(\mathbf{y} \mid \lambda, \mu, \sigma^2, T) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(T(y_i; \lambda) - \mu\big)^2\right)\prod_{i=1}^{n}\left|\frac{\partial T(y_i; \lambda)}{\partial y_i}\right|$$
Here, location $\mu$ and scale $\sigma$ are marginalized or integrated out as nuisance parameters, while a prior is placed on $\lambda$. Prior specification reflects the need for comparability across transformation families. Both power-prior and unit-information (normal) priors are used to ensure compatible informativeness for $\lambda$ under different candidate functions.
Posterior inference for is intractable analytically; it is resolved using Markov chain Monte Carlo (MCMC) via a random-walk Metropolis–Hastings algorithm. Model selection among competing families is based on posterior model probabilities:
$$P(T_j \mid \mathbf{y}) = \frac{m_j(\mathbf{y})\,\pi(T_j)}{\sum_{k} m_k(\mathbf{y})\,\pi(T_k)},$$

where $m_j(\mathbf{y}) = \int f(\mathbf{y} \mid \lambda_j, T_j)\,\pi(\lambda_j)\,d\lambda_j$ is the marginal likelihood of transformation family $T_j$.
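A compact sketch of such a random-walk Metropolis–Hastings sampler for $\lambda$, using scipy's `yeojohnson_llf` as the profile log-likelihood and a vague normal prior (the prior and tuning constants are illustrative, not those of Charitidou et al.):

```python
import numpy as np
from scipy import stats

def log_posterior(lam, x, prior_sd=2.0):
    # yeojohnson_llf is the profile log-likelihood of lambda (mu and sigma^2
    # at their Gaussian MLEs after transforming), Jacobian term included.
    return stats.yeojohnson_llf(lam, x) + stats.norm.logpdf(lam, loc=1.0,
                                                            scale=prior_sd)

def rw_metropolis(x, n_iter=5000, step=0.2, seed=0):
    rng = np.random.default_rng(seed)
    lam = 1.0                                      # start at the identity map
    lp = log_posterior(lam, x)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = lam + step * rng.standard_normal()  # random-walk proposal
        lp_prop = log_posterior(prop, x)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            lam, lp = prop, lp_prop
        draws[t] = lam
    return draws

x = np.random.default_rng(42).gamma(2.0, 1.5, size=500) - 1.0  # toy mixed-sign data
lam_draws = rw_metropolis(x)
```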
Empirical findings confirm that the Yeo–Johnson family is especially favored when significant mass at negative values is accompanied by moderate skewness, whereas Box–Cox excels for highly skewed positive data, and Modulus for heavy-tailed distributions (Charitidou et al., 2013).
4. Robust and Computationally Efficient Estimation
The classical maximum likelihood estimator for $\lambda$ is sensitive to outliers, frequently distorting the transformation to pull outlying values towards the center at the expense of Gaussianity in the core of the data (Raymaekers et al., 2020). Recently, robust estimation approaches have been developed for the Yeo–Johnson transformation:
- Rectified Transformation Strategy: Applies a piecewise-linear extension outside robust quantiles to limit undue pull by extreme values.
- Minimum Distance and Quantile Mapping: $\lambda$ is chosen to minimize the discrepancy (e.g., Cramér–von Mises or Kolmogorov–Smirnov) between robustly standardized transformed quantiles and standard normal quantiles. Robust loss functions (notably Tukey’s bisquare) further suppress the influence of outliers.
- Reweighted Maximum Likelihood: After initial robust estimation, reweighting is performed so that only core data (judged via median and MAD after transformation) contribute fully to the likelihood.
Simulation studies demonstrate that such robust methodology greatly reduces bias and mean squared error in the estimated $\lambda$, particularly in the presence of contamination. Real-data applications show improved preservation of “central normality” for the bulk of the data, enhancing interpretability and aiding in subsequent analytic tasks such as principal component analysis or regression (Raymaekers et al., 2020).
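A simplified, self-contained sketch of the minimum-distance idea above (the exact criterion, rectification, and reweighting steps in Raymaekers et al. (2020) differ in detail):

```python
import numpy as np
from scipy import stats, optimize

def robust_cvm(lam, x):
    # Cramer-von Mises distance between robustly standardized transformed
    # data and N(0, 1); a simplified stand-in for the published criterion.
    z = stats.yeojohnson(x, lmbda=lam)
    med = np.median(z)
    mad = stats.median_abs_deviation(z, scale='normal')  # robust scale estimate
    u = np.sort(stats.norm.cdf((z - med) / mad))
    n = len(u)
    i = np.arange(1, n + 1)
    return 1.0 / (12.0 * n) + np.sum(((2.0 * i - 1.0) / (2.0 * n) - u) ** 2)

x = np.random.default_rng(7).gamma(2.0, 1.5, size=500) - 1.0
x[:10] = 25.0                                    # inject a cluster of outliers
res = optimize.minimize_scalar(robust_cvm, args=(x,), bounds=(-2.0, 4.0),
                               method='bounded')
print(f"robust lambda estimate: {res.x:.3f}")
```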
5. Applications Across Statistical and Signal Processing Domains
Bayesian and ML Preprocessing
The transformation is central in Bayesian model selection and feature preprocessing pipelines to achieve feature Gaussianization, which is critical for model requirements in both classical and machine learning domains. For example, its use in federated learning (via SecureFedYJ) enables privacy-preserving, pooled-parameter transformation across silos, exploiting the strict convexity of the negative log-likelihood for computationally efficient and robust distributed optimization (Marchand et al., 2022).
High-Dimensional Variational Inference
In high-dimensional Bayesian inference, parameterwise Yeo–Johnson transformations are used to induce more Gaussian-like posterior margins prior to variational approximation. This approach—often referred to as a “copula variational approximation with transformed marginals”—enables more accurate capture of skewness and higher-order moments, improving variational lower bounds and consistency with gold standard MCMC (Smith et al., 2019).
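A toy sketch of the marginal mechanics: draw a Gaussian vector and push each coordinate through an inverse Yeo–Johnson map with its own $\lambda$. The per-parameter values here are hypothetical; in Smith et al. (2019) they are estimated jointly with the copula parameters.

```python
import numpy as np

def yeo_johnson_inverse(z, lam):
    """Invert the Yeo-Johnson map for a scalar z, pushing a Gaussian
    variational draw back to the parameter scale."""
    if z >= 0:
        if abs(lam) < 1e-8:                      # inverse of log(x + 1)
            return np.expm1(z)
        return (z * lam + 1.0) ** (1.0 / lam) - 1.0
    if abs(lam - 2.0) < 1e-8:                    # inverse of -log(1 - x)
        return -np.expm1(-z)
    return 1.0 - (1.0 - (2.0 - lam) * z) ** (1.0 / (2.0 - lam))

# Reparameterized draw: Gaussian in the transformed space, with a separate
# lambda per parameter (hypothetical values) inducing skewed marginals.
rng = np.random.default_rng(0)
mu, sigma = np.zeros(3), np.ones(3)
lams = [0.5, 1.0, 1.8]
z = mu + sigma * rng.standard_normal(3)
theta = np.array([yeo_johnson_inverse(zi, li) for zi, li in zip(z, lams)])
```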
Nonparametric and Semiparametric Modeling
In semiparametric boundary regression, the Yeo–Johnson transformation enables models of the form $\Lambda_{\lambda}(Y_i) = g(x_i) + \varepsilon_i$, in which the transformed errors $\varepsilon_i$ become one-sided and decorrelated from the covariates. Minimum-distance estimation, tailored for this structure, is shown to be uniformly consistent under mild regularity conditions for both random and fixed design points (Neumeyer et al., 2018).
Signal and Image Processing
A specialized application appears in SAR image despeckling, where multiplicative gamma-distributed noise is first made additive via log transformation, then Gaussianized via Yeo–Johnson. Sparse representation-based denoising methods require this Gaussianity for statistical validity; auxiliary matrices encoding local noise and adaptive sparsity priors further enhance performance (Hu et al., 2024).
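A schematic of this pipeline on synthetic data (the image, the number of looks, and the downstream denoiser are stand-ins, not the method of Hu et al.):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.uniform(10.0, 100.0, size=(64, 64))            # stand-in intensity image
speckle = rng.gamma(shape=4.0, scale=0.25, size=clean.shape)  # unit-mean, 4 looks
observed = clean * speckle                                  # multiplicative noise

log_img = np.log(observed)                                  # noise becomes additive
gauss, lam = stats.yeojohnson(log_img.ravel())              # then Gaussianized
gauss_img = gauss.reshape(observed.shape)
# ... apply a Gaussian-noise denoiser here, then invert the YJ map and exponentiate ...
```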
Time Series and Predictive Modeling
For predictive modeling of global mean temperature or similar time series, the Yeo–Johnson transform is applied alongside differencing and scaling, directly accommodating negative and skewed anomaly values. Empirical studies report marked reductions in test RMSE for both simple regressors and more complex learners (e.g., LightGBM), demonstrating its practical efficacy (Niyogi et al., 2023).
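A minimal preprocessing sketch along these lines (synthetic series; not the exact pipeline of Niyogi et al.):

```python
import numpy as np
from scipy import stats

# Synthetic trend-like series standing in for a temperature-anomaly record.
y = np.cumsum(np.random.default_rng(1).normal(0.01, 0.1, size=500))
dy = np.diff(y)                               # differencing removes the trend
dy_t, lam = stats.yeojohnson(dy)              # handles negative, skewed increments
dy_s = (dy_t - dy_t.mean()) / dy_t.std()      # scaling for the downstream learner
```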
6. Implementation Strategies and Algorithmic Considerations
Statistical and computational frameworks for fitting the Yeo–Johnson transformation parameter include:
- Random-walk Metropolis–Hastings for Bayesian estimation of $\lambda$ within each candidate family, especially when integrated into joint model selection (Charitidou et al., 2013).
- Exponential Search (ExpYJ) Algorithm: Leverages provable convexity of the negative log-likelihood for numerically stable, efficient, and federated optimization of $\lambda$; see the sketch after this list. In Secure Multiparty Computation (SMC), only aggregate statistics (e.g., sums of transformed variables) are disclosed, preserving client privacy and providing pooled-equivalent results (Marchand et al., 2022).
- Closed-form Jacobian Calculation is available for all cases, which simplifies likelihood computation and chain rule application for gradient-based or reparameterization-based optimization (Smith et al., 2019).
- Minimum Distance Approaches rely on kernel smoothing, empirical process theory, and robust distance metrics for nonparametric estimation (Neumeyer et al., 2018).
- Robustified ML or quantile-matching strategies are important in contaminated or heavy-tailed settings, often using rectification or outlier-immune loss functions (Raymaekers et al., 2020).
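As a single-machine illustration of the convexity argument behind ExpYJ (referenced from the list above): since the negative profile log-likelihood is strictly convex in $\lambda$, any bounded 1-D convex search finds the unique minimizer. Bounded scalar minimization stands in here for the federated exponential search of Marchand et al.

```python
import numpy as np
from scipy import stats, optimize

# Strict convexity of the negative Yeo-Johnson profile log-likelihood in
# lambda guarantees a unique minimizer, so a simple bounded search suffices.
# In SecureFedYJ, each search step would exchange only SMC-aggregated
# statistics of the transformed data across silos, never raw values.
x = np.random.default_rng(3).normal(2.0, 1.0, size=1000) ** 3   # skewed sample
res = optimize.minimize_scalar(lambda lam: -stats.yeojohnson_llf(lam, x),
                               bounds=(-4.0, 6.0), method='bounded')
lam_star = res.x
```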
Performance considerations dictate algorithm and prior choice. In federated learning or privacy-critical environments, communication costs and SMC protocols determine feasibility. For high-dimensional inference, transformation parameterization and closed-form derivatives are preferred for scalability.
7. Comparative Performance and Limitations
No single transformation family (Yeo–Johnson, Box–Cox, Modulus, Dual, etc.) is universally optimal (Charitidou et al., 2013). Empirical comparisons indicate:
- Yeo–Johnson is preferred when the data include negative values, particularly when accompanied by moderate skewness.
- Box–Cox yields best normalization for strictly positive, highly skewed data.
- Modulus is effective for heavy-tailed distributions.
- For data close to normal, the identity transformation is selected via model probabilities.
Robust estimation procedures and federated optimization (as in SecureFedYJ) address practical limitations involving outliers, privacy, or distributed data. Limitations may arise in data with extreme tail behavior, complex dependencies, or when strict invertibility across the entire support is not maintained.
The Yeo–Johnson transformation provides a mathematically principled, computationally efficient, and robust method for data Gaussianization, with broad applicability across contemporary statistical modeling, machine learning preprocessing, robust estimation, and privacy-preserving inference (Charitidou et al., 2013, Neumeyer et al., 2018, Smith et al., 2019, Raymaekers et al., 2020, Marchand et al., 2022, Niyogi et al., 2023, Hu et al., 2024). Its versatility and adaptiveness to the complete real line account for its wide adoption and continued development in both theoretical and applied contexts.