ICE: Bias-Corrected Parameter Estimation

Updated 11 May 2026

The Information-Corrected Estimator (ICE) adjusts the empirical log-likelihood with a data-driven trace correction that counteracts the KL divergence generalization bias.
It reduces the order of that bias from $O(n^{-1})$ to $O(n^{-3/2})$ by canceling the leading bias term, with no hyper-parameter tuning.
ICE applies to any model with a twice-differentiable likelihood and has shown improvements in both classical parametric models and neural network architectures.

The Information-Corrected Estimator (ICE) is a parameter estimation procedure designed to reduce generalization error in supervised machine learning. It explicitly corrects the Kullback–Leibler (KL) divergence generalization bias intrinsic to standard maximum likelihood estimation (MLE), a bias that $L_2$ (ridge) regularization addresses only indirectly. ICE augments the loss function with a data-driven trace correction, requires no tuning or hyper-parameter search, and applies to any model with a twice-differentiable likelihood. Empirical studies and a distributed implementation in a neural network framework report out-of-sample improvements over unregularized MLE and ridge, especially in moderate sample regimes and complex model classes (Dixon et al., 2018; Ward, 2020).

1. Formal Definition and Objective

ICE modifies the standard empirical log-likelihood objective by introducing a finite-sample correction based on Fisher information and the observed Hessian. Given IID data $\{x_i\}_{i=1}^n$ and a parametric model $g(x \mid \theta)$, the average log-likelihood is

$\ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log g(x_i\mid\theta).$

Define the empirical Fisher information and observed Hessian as

$\hat I(\theta) = \frac{1}{n}\sum_{i=1}^n [\nabla_\theta \log g(x_i\mid\theta)][\nabla_\theta \log g(x_i\mid\theta)]^\top,$

$\hat J(\theta) = -\frac{1}{n} \sum_{i=1}^n \nabla^2_\theta \log g(x_i\mid\theta).$

The ICE objective is

$-\ell^*(\theta) = -\ell(\theta) + \frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr),$

with the ICE estimator

$\hat\theta_{\rm ICE} = \arg\min_\theta \{-\ell^*(\theta)\}.$

No additional penalty or tuning parameter is introduced.
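The definitions above can be evaluated directly for a small model. The following is a minimal sketch for a Gaussian model parametrized as $\theta = (\mu, \sigma)$; the parametrization, the function name `ice_objective`, and the sample data are illustrative assumptions, not part of the cited work.

```python
import math

def ice_objective(xs, mu, sigma):
    """Return (-l, correction, -l*) for a Gaussian model with theta = (mu, sigma)."""
    n = len(xs)
    nll = 0.0
    i11 = i12 = i22 = 0.0  # entries of the empirical Fisher information I_hat
    j11 = j12 = j22 = 0.0  # entries of the observed Hessian J_hat
    for x in xs:
        r = x - mu
        nll += math.log(sigma) + 0.5 * math.log(2.0 * math.pi) + r * r / (2.0 * sigma ** 2)
        # Per-sample score of log g with respect to (mu, sigma).
        s_mu = r / sigma ** 2
        s_sigma = -1.0 / sigma + r * r / sigma ** 3
        i11 += s_mu * s_mu
        i12 += s_mu * s_sigma
        i22 += s_sigma * s_sigma
        # Negative per-sample second derivatives of log g.
        j11 += 1.0 / sigma ** 2
        j12 += 2.0 * r / sigma ** 3
        j22 += -1.0 / sigma ** 2 + 3.0 * r * r / sigma ** 4
    nll /= n
    i11, i12, i22 = i11 / n, i12 / n, i22 / n
    j11, j12, j22 = j11 / n, j12 / n, j22 / n
    # tr(I J^{-1}) for symmetric 2x2 matrices, using the closed-form inverse of J.
    det = j11 * j22 - j12 * j12
    trace = (i11 * j22 - 2.0 * i12 * j12 + i22 * j11) / det
    correction = trace / n
    return nll, correction, nll + correction

# Illustrative data (assumed), evaluated at the Gaussian MLE.
xs = [1.2, 0.7, 2.9, 1.8, 0.4, 2.2, 1.5, 3.1]
mu_hat = sum(xs) / len(xs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))
nll, correction, ice_loss = ice_objective(xs, mu_hat, sigma_hat)
print(nll, correction, ice_loss)
```

At the MLE the off-diagonal Hessian entry vanishes, so the trace reduces to $1 + (\hat m_4/\hat\sigma^4 - 1)/2$ with $\hat m_4$ the sample fourth central moment; for exactly Gaussian-shaped data this equals 2, the number of parameters.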

2. Theoretical Motivation: KL Divergence and Bias Correction

ICE is derived to target the out-of-sample KL divergence from the true data law $f(x)$ to the model $g(x \mid \theta)$:

$D_{\rm KL}(f \,\|\, g_\theta) = \mathbb{E}_{x\sim f}\bigl[\log f(x) - \log g(x\mid\theta)\bigr].$

MLE underestimates out-of-sample error because the empirical log-likelihood carries an optimistic $O(n^{-1})$ bias. Under regularity and smoothness conditions (including White’s conditions), a second-order expansion shows that, for the fitted parameter $\hat\theta$,

$\mathbb{E}\bigl[-\ell(\hat\theta)\bigr] = \mathbb{E}\bigl[-\mathbb{E}_{x\sim f}\log g(x\mid\hat\theta)\bigr] - \frac{1}{n}\mathrm{tr}\bigl(I\,J^{-1}\bigr) + O(n^{-3/2}),$

where $I$ and $J$ are the population Fisher information and Hessian. By adding the empirical counterpart $\frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr)$ to the objective, ICE cancels the leading bias term, reducing the order of the generalization bias from $O(n^{-1})$ to $O(n^{-3/2})$ (Dixon et al., 2018).
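As a sanity check on the trace correction of Section 1, consider the well-specified case, assumed here purely for illustration, in which $f(x) = g(x \mid \theta_0)$ for some $\theta_0$. The information matrix equality then gives $I(\theta_0) = J(\theta_0)$, so with $p$ denoting the number of parameters (notation introduced for this example),

$\frac{1}{n}\mathrm{tr}\bigl(I(\theta_0)\,J(\theta_0)^{-1}\bigr) = \frac{1}{n}\mathrm{tr}\bigl(\mathrm{Id}_p\bigr) = \frac{p}{n}.$

In this special case the correction is a parameter count divided by the sample size, which is the per-observation form of the AIC penalty. Under misspecification $I \neq J$ in general, and the trace term is the same quantity that appears in the Takeuchi information criterion.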

3. Theoretical Guarantees

The central properties of ICE are as follows:

Asymptotic Normality: Under standard regularity conditions and for the population minimizer $\theta_0$,

$\sqrt{n}\,\bigl(\hat\theta_{\rm ICE} - \theta_0\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\; (J^*)^{-1} I^* (J^*)^{-1}\bigr),$

where $I^*$ and $J^*$ are the “corrected” information and Hessian tensors associated with $\ell^*$.

Bias Reduction: The expectation of the ICE-corrected loss at the minimizer satisfies

$\mathbb{E}\bigl[-\ell^*(\hat\theta_{\rm ICE})\bigr] = \mathbb{E}\bigl[-\mathbb{E}_{x\sim f}\log g(x\mid\hat\theta_{\rm ICE})\bigr] + O(n^{-3/2}),$

while ordinary MLE has an $O(n^{-1})$ discrepancy. Thus, the bias of ICE is of strictly smaller order in $n$ (Dixon et al., 2018).

4. Algorithmic Implementation and Approximations

Optimization proceeds by iterative minimization of $-\ell^*(\theta)$. At each iteration:

Compute per-sample gradients and Hessians (or their diagonals).
Form the empirical estimators $\hat I(\theta)$ and $\hat J(\theta)$.
Compute the correction term $\frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr)$.
Update parameters via a quasi-Newton or similar gradient-based solver.
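The loop above can be sketched on a one-parameter model where every quantity is available in closed form. The example below assumes an exponential model $g(x \mid \theta) = \theta e^{-\theta x}$ and uses an exact Newton step in place of a quasi-Newton solver; the model choice, the function name `ice_fit_exponential`, and the data are assumptions made for illustration.

```python
def ice_fit_exponential(xs, tol=1e-12, max_iter=100):
    """Minimize the ICE objective for g(x | theta) = theta * exp(-theta * x)."""
    n = len(xs)
    m1 = sum(xs) / n                 # sample mean
    m2 = sum(x * x for x in xs) / n  # sample second moment
    theta = 1.0 / m1                 # start from the MLE
    for _ in range(max_iter):
        # Per-sample score is 1/theta - x and per-sample Hessian is -1/theta^2, so
        # I_hat = mean((1/theta - x)^2), J_hat = 1/theta^2, and the correction is
        # tr(I_hat * J_hat^{-1}) / n = (1 - 2*theta*m1 + theta^2*m2) / n.
        grad = -1.0 / theta + m1 - (2.0 / n) * (m1 - theta * m2)
        hess = 1.0 / theta ** 2 + (2.0 / n) * m2
        step = grad / hess
        theta -= step                # Newton update on the corrected objective
        if abs(step) < tol:
            break
    return theta

# Illustrative data (assumed).
xs = [0.3, 1.1, 0.2, 2.5, 0.8, 0.05, 1.7, 0.6, 0.9, 3.2]
theta_mle = len(xs) / sum(xs)
theta_ice = ice_fit_exponential(xs)
print(theta_mle, theta_ice)
```

For this model the corrected objective is convex on $\theta > 0$, and its derivative at the MLE is positive whenever the sample variance is positive, so the ICE estimate of the rate lies below the MLE.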

Complexity per iteration depends on the required matrix operations, with $n$ samples and $p$ parameters:

Step	Time Complexity	Memory
Forming $\hat I(\theta)$, $\hat J(\theta)$	$O(np^2)$	$O(p^2)$
Inverting $\hat J(\theta)$	$O(p^3)$	$O(p^2)$
Diagonal approximation to $\hat J(\theta)$	$O(np)$	$O(p)$

Practical implementations, such as in multilayer perceptron (MLP) models, use a diagonal approximation $\hat J(\theta) \approx \mathrm{diag}\bigl(\hat J_{11}(\theta), \dots, \hat J_{pp}(\theta)\bigr)$, where $p$ is the number of parameters, to obtain

$\frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr) \approx \frac{1}{n}\sum_{j=1}^{p} \frac{\hat I_{jj}(\theta)}{\hat J_{jj}(\theta)},$

which matches the computational cost of vanilla backpropagation up to a small multiplicative factor (Ward, 2020).
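A minimal sketch of the diagonal correction, written as the sum over parameters of $\hat I_{jj}/\hat J_{jj}$ divided by $n$. It assumes per-sample scores and per-sample diagonal second derivatives are already available as nested lists; the function name and the toy numbers are hypothetical.

```python
def diagonal_ice_correction(scores, hess_diags):
    """Diagonal approximation of tr(I_hat * J_hat^{-1}) / n.

    scores[i][j]     = d/dtheta_j log g(x_i | theta)
    hess_diags[i][j] = d^2/dtheta_j^2 log g(x_i | theta)
    """
    n = len(scores)
    p = len(scores[0])
    # Diagonal of the empirical Fisher information: mean squared score per parameter.
    i_diag = [sum(s[j] ** 2 for s in scores) / n for j in range(p)]
    # Diagonal of the observed Hessian: negative mean second derivative per parameter.
    j_diag = [-sum(h[j] for h in hess_diags) / n for j in range(p)]
    return sum(i_diag[j] / j_diag[j] for j in range(p)) / n

# Toy inputs (assumed): four samples, two parameters.
scores = [[1.0, -2.0], [-1.0, 0.0], [0.5, 2.0], [-0.5, 0.0]]
hess_diags = [[-2.0, -1.0]] * 4
diag_correction = diagonal_ice_correction(scores, hess_diags)
print(diag_correction)
```

Only $O(p)$ accumulators are kept, which is why this variant scales to models where the full $p \times p$ matrices would be impractical. No guard against small or negative diagonal entries is included in this sketch.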

Numerical stabilization of the diagonal inverse $\hat J_{jj}(\theta)^{-1}$ is required to avoid division by near-zero or negative diagonal entries; this is accomplished by truncating small scores and reweighting, as detailed in Ward (2020).
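The truncation-and-reweighting scheme cited above is not reproduced here. As a much simpler stand-in, the sketch below skips any parameter whose diagonal curvature falls below a small positive floor; the floor value, the skipping policy, and the function name are all assumptions made for illustration.

```python
def guarded_diagonal_correction(i_diag, j_diag, n, floor=1e-6):
    """Sum I_jj / J_jj over parameters with J_jj >= floor, then divide by n.

    Parameters whose curvature is negative or near zero contribute nothing,
    which avoids division by tiny or negative values (an assumed policy).
    """
    total = 0.0
    for i_jj, j_jj in zip(i_diag, j_diag):
        if j_jj < floor:
            continue  # unreliable curvature estimate: drop this term
        total += i_jj / j_jj
    return total / n

# Toy inputs (assumed): the third entry is negative and the fourth is near zero.
i_diag = [0.625, 2.0, 1.0, 0.3]
j_diag = [2.0, 1.0, -0.5, 1e-9]
guarded = guarded_diagonal_correction(i_diag, j_diag, n=4)
print(guarded)
```

Skipping a term biases the correction downward for the affected parameters, so a guard like this trades some of the bias correction for numerical safety.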

5. Empirical Results and Applications

ICE has been empirically validated in both classical statistical models and large-scale neural networks:

Parametric models (Dixon et al., 2018):

Gaussian location-scale: Across the sample sizes $n$ tested, ICE reduced out-of-sample KL divergence most at the smallest $n$ and remained ahead of MLE at the largest $n$. $L_2$ regularization was neutral or harmful.
Friedman nonlinear regression: ICE reduced KL divergence for all $n$ tested, while $L_2$ regularization was only marginally effective.
Synthetic logistic regression: ICE reduced generalization KL when $n$ was small; this advantage attenuated as $n$ increased.

MLPs and distributed training (Ward, 2020):

Implemented in Apache Spark’s MultilayerPerceptronClassifier via a boolean switch useICE.
On the Freddie Mac Single-Family Loan-Level dataset, across a range of architecture sizes (36, 78, 159, 291 parameters), ICE significantly reduced the train/test generalization gap at smaller training-set sizes $n$, with statistically significant improvements in test cross-entropy loss. For large $n$, both ICE and MLE converged to similar test error, but ICE consistently showed less variability.
ICE added overhead to the loss/gradient computation at each iteration, but required fewer optimization steps and thus had similar or lower total runtime.
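A comparison of this kind can be set up in a few lines for a toy model. The sketch below is not a reproduction of the cited experiments: it assumes an exponential model with true rate `theta0`, fits MLE and ICE on repeated small samples, and averages the exact out-of-sample KL divergence. All names, the seed, and the sample sizes are illustrative assumptions.

```python
import math
import random

def fit_mle(xs):
    """MLE of the rate of an exponential distribution."""
    return len(xs) / sum(xs)

def fit_ice(xs):
    """ICE estimate of the rate, from the stationarity condition of
    -log(theta) + theta*m1 + (1 - 2*theta*m1 + theta^2*m2) / n,
    which rearranges to a*theta^2 + b*theta - 1 = 0."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    a = 2.0 * m2 / n
    b = m1 * (1.0 - 2.0 / n)
    return (-b + math.sqrt(b * b + 4.0 * a)) / (2.0 * a)  # positive root

def kl_to_truth(theta, theta0):
    """KL( Exp(theta0) || Exp(theta) ) in closed form."""
    return math.log(theta0 / theta) + theta / theta0 - 1.0

def simulate(n=10, trials=2000, theta0=1.0, seed=0):
    """Average out-of-sample KL of MLE and ICE over repeated samples of size n."""
    rng = random.Random(seed)
    kl_mle = kl_ice = 0.0
    for _ in range(trials):
        xs = [rng.expovariate(theta0) for _ in range(n)]
        kl_mle += kl_to_truth(fit_mle(xs), theta0)
        kl_ice += kl_to_truth(fit_ice(xs), theta0)
    return kl_mle / trials, kl_ice / trials

avg_kl_mle, avg_kl_ice = simulate()
print(avg_kl_mle, avg_kl_ice)
```

Increasing `n` in `simulate` gives a quick way to see how any gap between the two averages changes with sample size in this toy setting.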

6. Comparison with MLE and Ridge Regularization

The distinguishing characteristics of ICE, MLE, and $L_2$ regularization are outlined as follows:

Method	Objective	Bias Order	Hyper-parameters	Applicability	Overfitting control
MLE	$-\ell(\theta)$	$O(n^{-1})$	None	Any parametric model	Poor at small $n$, overfits
Ridge ($L_2$)	$-\ell(\theta) + \lambda\|\theta\|_2^2$	$\lambda$-dependent	$\lambda$	Linear, small-moderate $p$	Needs tuning, unreliable in nonlinear settings
ICE	$-\ell(\theta) + \frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr)$	$O(n^{-3/2})$	None	Any twice-differentiable model	Data-driven, no tuning required, robust at small-moderate $n$

MLE is optimistically biased, particularly in highly parameterized or small-sample regimes. Ridge regularization requires hyper-parameter selection, commonly via cross-validation, and is less robust for non-Gaussian or highly nonlinear models. ICE directly removes the $O(n^{-1})$ generalization bias, achieving $O(n^{-3/2})$ bias without requiring model-specific hyper-parameters, and derives its correction entirely from the data.

7. Practical Deployment and Usage Guidelines

ICE is suitable for scenarios where preventing overfitting is paramount and hyper-parameter-free operation is desired:

Particularly advantageous for small to moderate sample sizes $n$ and models with many parameters.
In distributed frameworks (e.g., Spark ML, via setUseICE(true)), ICE is drop-in and adds one additional per-iteration pass to accumulate diagonal Hessians.
In custom implementations, a single backward pass should accumulate both gradient and diagonal Hessian per parameter, and apply the ICE correction within the loss function.
For extremely small $n$, the diagonal approximation may be restrictive; block-diagonal or low-rank alternatives can be explored, at increased computational cost.
ICE integrates less naturally with explicit regularization such as dropout or $L_2$ penalties due to potential double penalization effects.
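For the custom-implementation guideline, the following sketch accumulates the gradient and both diagonals in one loop over the data for logistic regression. The model choice, the function name `logistic_ice_pass`, and the toy data are assumptions; the returned gradient is that of the uncorrected negative log-likelihood only, and differentiating the correction term is left out of this sketch.

```python
import math

def logistic_ice_pass(X, y, w):
    """One pass over the data for logistic regression with weights w.

    Returns (ice_loss, grad_nll, i_diag, j_diag).
    """
    n, p = len(X), len(w)
    nll = 0.0
    grad = [0.0] * p    # gradient of the average negative log-likelihood
    i_diag = [0.0] * p  # diagonal of the empirical Fisher information
    j_diag = [0.0] * p  # diagonal of the observed Hessian
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        prob = 1.0 / (1.0 + math.exp(-z))
        nll -= yi * math.log(prob) + (1 - yi) * math.log(1.0 - prob)
        for j in range(p):
            score_j = (yi - prob) * xi[j]  # d/dw_j of the per-sample log-likelihood
            grad[j] -= score_j / n
            i_diag[j] += score_j ** 2 / n
            j_diag[j] += prob * (1.0 - prob) * xi[j] ** 2 / n
    # Diagonal ICE correction; terms with zero curvature are skipped (assumed guard).
    correction = sum(i / j for i, j in zip(i_diag, j_diag) if j > 0.0) / n
    return nll / n + correction, grad, i_diag, j_diag

# Toy data (assumed): an intercept column plus one feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
ice_loss, grad_nll, i_diag, j_diag = logistic_ice_pass(X, y, [0.0, 0.0])
print(ice_loss, grad_nll)
```

At zero weights every predicted probability is 0.5, so each ratio $\hat I_{jj}/\hat J_{jj}$ equals 1 and the corrected loss is $\log 2 + p/n$ with $p = 2$ and $n = 4$.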

ICE is reported to be statistically robust in real-world distributed neural architectures, requires minimal code changes, and incurs negligible memory overhead. It is presented as a practical alternative to unregularized likelihood optimization wherever model selection or out-of-sample accuracy is critical (Dixon et al., 2018; Ward, 2020).


References

Dixon et al. (2018). Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method.

Ward (2020). Implementing the ICE Estimator in Multilayer Perceptron Classifiers.
