ICE: Bias-Corrected Parameter Estimation

Updated 11 May 2026
  • Information-Corrected Estimator (ICE) is a method that adjusts the empirical log-likelihood with a data-driven trace correction to counteract KL divergence bias.
  • It reduces generalization error from O(1/n) to O(n^(-3/2)) by effectively canceling the leading bias term without the need for hyperparameter tuning.
  • ICE applies to any twice-differentiable model and has demonstrated robust improvements in both classical parametric models and modern neural network architectures.

Information-Corrected Estimator (ICE) is a parameter estimation procedure designed to reduce generalization error in supervised machine learning by explicitly correcting for the Kullback–Leibler (KL) divergence generalization bias intrinsic to standard maximum likelihood estimation (MLE) and $L_2$ (ridge) regularization. The ICE estimator augments the loss function with a data-driven trace correction, requires no tuning or hyper-parameter search, and applies to any model with a twice-differentiable likelihood. Empirical studies and distributed implementations in modern neural network frameworks demonstrate robust out-of-sample performance improvements over unregularized MLE and ridge, especially in moderate sample regimes and complex model classes (Dixon et al., 2018, Ward, 2020).

1. Formal Definition and Objective

ICE modifies the standard empirical log-likelihood objective by introducing a finite-sample correction based on Fisher information and the observed Hessian. Given IID data $\{x_i\}_{i=1}^n$ and a parametric model $g(x \mid \theta)$, the average log-likelihood is

$$\ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log g(x_i \mid \theta).$$

Define the empirical Fisher information and observed Hessian as

$$\hat I(\theta) = \frac{1}{n}\sum_{i=1}^n [\nabla_\theta \log g(x_i \mid \theta)][\nabla_\theta \log g(x_i \mid \theta)]^\top,$$

$$\hat J(\theta) = -\frac{1}{n} \sum_{i=1}^n \nabla^2_\theta \log g(x_i \mid \theta).$$

The ICE objective is

$$-\ell^*(\theta) = -\ell(\theta) + \frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr),$$

with the ICE estimator

$$\hat\theta_{\rm ICE} = \arg\min_\theta \{-\ell^*(\theta)\}.$$

No additional penalty or tuning parameter is introduced.
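To make the definitions concrete, here is a minimal NumPy sketch of the ICE objective for logistic regression. The choice of model, the function name `ice_objective`, and the synthetic data are illustrative assumptions and are not taken from the cited papers; the per-sample scores $(y_i - p_i)x_i$ and Hessians $-p_i(1-p_i)\,x_i x_i^\top$ are the standard logistic-regression expressions.

```python
import numpy as np

def ice_objective(theta, X, y):
    """Return (-ell, correction, -ell_star) for logistic regression.

    Hypothetical helper: X has shape (n, p), y holds 0/1 labels.
    """
    n = X.shape[0]
    z = X @ theta
    p = 1.0 / (1.0 + np.exp(-z))
    # Average log-likelihood, using log g = y*z - log(1 + exp(z)).
    ell = np.mean(y * z - np.logaddexp(0.0, z))
    # Per-sample score vectors, shape (n, p).
    scores = (y - p)[:, None] * X
    # Empirical Fisher information: average outer product of scores.
    I_hat = scores.T @ scores / n
    # Observed Hessian of the negative log-likelihood.
    J_hat = (X * (p * (1.0 - p))[:, None]).T @ X / n
    # tr(I_hat J_hat^{-1}) = tr(J_hat^{-1} I_hat); solve avoids an explicit inverse.
    correction = np.trace(np.linalg.solve(J_hat, I_hat)) / n
    return -ell, correction, -ell + correction

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.5).astype(float)
nll, corr, obj = ice_objective(np.zeros(3), X, y)
# At theta = 0 every p_i is 0.5, so I_hat equals J_hat and the
# correction is exactly p/n = 3/200, while -ell equals log(2).
```

Using `np.linalg.solve` rather than forming $\hat J^{-1}$ explicitly is a design choice of this sketch for numerical stability; it gives the same trace.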

2. Theoretical Motivation: KL Divergence and Bias Correction

ICE is derived to target the out-of-sample KL divergence from the true data law $f(x)$ to $g(x \mid \theta)$:

$$D_{\mathrm{KL}}(f \,\|\, g_\theta) = \int f(x) \log\frac{f(x)}{g(x \mid \theta)}\,dx.$$

MLE underestimates out-of-sample error due to an optimistic $O(n^{-1})$ bias in the empirical log-likelihood. Under regularity and smoothness conditions (including White’s conditions), a second-order expansion shows that

$$\mathbb{E}\bigl[\ell(\hat\theta)\bigr] - \mathbb{E}\bigl[\mathbb{E}_{x \sim f} \log g(x \mid \hat\theta)\bigr] = \frac{1}{n}\mathrm{tr}\bigl(I J^{-1}\bigr) + O(n^{-3/2}),$$

where $\hat\theta$ is the MLE and $I$ and $J$ are the population Fisher information and Hessian. By adding the empirical counterpart $\frac{1}{n}\mathrm{tr}(\hat I \hat J^{-1})$ to the objective, ICE cancels the leading bias term, reducing the generalization error’s order from $O(n^{-1})$ to $O(n^{-3/2})$ (Dixon et al., 2018).
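As a sanity check on the size of this bias, consider a worked case that is not taken from the cited papers: a unit-variance Gaussian with unknown mean, $g(x \mid \theta) = \mathcal{N}(\theta, 1)$, fitted by MLE ($\hat\theta = \bar x$) to data that really are $\mathcal{N}(\theta_0, 1)$. Here there is one parameter and the model is well specified, so the Fisher information equals the Hessian and $\mathrm{tr}(I J^{-1}) = 1$. In this special case the optimism of the in-sample log-likelihood can be computed exactly:

```latex
\begin{aligned}
\mathbb{E}\bigl[\ell(\hat\theta)\bigr]
  &= -\tfrac12\log(2\pi) - \tfrac12\,\mathbb{E}\Bigl[\tfrac1n\sum_{i=1}^n (x_i-\bar x)^2\Bigr]
   = -\tfrac12\log(2\pi) - \tfrac12\cdot\frac{n-1}{n},\\
\mathbb{E}\bigl[\mathbb{E}_{x\sim f}\log g(x\mid\hat\theta)\bigr]
  &= -\tfrac12\log(2\pi) - \tfrac12\,\mathbb{E}\bigl[1 + (\bar x-\theta_0)^2\bigr]
   = -\tfrac12\log(2\pi) - \tfrac12\cdot\frac{n+1}{n},\\
\mathbb{E}\bigl[\ell(\hat\theta)\bigr] - \mathbb{E}\bigl[\mathbb{E}_{x\sim f}\log g(x\mid\hat\theta)\bigr]
  &= \frac{1}{n} = \frac{1}{n}\,\mathrm{tr}\bigl(I J^{-1}\bigr).
\end{aligned}
```

The in-sample fit looks better than the out-of-sample fit by exactly $1/n$, which is the quantity the trace correction is meant to offset.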

3. Theoretical Guarantees

The central properties of ICE are as follows:

  • Asymptotic Normality: Under standard regularity conditions and for the population minimizer $\theta_0$,

$$\sqrt{n}\,\bigl(\hat\theta_{\rm ICE} - \theta_0\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\; (J^*)^{-1} I^* (J^*)^{-1}\bigr),$$

where $I^*, J^*$ are the “corrected” information and Hessian tensors associated with $\ell^*$.

  • Bias Reduction: The expectation of the ICE-corrected loss at the minimizer satisfies

$$\mathbb{E}\bigl[-\ell^*(\hat\theta_{\rm ICE})\bigr] = \mathbb{E}\bigl[-\mathbb{E}_{x \sim f}\log g(x \mid \hat\theta_{\rm ICE})\bigr] + O(n^{-3/2}),$$

while ordinary MLE has an $O(n^{-1})$ discrepancy. Thus, ICE achieves a bias of strictly smaller order in $n$ (Dixon et al., 2018).

4. Algorithmic Implementation and Approximations

Optimization proceeds by iterative minimization of $-\ell^*(\theta)$. At each iteration:

  • Compute per-sample gradients and Hessians (or their diagonals).
  • Form empirical estimators $\hat I(\theta)$ and $\hat J(\theta)$.
  • Compute the correction term $\frac{1}{n}\mathrm{tr}\bigl(\hat I(\theta)\,\hat J(\theta)^{-1}\bigr)$.
  • Update parameters via a quasi-Newton or similar gradient-based solver.
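The steps above can be sketched end to end on a small example. The following is a minimal illustration under stated assumptions, not the implementation from the cited papers: the model is a Gaussian location-scale family parameterized by $(\mu, \log\sigma)$, the solver is plain gradient descent with a backtracking line search standing in for a quasi-Newton method, and gradients of the full objective are taken by central finite differences. The names `ice_loss`, `num_grad`, and `fit_ice` are hypothetical.

```python
import numpy as np

def ice_loss(theta, x):
    """ICE objective -ell*(theta) for a Gaussian with theta = (mu, log sigma)."""
    mu, s = theta
    n = x.size
    r = x - mu
    w = np.exp(-2.0 * s)                      # 1 / sigma^2
    ell = np.mean(-0.5 * np.log(2 * np.pi) - s - 0.5 * w * r**2)
    # Step 1: per-sample gradients (n, 2) and Hessians (n, 2, 2).
    grads = np.stack([w * r, w * r**2 - 1.0], axis=1)
    hess = np.empty((n, 2, 2))
    hess[:, 0, 0] = -w
    hess[:, 0, 1] = hess[:, 1, 0] = -2.0 * w * r
    hess[:, 1, 1] = -2.0 * w * r**2
    # Step 2: empirical Fisher information and observed Hessian.
    I_hat = grads.T @ grads / n
    J_hat = -hess.mean(axis=0)
    # Step 3: trace correction.
    corr = np.trace(np.linalg.solve(J_hat, I_hat)) / n
    return -ell + corr

def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def fit_ice(x, iters=200):
    """Step 4: gradient descent with backtracking, started at the MLE."""
    theta = np.array([x.mean(), np.log(x.std())])
    f = lambda t: ice_loss(t, x)
    for _ in range(iters):
        g = num_grad(f, theta)
        if np.linalg.norm(g) < 1e-8:
            break
        step = 1.0
        # Halve the step until the Armijo sufficient-decrease condition holds.
        while f(theta - step * g) > f(theta) - 1e-4 * step * g @ g and step > 1e-8:
            step *= 0.5
        theta = theta - step * g
    return theta

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=50)
theta_mle = np.array([x.mean(), np.log(x.std())])
theta_ice = fit_ice(x)
```

Starting from the MLE keeps $\hat J$ positive definite in this example; far from the MLE the observed Hessian of this model can become indefinite, in which case the trace term is no longer a meaningful penalty.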

Complexity per iteration depends on the required matrix operations, with $p$ the number of parameters:

| Step | Time complexity | Memory |
| --- | --- | --- |
| Forming $\hat I$, $\hat J$ | $O(np^2)$ | $O(p^2)$ |
| Inverting $\hat J$ | $O(p^3)$ | $O(p^2)$ |
| Diagonal approximation to $\hat J$ | $O(np)$ | $O(p)$ |

Practical implementations, such as in multilayer perceptron (MLP) models, use a diagonal approximation $\hat J \approx \mathrm{diag}(\hat J_{11}, \dots, \hat J_{pp})$ to obtain

$$\mathrm{tr}\bigl(\hat I \hat J^{-1}\bigr) \approx \sum_{j=1}^{p} \frac{\hat I_{jj}}{\hat J_{jj}},$$

which matches the computational cost of vanilla backpropagation up to a small multiplicative factor (Ward, 2020).
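A minimal NumPy sketch of the diagonal correction follows. It assumes the per-sample gradients and the per-sample Hessian diagonals are already available as arrays of shape `(n, p)`; how they are obtained from a network is outside this sketch, and the function name `diag_ice_correction` is hypothetical.

```python
import numpy as np

def diag_ice_correction(grads, hess_diag):
    """Diagonal approximation of (1/n) * tr(I_hat J_hat^{-1}).

    grads:     per-sample gradients of log g, shape (n, p)
    hess_diag: per-sample diagonal second derivatives of log g, shape (n, p)
    """
    n = grads.shape[0]
    I_diag = np.mean(grads**2, axis=0)       # diagonal of the empirical Fisher
    J_diag = -np.mean(hess_diag, axis=0)     # diagonal of the observed Hessian
    return np.sum(I_diag / J_diag) / n
```

Only two length-$p$ vectors are kept, so memory grows linearly in the number of parameters rather than quadratically.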

Numerical stabilization of $\hat J_{jj}^{-1}$ is required to avoid division by near-zero or negative diagonal entries; this is accomplished by truncating small scores and reweighting, as detailed in Ward (2020).
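The exact truncation and reweighting scheme is described in Ward (2020) and is not reproduced here. As a generic stand-in, the sketch below simply drops coordinates whose diagonal curvature is not safely positive; the function name and the `floor` threshold are assumptions made for illustration.

```python
import numpy as np

def stabilized_diag_correction(I_diag, J_diag, n, floor=1e-6):
    """Diagonal ICE correction that skips unsafe denominators.

    Coordinates with J_diag <= floor contribute zero instead of a
    huge or negative ratio. This is a simple safeguard, not the
    scheme from Ward (2020).
    """
    I_diag = np.asarray(I_diag, dtype=float)
    J_diag = np.asarray(J_diag, dtype=float)
    ok = J_diag > floor
    # The inner where() substitutes 1.0 so no division by zero is evaluated.
    ratios = np.where(ok, I_diag / np.where(ok, J_diag, 1.0), 0.0)
    return ratios.sum() / n
```

Dropping a coordinate removes its penalty entirely; clamping the denominator to `floor` instead would penalize it heavily. Which behaviour is preferable depends on the model, and this sketch does not settle that.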

5. Empirical Results and Applications

ICE has been empirically validated in both classical statistical models and large-scale neural networks:

Parametric models (Dixon et al., 2018):

  • Gaussian location-scale: ICE reduces out-of-sample KL divergence relative to MLE, with the largest reduction at the smallest sample size $n$ tested and a remaining advantage at the largest. $L_2$ regularization was neutral or harmful.
  • Friedman nonlinear regression: ICE reduced KL divergence at all sample sizes tested, while $L_2$ regularization was only marginally effective, and only in a restricted regime.
  • Synthetic logistic regression: ICE reduced generalization KL when $n$ was small; this advantage attenuated as $n$ increased.

MLPs and distributed training (Ward, 2020):

  • Implemented in Apache Spark’s MultilayerPerceptronClassifier via a boolean switch useICE.
  • On the Freddie Mac Single-Family Loan-Level dataset, across architectures of increasing size (36, 78, 159, 291 parameters), ICE significantly reduced the train/test generalization gap at small training-set sizes $n$, with statistically significant improvements in test cross-entropy loss. For large $n$, both ICE and MLE converged to similar test error, but ICE consistently had less variability.
  • ICE added overhead to the loss/gradient computation per iteration, but required fewer optimization steps and thus had similar or lower total runtime.

6. Comparison with MLE and Ridge Regularization

The distinguishing characteristics of ICE, MLE, and $L_2$ regularization are outlined as follows:

| Method | Objective | Bias order | Hyper-parameters | Applicability | Overfitting control |
| --- | --- | --- | --- | --- | --- |
| MLE | $-\ell(\theta)$ | $O(n^{-1})$ | None | Any parametric model | Poor at small $n$, overfits |
| Ridge ($L_2$) | $-\ell(\theta) + \lambda\lVert\theta\rVert_2^2$ | $\lambda$-dependent | $\lambda$ | Linear, small-moderate $p$ | Needs tuning, unreliable in nonlinear settings |
| ICE | $-\ell(\theta) + \frac{1}{n}\mathrm{tr}(\hat I \hat J^{-1})$ | $O(n^{-3/2})$ | None | Any twice-differentiable model | Data-driven, no tuning required, robust at small-moderate $n$ |

MLE is optimistically biased, particularly in highly parameterized or small-sample regimes. Ridge regularization requires hyper-parameter selection, commonly via cross-validation, and is less robust for non-Gaussian or highly nonlinear models. ICE directly removes the $O(n^{-1})$ generalization bias, achieving $O(n^{-3/2})$ bias without requiring model-specific hyper-parameters, and derives its correction entirely from the data.

7. Practical Deployment and Usage Guidelines

ICE is suitable for scenarios where the prevention of overfitting is paramount and hyper-parameter free operation is desired:

  • Particularly advantageous for small to moderate sample sizes $n$ and models with many parameters $p$.
  • In distributed frameworks (e.g., Spark ML, via setUseICE(true)), ICE is drop-in and adds one additional per-iteration pass to accumulate diagonal Hessians.
  • In custom implementations, a single backward pass should accumulate both gradient and diagonal Hessian per parameter, and apply the ICE correction within the loss function.
  • For extremely small $n$, the diagonal approximation may be restrictive; block-diagonal or low-rank alternatives can be explored, at increased computational cost.
  • ICE integrates less naturally with explicit regularization such as dropout or $L_2$ penalties due to potential double penalization effects.
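For the distributed case, the extra pass amounts to accumulating two per-parameter sums on each partition and merging them. The sketch below shows that accumulate-and-merge pattern in plain NumPy; it does not use any Spark API, and the class `DiagIceAccumulator` and its methods are hypothetical names introduced for illustration.

```python
import numpy as np

class DiagIceAccumulator:
    """Running sums needed for the diagonal ICE correction."""

    def __init__(self, p):
        self.n = 0
        self.sq_grad_sum = np.zeros(p)    # sum over samples of squared scores
        self.hess_diag_sum = np.zeros(p)  # sum over samples of Hessian diagonals

    def add(self, grads, hess_diag):
        """Fold in one batch; both arrays have shape (batch, p)."""
        self.n += grads.shape[0]
        self.sq_grad_sum += np.sum(grads**2, axis=0)
        self.hess_diag_sum += np.sum(hess_diag, axis=0)
        return self

    def merge(self, other):
        """Combine with an accumulator built on another partition."""
        self.n += other.n
        self.sq_grad_sum += other.sq_grad_sum
        self.hess_diag_sum += other.hess_diag_sum
        return self

    def correction(self):
        """Diagonal approximation of (1/n) * tr(I_hat J_hat^{-1})."""
        I_diag = self.sq_grad_sum / self.n
        J_diag = -self.hess_diag_sum / self.n
        return np.sum(I_diag / J_diag) / self.n
```

Because only sums are stored, merging partitions in any order gives the same correction as a single pass over the full data, up to floating-point rounding.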

ICE is statistically robust in real-world distributed neural architectures, requires minimal code changes, incurs negligible memory overhead, and is validated as a practical alternative to unregularized likelihood optimization wherever model selection or out-of-sample accuracy is critical (Dixon et al., 2018, Ward, 2020).
