Information-Corrected Estimator (ICE) is a method that adjusts the empirical log-likelihood with a data-driven trace correction to counteract KL divergence bias.
It reduces generalization error from O(1/n) to O(n^(-3/2)) by effectively canceling the leading bias term without the need for hyperparameter tuning.
ICE applies to any twice-differentiable model and has demonstrated robust improvements in both classical parametric models and modern neural network architectures.
Information-Corrected Estimator (ICE) is a parameter estimation procedure designed to reduce generalization error in supervised machine learning by explicitly correcting for the Kullback–Leibler (KL) divergence generalization bias intrinsic to standard maximum likelihood estimation (MLE) and L2 (ridge) regularization. The ICE estimator augments the loss function with a data-driven trace correction, requiring no tuning or hyper-parameter search, and applies to any model with a twice-differentiable likelihood. Empirical studies and distributed implementations in modern neural network frameworks demonstrate robust out-of-sample performance improvements over unregularized MLE and ridge, especially in moderate sample regimes and complex model classes (Dixon et al., 2018, Ward, 2020).
1. Formal Definition and Objective
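ICE modifies the standard empirical log-likelihood objective by introducing a finite-sample correction based on Fisher information and the observed Hessian. Given IID data $\{x_i\}_{i=1}^{n}$ and a parametric model $g(x \mid \theta)$, the average log-likelihood is
$$\ell(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log g(x_i \mid \theta).$$
Define the empirical Fisher information and observed Hessian as
$$\hat I(\theta) = \frac{1}{n}\sum_{i=1}^{n} \big[\nabla_\theta \log g(x_i \mid \theta)\big]\big[\nabla_\theta \log g(x_i \mid \theta)\big]^{\top},$$
$$\hat J(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} \log g(x_i \mid \theta).$$
The ICE objective is
$$-\ell^{*}(\theta) = -\ell(\theta) + \frac{1}{n}\operatorname{tr}\big(\hat I(\theta)\,\hat J(\theta)^{-1}\big),$$
with the ICE estimator
$$\hat\theta_{\mathrm{ICE}} = \arg\min_{\theta}\,\{-\ell^{*}(\theta)\}.$$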
No additional penalty or tuning parameter is introduced.
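To make the objective concrete, the following minimal sketch evaluates it for a Gaussian location-scale model. The parameterization $\theta = (\mu, \log\sigma)$, the function name `gaussian_ice_objective`, and the toy data are assumptions chosen for illustration and do not come from the cited papers; the per-sample scores and second derivatives are written out by hand, and the 2×2 trace is expanded in closed form.

```python
import math

def gaussian_ice_objective(xs, mu, log_sigma):
    """Return (ICE objective, trace correction) for a Gaussian, theta = (mu, log sigma)."""
    n = len(xs)
    w = math.exp(-log_sigma)              # 1 / sigma
    loglik = 0.0
    i11 = i12 = i22 = 0.0                 # entries of the empirical Fisher information
    j11 = j12 = j22 = 0.0                 # entries of the observed Hessian (sign already flipped)
    for x in xs:
        z = (x - mu) * w                  # standardized residual
        loglik += -0.5 * math.log(2.0 * math.pi) - log_sigma - 0.5 * z * z
        g_mu, g_s = z * w, z * z - 1.0    # per-sample score w.r.t. mu and log sigma
        i11 += g_mu * g_mu
        i12 += g_mu * g_s
        i22 += g_s * g_s
        j11 += w * w                      # minus the per-sample second derivatives
        j12 += 2.0 * z * w
        j22 += 2.0 * z * z
    i11, i12, i22 = i11 / n, i12 / n, i22 / n
    j11, j12, j22 = j11 / n, j12 / n, j22 / n
    det = j11 * j22 - j12 * j12
    # tr(I J^{-1}) for symmetric 2x2 matrices, expanded explicitly
    trace = (i11 * j22 - 2.0 * i12 * j12 + i22 * j11) / det
    correction = trace / n
    return -loglik / n + correction, correction

objective, correction = gaussian_ice_objective(
    [-1.0, 1.0, -1.0, 1.0], mu=0.0, log_sigma=0.0
)
```

The second return value isolates the $\frac{1}{n}\operatorname{tr}(\hat I \hat J^{-1})$ term, so it can be compared directly with the uncorrected negative log-likelihood.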
2. Theoretical Motivation: KL Divergence and Bias Correction
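ICE is derived to target the out-of-sample KL divergence from the true data law $f(x)$ to $g(x \mid \theta)$:
$$\mathrm{KL}\big(f \,\|\, g_\theta\big) = \int f(x)\,\log\frac{f(x)}{g(x \mid \theta)}\,dx.$$
MLE underestimates out-of-sample error due to an optimistic $O(n^{-1})$ bias in the empirical log-likelihood. Under regularity and smoothness conditions (including White’s conditions), a second-order expansion shows that
$$\mathbb{E}\big[\ell(\hat\theta)\big] - \mathbb{E}\big[\mathbb{E}_{x \sim f}\log g(x \mid \hat\theta)\big] = \frac{1}{n}\operatorname{tr}\big(I\,J^{-1}\big) + O\big(n^{-3/2}\big),$$
where $I$ and $J$ are the population Fisher information and Hessian. By adding the empirical counterpart $\frac{1}{n}\operatorname{tr}\big(\hat I(\theta)\,\hat J(\theta)^{-1}\big)$ to the objective, ICE cancels the leading bias term, reducing the generalization error’s order from $O(n^{-1})$ to $O(n^{-3/2})$ (Dixon et al., 2018).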
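A useful sanity check on the trace term is the correctly specified case. The worked equation below relies on the standard information matrix equality rather than on anything quoted from the source, and the symbols $p$ (number of parameters) and $\theta_0$ (true parameter) are introduced here purely for illustration.

$$f(x) = g(x \mid \theta_0) \;\Longrightarrow\; I(\theta_0) = J(\theta_0) \;\Longrightarrow\; \frac{1}{n}\operatorname{tr}\big(I\,J^{-1}\big) = \frac{1}{n}\operatorname{tr}\big(\mathrm{Id}_p\big) = \frac{p}{n}.$$

Under this assumption the correction reduces to an AIC-style penalty of $p/n$ per observation. Under misspecification $I \neq J$ in general, and the trace can be larger or smaller than $p$.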
3. Theoretical Guarantees
The central properties of ICE are as follows:
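Asymptotic Normality: Under standard regularity conditions and for the population minimizer $\theta_0$,
$$\sqrt{n}\,\big(\hat\theta_{\mathrm{ICE}} - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\; (J^{*})^{-1}\, I^{*}\, (J^{*})^{-1}\big),$$
where $I^{*}$ and $J^{*}$ are the “corrected” information and Hessian tensors associated with $\ell^{*}$.
Bias Reduction: The expectation of the ICE-corrected loss at the minimizer satisfies
$$\mathbb{E}\big[-\ell^{*}(\hat\theta_{\mathrm{ICE}})\big] = \mathbb{E}\big[-\mathbb{E}_{x \sim f}\log g(x \mid \hat\theta_{\mathrm{ICE}})\big] + O\big(n^{-3/2}\big),$$
while ordinary MLE has an $O(n^{-1})$ discrepancy. Thus, ICE achieves a strictly smaller asymptotic bias at finite $n$ (Dixon et al., 2018).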
4. Algorithmic Implementation and Approximations
Optimization proceeds by iterative minimization of $-\ell^{*}(\theta)$. At each iteration:
Compute per-sample gradients and Hessians (or their diagonals).
Form empirical estimators $\hat I(\theta)$ and $\hat J(\theta)$.
Compute the correction term $\frac{1}{n}\operatorname{tr}\big(\hat I(\theta)\,\hat J(\theta)^{-1}\big)$.
Update parameters via a quasi-Newton or similar gradient-based solver.
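The steps above can be sketched on a one-parameter model, where the matrices reduce to scalars. This is a minimal illustration under stated assumptions: the exponential model with log-rate $\theta$, the names `ice_objective` and `fit_ice`, the toy data, and the learning rate are all hypothetical, and plain gradient descent with a finite-difference gradient stands in for the quasi-Newton solver mentioned in the text.

```python
import math

def ice_objective(theta, xs):
    """ICE objective for an exponential model with rate exp(theta)."""
    n = len(xs)
    rate = math.exp(theta)
    scores = [1.0 - rate * x for x in xs]      # step 1: per-sample gradients of log g
    hessians = [-rate * x for x in xs]         # step 1: per-sample second derivatives
    loglik = sum(theta - rate * x for x in xs) / n
    fisher = sum(s * s for s in scores) / n    # step 2: empirical Fisher information (scalar)
    hessian = -sum(hessians) / n               # step 2: observed Hessian (positive for x > 0)
    return -loglik + (fisher / hessian) / n    # step 3: add the trace correction

def fit_ice(xs, theta=0.0, lr=0.1, steps=500, h=1e-5):
    """Step 4: gradient descent using a central finite-difference gradient."""
    for _ in range(steps):
        grad = (ice_objective(theta + h, xs) - ice_objective(theta - h, xs)) / (2.0 * h)
        theta -= lr * grad
    return theta

xs = [0.5, 1.0, 1.5, 2.0, 0.2, 0.8]
theta_mle = -math.log(sum(xs) / len(xs))       # closed-form MLE of the log-rate
theta_ice = fit_ice(xs, theta=theta_mle)
```

Starting the descent from the closed-form MLE makes the effect of the correction visible: on this toy sample the ICE estimate moves to a slightly smaller log-rate than the MLE.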
Complexity per iteration depends on the required matrix operations, where $p$ denotes the number of parameters:

| Step | Time Complexity | Memory |
| --- | --- | --- |
| Forming $\hat I$, $\hat J$ | $O(np^{2})$ | $O(p^{2})$ |
| Inverting $\hat J$ | $O(p^{3})$ | $O(p^{2})$ |
| Diagonal approximation to $\hat J$ | $O(np)$ | $O(p)$ |
Practical implementations, such as in multilayer perceptron (MLP) models, use a diagonal approximation $\hat J(\theta) \approx \operatorname{diag}\big(\hat J(\theta)\big)$ to obtain
$$\frac{1}{n}\operatorname{tr}\big(\hat I(\theta)\,\hat J(\theta)^{-1}\big) \approx \frac{1}{n}\sum_{j}\frac{\hat I_{jj}(\theta)}{\hat J_{jj}(\theta)},$$
which matches the computational cost of vanilla backpropagation up to a small multiplicative factor (Ward, 2020).
Numerical stabilization of the diagonal entries $\hat J_{jj}(\theta)$ is required to avoid division by near-zero or negative values; this is accomplished by truncating small scores and reweighting, as detailed in Ward (2020).
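The diagonal approximation can be sketched as follows. The function name `diagonal_ice_correction` and the `floor` parameter are hypothetical, and clamping each diagonal Hessian entry from below is only a simple stand-in for stabilization; it is not the truncation-and-reweighting scheme of Ward (2020), whose details are not reproduced here.

```python
def diagonal_ice_correction(grads, hess_diags, floor=1e-8):
    """Approximate (1/n) tr(I J^{-1}) by (1/n) * sum_j I_jj / J_jj.

    grads[i][j]      : d log g(x_i | theta) / d theta_j
    hess_diags[i][j] : d^2 log g(x_i | theta) / d theta_j^2
    floor            : assumed lower bound applied to each J_jj before dividing
    """
    n, p = len(grads), len(grads[0])
    total = 0.0
    for j in range(p):
        fisher_jj = sum(g[j] ** 2 for g in grads) / n       # diagonal of the empirical Fisher
        hessian_jj = -sum(h[j] for h in hess_diags) / n     # diagonal of the observed Hessian
        total += fisher_jj / max(hessian_jj, floor)         # clamp near-zero or negative entries
    return total / n

# Two samples, two parameters, with well-behaved curvature.
well_behaved = diagonal_ice_correction(
    [[1.0, 2.0], [-1.0, 0.0]],
    [[-2.0, -1.0], [-2.0, -3.0]],
)
# A negative diagonal entry is replaced by the floor instead of flipping the sign.
clamped = diagonal_ice_correction([[1.0], [1.0]], [[1.0], [1.0]], floor=0.5)
```

Only per-parameter sums are stored, so memory grows linearly in the number of parameters rather than quadratically.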
5. Empirical Results and Applications
ICE has been empirically validated in both classical statistical models and large-scale neural networks:
Gaussian location-scale: Across the sample sizes tested, ICE reduces out-of-sample KL divergence relative to MLE, with the largest gain at the smallest $n$ and a smaller but persistent advantage at the largest $n$. $L_2$ regularization was neutral or harmful.
Friedman nonlinear regression: ICE reduced KL divergence at every sample size tested, while $L_2$ regularization was only marginally effective, and only in part of the tested range.
Synthetic logistic regression: ICE reduced generalization KL when $n$ was small; this advantage attenuated as $n$ increased.
ICE has been implemented in Apache Spark’s MultilayerPerceptronClassifier via a boolean switch useICE.
On the Freddie Mac Single-Family Loan-Level dataset, across a range of architecture sizes (36, 78, 159, 291 parameters), ICE significantly reduced the train/test generalization gap at smaller training-sample sizes, with statistically significant improvements in test cross-entropy loss. For large $n$, both ICE and MLE converged to similar test error, but ICE consistently had less variability.
ICE added overhead to the loss/gradient computation in each iteration but required fewer optimization steps, giving similar or lower total runtime.
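For the Gaussian location-scale case, out-of-sample KL divergence has a well-known closed form, which makes comparisons between fitted parameters straightforward to reproduce. The helper below is a minimal sketch; the name `gaussian_kl` is hypothetical, and it is an assumption that the cited experiments evaluate KL in this closed form.

```python
import math

def gaussian_kl(mu_true, sigma_true, mu_fit, sigma_fit):
    """KL( N(mu_true, sigma_true^2) || N(mu_fit, sigma_fit^2) ) in nats."""
    return (
        math.log(sigma_fit / sigma_true)
        + (sigma_true ** 2 + (mu_true - mu_fit) ** 2) / (2.0 * sigma_fit ** 2)
        - 0.5
    )

kl_exact_fit = gaussian_kl(0.0, 1.0, 0.0, 1.0)     # fitted model equals the truth
kl_shifted_fit = gaussian_kl(0.0, 1.0, 1.0, 1.0)   # fitted mean is off by one unit
```

Evaluating this quantity for the MLE and ICE estimates on repeated samples of size $n$ gives the kind of out-of-sample comparison described above.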
6. Comparison with MLE and Ridge Regularization
The distinguishing characteristics between ICE, MLE, and $L_2$ regularization are outlined as follows:

| Method | Objective | Bias Order | Hyper-parameters | Applicability | Overfitting control |
| --- | --- | --- | --- | --- | --- |
| MLE | $-\ell(\theta)$ | $O(n^{-1})$ | None | Any parametric model | Poor at small $n$, overfits |
| Ridge ($L_2$) | $-\ell(\theta) + \lambda\lVert\theta\rVert_2^{2}$ | $\lambda$-dependent | $\lambda$ | Linear, small-moderate $p$ | Needs tuning, unreliable in nonlinear settings |
| ICE | $-\ell(\theta) + \frac{1}{n}\operatorname{tr}\big(\hat I(\theta)\,\hat J(\theta)^{-1}\big)$ | $O(n^{-3/2})$ | None | Any twice-differentiable model | Data-driven, no tuning required, robust at small-moderate $n$ |
MLE is optimistically biased, particularly in highly parameterized or small-sample regimes. Ridge regularization requires hyper-parameter selection, commonly via cross-validation, and is less robust for non-Gaussian or highly nonlinear models. ICE directly removes the $O(n^{-1})$ generalization bias, achieving $O(n^{-3/2})$ bias without requiring model-specific hyper-parameters, and derives its correction entirely from the data.
7. Practical Deployment and Usage Guidelines
ICE is suitable for scenarios where the prevention of overfitting is paramount and hyper-parameter free operation is desired:
Particularly advantageous for small to moderate sample sizes $n$ and for models with many parameters.
In distributed frameworks (e.g., Spark ML, via setUseICE(true)), ICE is drop-in and adds one additional per-iteration pass to accumulate diagonal Hessians.
In custom implementations, a single backward pass should accumulate both gradient and diagonal Hessian per parameter, and apply the ICE correction within the loss function.
For extremely small $n$, the diagonal approximation may be restrictive; block-diagonal or low-rank alternatives can be explored, at increased computational cost.
ICE integrates less naturally with explicit regularization such as dropout or $L_2$ penalties, because the two corrections may penalize the same overfitting twice.
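For a custom implementation, the single-pass accumulation described above might look like the following sketch for logistic regression. The function name `logistic_ice_pass`, the `floor` parameter, and the toy data are assumptions for illustration. The returned gradient is that of the mean log-likelihood only; differentiating the correction term itself is left out to keep the example short.

```python
import math

def logistic_ice_pass(weights, X, y, floor=1e-8):
    """One pass over the data: returns (ICE loss, gradient of the mean log-likelihood)."""
    n, p = len(X), len(weights)
    loglik = 0.0
    grad = [0.0] * p
    fisher_diag = [0.0] * p      # running diagonal of the empirical Fisher information
    hessian_diag = [0.0] * p     # running diagonal of the observed Hessian
    for xi, yi in zip(X, y):
        logit = sum(w * v for w, v in zip(weights, xi))
        prob = 1.0 / (1.0 + math.exp(-logit))
        loglik += yi * math.log(prob) + (1 - yi) * math.log(1.0 - prob)
        for j in range(p):
            score = (yi - prob) * xi[j]                       # per-sample gradient entry
            grad[j] += score / n
            fisher_diag[j] += score * score / n
            hessian_diag[j] += prob * (1.0 - prob) * xi[j] ** 2 / n
    # Diagonal trace correction, with an assumed floor on the Hessian entries.
    correction = sum(f / max(h, floor) for f, h in zip(fisher_diag, hessian_diag)) / n
    return -loglik / n + correction, grad

loss, grad = logistic_ice_pass([0.0, 0.0], [[1.0, 1.0], [1.0, -1.0]], [1, 0])
```

Gradient, Fisher diagonal, and Hessian diagonal are all accumulated in the same loop, so the extra cost over a plain likelihood pass is a constant factor per parameter.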
ICE is statistically robust in real-world distributed neural architectures, requires minimal code changes, incurs negligible memory overhead, and is validated as a practical alternative to unregularized likelihood optimization wherever model selection or out-of-sample accuracy is critical (Dixon et al., 2018, Ward, 2020).