Information Geometry and Asymptotic Theory for SMML Estimators

Published 6 Apr 2026 in math.ST | (2604.05241v1)

Abstract: We develop an asymptotic theory for strict minimum message length (SMML) estimators in regular parametric models with countable data spaces. We show that, asymptotically, the optimal SMML partition is induced by a weighted Fisher--Voronoi tessellation in parameter space, pulled back through the maximum likelihood estimator. We further show that each SMML codepoint is asymptotically a weighted average of the maximum likelihood estimates associated with observations in its cell. These results imply that the SMML estimator is consistent and converges at the usual parametric $n^{-1/2}$ rate under standard regularity conditions. We also give a Kullback--Leibler projection interpretation of SMML codepoints and a decomposition of the expected SMML codelength into an assertion entropy and an expected conditional cross-entropy. In exponential families, the theory simplifies further: SMML codepoints satisfy a moment-matching condition, and optimal SMML cells are induced by a polyhedral partition of the sufficient-statistic space.

Summary

  • The paper introduces an information‐geometric framework linking SMML estimation with quantization and rate–distortion theory.
  • It shows that SMML estimators are consistent and converge at the parametric $n^{-1/2}$ rate, with codepoints concentrating around maximum likelihood estimates in the Fisher–Rao geometry.
  • Results for exponential family models reveal a polyhedral, moment-matching structure that clarifies both practical and theoretical implications.

Introduction and Scope

The paper "Information Geometry and Asymptotic Theory for SMML Estimators" (2604.05241) presents a rigorous development of the asymptotic properties of strict minimum message length (SMML) estimators in regular parametric models with countable data spaces. The work offers a comprehensive information-geometric framework for SMML, revealing deep connections to quantization and local model structure, especially in relation to the Fisher–Rao geometry and rate-distortion theory. The authors formalize the asymptotic structure, provide convergence results, and analyze the distinctive geometry in exponential family settings.

SMML: Interpretation and Criteria

The SMML principle produces two-part codes that minimize expected codelength by partitioning the data space and assigning a parameter codepoint to each partition cell. The objective codelength for a partition $\mathcal{P}$ is given by:

$$\mathcal{I}(\mathcal{P}) = -\sum_j q_j \log q_j - \sum_j \sum_{\mathbf{x}\in P_j} r(\mathbf{x}) \log p_n(\mathbf{x}|\bm{\theta}_j^*)$$

where $q_j$ are the cell assertion probabilities and $r(\mathbf{x})$ is the marginal data distribution. Each codepoint $\bm{\theta}_j^*$ is selected as the Kullback–Leibler projection of the normalized distribution over $P_j$ onto the model family, minimizing the within-cell expected log-loss. Hence, SMML codepoints admit an interpretation as information projections. Moreover, the SMML codelength naturally decomposes into an assertion entropy and an expected conditional cross-entropy, directly connecting to rate–distortion theory.
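As a concrete illustration of this objective, the following minimal sketch evaluates the two terms of the codelength for a Binomial(n, θ) model. It is not taken from the paper: the uniform prior on θ (which gives the marginal r(x) = 1/(n+1)), the usual SMML convention that q_j is the sum of r(x) over the cell, and the function names `binom_pmf` and `smml_codelength` are all assumptions made for the example. For the binomial, the Kullback–Leibler projection of a cell reduces to the r-weighted mean of x/n.

```python
import math

def binom_pmf(x, n, theta):
    """Binomial probability of x successes in n trials."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def smml_codelength(n, cells):
    """Return (assertion entropy, expected detail length, codepoints) in nats.

    Assumption: uniform prior on theta, so r(x) = 1/(n+1) for x = 0..n.
    `cells` is a list of lists of counts that together partition {0, ..., n}.
    """
    r = 1.0 / (n + 1)
    assertion, detail, codepoints = 0.0, 0.0, []
    for cell in cells:
        q = r * len(cell)                      # cell assertion probability q_j
        theta = sum(cell) / (n * len(cell))    # KL projection = mean of x/n in the cell
        codepoints.append(theta)
        assertion -= q * math.log(q)           # contribution to -sum_j q_j log q_j
        for x in cell:
            detail -= r * math.log(binom_pmf(x, n, theta))
    return assertion, detail, codepoints

# Compare a one-cell partition with a three-cell partition for n = 10.
one = smml_codelength(10, [list(range(11))])
three = smml_codelength(10, [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9, 10]])
print("one cell:   ", one[0] + one[1])
print("three cells:", three[0] + three[1])
```

The three-cell partition pays a positive assertion entropy but gains more in the detail term, which is the trade-off the decomposition makes explicit.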

Asymptotic Quantization and Regularity

The asymptotic regime of the SMML estimator is considered as the sample size $n \to \infty$. The partition of parameter space induced by SMML forms a quantization whose local mesh size, in the Fisher–Rao metric, scales as $O(n^{-1/2})$, yielding a codepoint lattice of effective size $k_n = O(n^{p/2})$ for $p$-dimensional models. Crucially, this granularity guarantees that as $n$ increases, SMML codepoints become tightly concentrated around the maximum likelihood estimates (MLEs) corresponding to each data partition cell.
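To make the scaling tangible in the simplest case (p = 1), the sketch below counts how many cells of Fisher–Rao width c/√n are needed to cover the Bernoulli parameter interval, whose total per-observation Fisher–Rao length is π. The constant `cell_width` (the c above) and both function names are hypothetical choices for illustration; the paper's actual constants are not reproduced here.

```python
import math

def fisher_rao_length_bernoulli(a, b):
    """Per-observation Fisher-Rao distance between Bernoulli parameters a and b."""
    return 2.0 * abs(math.asin(math.sqrt(a)) - math.asin(math.sqrt(b)))

def approx_num_codepoints(n, cell_width=1.0):
    """Cells needed to cover [0, 1] when each cell has Fisher-Rao width cell_width / sqrt(n).

    cell_width is an illustrative constant, not a value from the paper.
    """
    total = fisher_rao_length_bernoulli(0.0, 1.0)   # equals pi
    return math.ceil(total * math.sqrt(n) / cell_width)

for n in (100, 400, 1600):
    print(n, approx_num_codepoints(n))
```

Quadrupling n roughly doubles the count, which matches $k_n = O(n^{p/2})$ with $p = 1$.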

Fisher–Rao Geometry and SMML Partitions

A central result is the asymptotic characterization of optimal SMML partitions in terms of information geometry. The optimal data-space partition is the pullback, via the MLE map, of a weighted Fisher–Voronoi tessellation in the parameter space. Each observation is assigned to the codepoint that minimizes the sum of a squared Fisher–Rao distance term and a term depending on the codepoint's assertion probability. If assertion probabilities are asymptotically uniform, SMML partitions align with classical unweighted Fisher–Voronoi tessellations.
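The assignment rule can be sketched for a Bernoulli model as follows. The specific cost $(n/2)\, d_F^2 - \log q_j$ is an assumed form chosen to match the verbal description (a squared Fisher–Rao distance plus an assertion-probability term); the paper's exact weighting may differ, and the function names are hypothetical.

```python
import math

def fisher_rao_bernoulli(a, b):
    """Per-observation Fisher-Rao distance between Bernoulli parameters a and b."""
    return 2.0 * abs(math.asin(math.sqrt(a)) - math.asin(math.sqrt(b)))

def assign_cell(mle, codepoints, q, n):
    """Index of the codepoint with the smallest weighted Fisher-Voronoi cost.

    Assumed cost: (n / 2) * d_F(mle, theta_j)^2 - log q_j.
    """
    costs = [0.5 * n * fisher_rao_bernoulli(mle, t) ** 2 - math.log(qj)
             for t, qj in zip(codepoints, q)]
    return min(range(len(costs)), key=costs.__getitem__)

codepoints = [0.1, 0.5, 0.9]
# With uniform weights the rule is a plain nearest-codepoint rule in Fisher-Rao distance.
print(assign_cell(0.27, codepoints, [1 / 3] * 3, n=10))
# A heavily asserted middle codepoint enlarges its own cell.
print(assign_cell(0.27, codepoints, [0.05, 0.9, 0.05], n=10))
```

With uniform `q` the log term is constant and drops out, recovering the unweighted tessellation; with non-uniform `q` the cell boundaries shift toward the less probable codepoints.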

Each codepoint in an SMML partition is shown to be, asymptotically, a weighted average of the MLEs corresponding to data in its partition cell, up to asymptotically negligible corrections. This provides a strong geometric link between the local structure of the codebook and the global partitioning imposed by SMML.

Consistency and Convergence Rate

Under standard regularity conditions, the SMML estimator is proven to be consistent: it converges in probability to the true generating parameter at the parametric rate $n^{-1/2}$. This convergence result relies on the vanishing local partition diameter imposed by the Fisher–Rao geometry, establishing that the coding discretization does not distort the asymptotic behavior relative to MLEs.
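The reasoning behind this rate can be sketched with a triangle inequality. This is a heuristic outline rather than the paper's proof, and the notation is assumed: $\hat{\bm{\theta}}_n$ is the MLE, $\bm{\theta}_0$ the true parameter, $d_F$ the Fisher–Rao distance, and $\bm{\theta}^*_{j(\mathbf{x})}$ the codepoint of the cell containing the observed data. It also assumes the codepoint lies within $O(n^{-1/2})$ of every MLE in its cell, and that the MLE itself is $\sqrt{n}$-consistent.

```latex
% Heuristic sketch under the assumptions stated above.
\begin{align*}
d_F\big(\bm{\theta}^*_{j(\mathbf{x})}, \bm{\theta}_0\big)
  &\le d_F\big(\bm{\theta}^*_{j(\mathbf{x})}, \hat{\bm{\theta}}_n\big)
     + d_F\big(\hat{\bm{\theta}}_n, \bm{\theta}_0\big) \\
  &= O(n^{-1/2}) + O_p(n^{-1/2}) \;=\; O_p(n^{-1/2}).
\end{align*}
```

The first term is controlled by the cell diameter (the quantization error) and the second by the sampling error of the MLE, so neither dominates the other asymptotically.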

SMML in Exponential Families

For models in the exponential family, the SMML theory simplifies and specializes:

  • The codepoint for each partition cell is characterized by a moment-matching condition: its model expectation of the sufficient statistics equals the cellwise $r$-weighted average of the sufficient statistics.
  • SMML partition cells correspond to convex polyhedra in the space of sufficient statistics, with partition boundaries given by affine inequalities in natural parameters.
  • For multinomial and binomial models, these cells map naturally to intervals or convex polytopes in count space, providing explicit descriptions that can be leveraged in combinatorial and computational investigations.

These connections embed SMML in the dually flat geometry of exponential families, where the Fisher information provides a Riemannian metric, natural and mean value parameters form dual affine coordinate systems, and the KL divergence is a Bregman divergence with affine Voronoi structure.
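Both properties follow from a short calculation, given here as the standard exponential-family argument rather than as the paper's own derivation. The notation $p_n(\mathbf{x}|\bm{\eta}) = h(\mathbf{x}) \exp\{\bm{\eta}^\top T(\mathbf{x}) - A(\bm{\eta})\}$ for the full-sample density, with natural parameter $\bm{\eta}$, sufficient statistic $T$, and log-partition function $A$, is assumed, as is the convention $q_j = \sum_{\mathbf{x}\in P_j} r(\mathbf{x})$.

```latex
% Moment matching: stationarity of the within-cell expected log-loss.
\begin{align*}
\bm{\eta}_j^* &= \arg\min_{\bm{\eta}} \Big\{ -\sum_{\mathbf{x}\in P_j} r(\mathbf{x})
      \big[ \bm{\eta}^\top T(\mathbf{x}) - A(\bm{\eta}) \big] \Big\} \\
\Longrightarrow\quad
\nabla A(\bm{\eta}_j^*) &= \mathbb{E}_{\bm{\eta}_j^*}\big[T(\mathbf{X})\big]
      = \frac{1}{q_j} \sum_{\mathbf{x}\in P_j} r(\mathbf{x})\, T(\mathbf{x}).
\end{align*}

% Polyhedral cells: x prefers codepoint j to codepoint k when its message is no longer.
\begin{align*}
-\log q_j - \bm{\eta}_j^{*\top} T(\mathbf{x}) + A(\bm{\eta}_j^*)
  &\le -\log q_k - \bm{\eta}_k^{*\top} T(\mathbf{x}) + A(\bm{\eta}_k^*) \\
\Longleftrightarrow\quad
(\bm{\eta}_k^* - \bm{\eta}_j^*)^\top T(\mathbf{x})
  &\le A(\bm{\eta}_k^*) - A(\bm{\eta}_j^*) + \log\frac{q_j}{q_k}.
\end{align*}
```

The base-measure term $\log h(\mathbf{x})$ cancels from both sides of the comparison, so each pairwise condition is a half-space in $T(\mathbf{x})$, and a cell is the intersection of finitely many such half-spaces.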

Practical and Theoretical Implications

Practically, these results show that SMML estimators yield statistically efficient inferences with transparent geometric interpretation. The rate–distortion perspective elucidates SMML's role as a coding-theoretic regularizer: it enforces a discrete quantization of parameter space, marrying model-based data compression with parameter estimation. Because the SMML partitions and their geometric structure are determined by the local information content of the statistical model, this approach adapts naturally to model features and data complexities.

Theoretically, the asymptotic alignment with Fisher–Rao Voronoi geometry suggests further analysis relating to optimal quantization, parametric complexity, and information geometry. The KL projection interpretation links SMML directly to variational principles in information theory.

Moreover, the generality of the arguments—requiring only regularity, local quadratic log-likelihood structure, and suitable quantization—suggests extensions to more complex models, including those with singularities, latent structure, or overparameterization.

Future Directions

Notable open questions include:

  • Relaxing the quantization assumptions by deriving the geometric structure of SMML partitions from global optimality conditions, without postulating regular local mesh size
  • Extending the framework to singular or infinite-dimensional models, where Fisher–Rao geometry may degenerate or require nonparametric approaches
  • Studying computational methods for approximating optimal SMML codebooks in high-dimensional or combinatorial settings, especially for large-scale model selection

An intrinsic information-geometric derivation of SMML's properties would further strengthen its conceptual and technical foundations, potentially linking it to developments in statistical learning theory and non-asymptotic inference.

Conclusion

This paper systematically establishes that, in regular parametric models, the SMML estimator asymptotically possesses a rich information-geometric structure governed by local Fisher–Rao metrics and KL projections. The estimator is both statistically consistent and interpretable within rate–distortion and quantization theory. In exponential family models, the partition and codepoint structure become explicitly polyhedral and moment-matching, reflecting the dually flat geometry of these families. The framework opens new perspectives on the interplay between information theory, statistical estimation, and geometric structure in inference.
