Efficient optimization of asymptotically optimal codes for Transformers

Develop efficient optimization methods for the asymptotically optimal description length objectives for Transformer encoders, covering both the two-part codes and the adaptive variational codes parameterized by Gaussian mixture priors, so that these objectives can be minimized effectively in practice under finite computational resources.
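
To fix ideas, here is a minimal PyTorch sketch of a two-part description length objective: total bits are a naive fixed-width code for the parameters plus the negative log-likelihood of the data in bits. The uniform `precision_bits` parameter code and the `data_nll_bits` helper are illustrative assumptions, not the paper's actual coding scheme.

```python
import math
import torch
import torch.nn.functional as F

def data_nll_bits(model, inputs, targets):
    """L(D | theta): code length of the data under the model, in bits."""
    logits = model(inputs)
    nll_nats = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
    )
    return nll_nats / math.log(2.0)  # nats -> bits

def two_part_code_length(model, inputs, targets, precision_bits=16):
    """L(theta) + L(D | theta): bits to transmit the parameters under a
    naive fixed-width code (an assumption for illustration), then the
    data given those parameters."""
    num_params = sum(p.numel() for p in model.parameters())
    param_bits = num_params * precision_bits  # L(theta)
    return param_bits + data_nll_bits(model, inputs, targets)
```

A refined two-part code would replace the fixed-width parameter code with a shorter one; the variational form sketched under Background dispenses with explicit quantization altogether.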

Background

The paper proves the existence of asymptotically optimal families of description length codes for Transformer encoders, in both two-part and variational forms, and further constructs a tractable, differentiable adaptive variational objective using Gaussian mixture priors. These codes carry strong asymptotic compression guarantees grounded in Kolmogorov complexity.
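
To illustrate the adaptive variational construction, here is a hedged PyTorch sketch: a diagonal Gaussian posterior q over the flattened parameters, a learnable K-component Gaussian mixture prior p, and a Monte Carlo estimate of the code length E_q[L(D | theta)] + KL(q || p) / ln 2 in bits. The parameterization (`gaussian_mixture_prior`, `nll_bits_fn`, single-sample reparameterized estimate) is assumed for illustration and may differ from the paper's exact objective.

```python
import math
import torch
import torch.distributions as D

def gaussian_mixture_prior(logits, means, log_scales):
    """A K-component Gaussian mixture prior shared across all parameters
    (an assumed parameterization)."""
    mix = D.Categorical(logits=logits)        # mixture weights
    comp = D.Normal(means, log_scales.exp())  # component Gaussians
    return D.MixtureSameFamily(mix, comp)

def variational_code_length(mu, log_sigma, prior, nll_bits_fn, n_samples=1):
    """Monte Carlo estimate of a variational code length in bits:

        L(q) = E_q[ L(D | theta) ] + KL(q || p) / ln 2

    q is a diagonal Gaussian over the flattened parameters. The KL to a
    Gaussian mixture has no closed form, so it is estimated from
    reparameterized samples, keeping the objective differentiable.
    """
    q = D.Normal(mu, log_sigma.exp())
    total = torch.zeros(())
    for _ in range(n_samples):
        theta = q.rsample()  # differentiable sample of the parameters
        kl_nats = (q.log_prob(theta) - prior.log_prob(theta)).sum()
        total = total + nll_bits_fn(theta) + kl_nats / math.log(2.0)
    return total / n_samples
```

Because the mixture prior's logits, means, and scales can be optimized alongside q, the prior can adapt to wherever the posterior concentrates, which is one natural reading of "adaptive" here.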

Empirically, the authors find that standard gradient-based optimization fails to discover low-complexity solutions from random initialization on algorithmic tasks, indicating a gap between the theoretical guarantees and practical optimization. This motivates the unresolved question of whether these asymptotically optimal objectives can be optimized efficiently in practice.
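
For concreteness, the "standard gradient-based optimization" in question amounts to directly minimizing the differentiable code length with an off-the-shelf optimizer from random initialization; a minimal sketch follows (the optimizer choice and hyperparameters are assumptions, not the paper's exact setup).

```python
import torch

def minimize_code_length(params, objective_fn, steps=10_000, lr=1e-3):
    """Directly minimize a differentiable description length objective,
    e.g. the variational code length above, by gradient descent."""
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = objective_fn()  # total code length in bits
        loss.backward()
        opt.step()
    return params
```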

References

A family of codes that is asymptotically optimal represents a theoretical ideal, and while we have shown that practical instances of such codes exist, we have yet to show that they can be efficiently optimized.

Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers (Shaw et al., arXiv:2509.22445, 26 Sep 2025), Appendix, Section "Asymptotically Quasi-Optimal Families of Codes", first paragraph.