
Kolmogorov Complexity in Machine Learning

Updated 24 September 2025
  • Kolmogorov complexity is a measure of the minimal information required to describe data or models, forming a foundation in algorithmic information theory.
  • Its application in machine learning supports model selection, regularization, and clustering by employing compression-based proxies and structure functions.
  • Recent advances integrate Kolmogorov complexity with neural network design and reinforcement learning, influencing architectures like Kolmogorov–Arnold Networks.

Kolmogorov complexity is a central concept from algorithmic information theory that quantifies the minimal amount of information required to describe an object, typically measured as the length of the shortest program that outputs the object on a fixed universal machine. In the context of machine learning, Kolmogorov complexity provides a theoretical framework for understanding the information content, compressibility, and structure of data and models. This article reviews foundational definitions, key results, and practical ramifications of Kolmogorov complexity and its extensions as applied to machine learning, including discrete, continuous, and algebraic settings, recent algorithmic developments, and connections to regularization, generalization, and learning efficiency.

1. Foundations: Discrete and Real Kolmogorov Complexity

In the classical discrete setting, given a universal Turing machine $U$, the (plain) Kolmogorov complexity $K_U(\bar{x})$ of a finite binary string $\bar{x}$ is defined as

$$K_U(\bar{x}) = \min \{\,\text{length}(\langle M \rangle) : U(\langle M \rangle) = \bar{x}\,\}$$

where $\langle M \rangle$ is the encoding of the program $M$. In machine learning, this represents the minimal amount of algorithmic information required to generate a dataset, a label function, or a model (0809.2754).
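
Because the shortest program is uncomputable, any concrete description method only yields an upper bound on $K_U$. The following minimal Python sketch uses the standard-library zlib compressor as a stand-in for a fixed description method (an illustrative assumption, not part of the formal definition) to show that a highly regular string admits a much shorter description than a pseudorandom one of the same length.

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length of a zlib-compressed encoding: a crude upper-bound proxy for K(data)."""
    return len(zlib.compress(data, 9))

regular = b"01" * 5000           # highly structured: a short generating program exists
random_like = os.urandom(10000)  # incompressible with high probability

print(compressed_length(regular))      # small: the regularity is exploited
print(compressed_length(random_like))  # close to 10000: little structure to exploit
```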

Kolmogorov complexity over the reals generalizes this notion to the Blum–Shub–Smale (BSS) model of computation, where data and computation are over $\mathbb{R}$ instead of bits (0802.2027). For a universal BSS machine $U$, the real Kolmogorov complexity of $x \in \mathbb{R}^*$ is

$$K_U(x) := \min \{\,\text{size}(p) : p \in \mathbb{R}^*,\ U(p) = x\,\}$$

where “size” is the number of real constants required in a program. A crucial result is that over the reals, the Kolmogorov complexity of a vector is tightly linked to its transcendence degree: $\operatorname{trdeg}_{\mathbb{Q}}(x) \leq K_{U_0}(x) \leq \operatorname{trdeg}_{\mathbb{Q}}(x) + c$, where $c$ is a constant (often 1). This algebraic characterization enables measurement of the intrinsic complexity of real-valued data (e.g., feature vectors) and distinguishes “incompressible” from “compressible” data based on algebraic dependence.
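
As a worked illustration (the vector here is chosen for exposition, not drawn from the cited work): for $x = (\sqrt{2},\ \pi,\ \pi^2) \in \mathbb{R}^3$, the entry $\sqrt{2}$ is algebraic over $\mathbb{Q}$, $\pi$ is transcendental, and $\pi^2$ is algebraic over $\mathbb{Q}(\pi)$, so $\operatorname{trdeg}_{\mathbb{Q}}(x) = 1$ and hence $1 \leq K_{U_0}(x) \leq 1 + c$: a single real constant (essentially $\pi$) suffices, up to the additive constant.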

2. Stable Complexity and the Problem of Programming Language Dependence

Kolmogorov complexity is invariant up to an additive constant for different universal machines, but in practice, the magnitude of these constants can be large, especially for short strings, leading to ambiguities in the relative complexity of objects. To address this, frameworks have been proposed based on “naturalness” of the computational model. By considering output probability distributions over all programs in collections of natural models—such as small Turing machines (TM(2,2)) or elementary cellular automata (CA(1))—one can define a universal, stable complexity ordering (0804.3459). Group-theoretic symmetries (e.g., reversal, complementation) can further aggregate equivalent objects, sharpening the definition.

Stable complexity measures support development of “language-independent” priors and regularizers in machine learning, providing a consistent basis for model comparison and feature selection, regardless of the choice of programming language or representation.

3. Generalizations: Algorithmic Information, Structure Functions, and Algebraic Approaches

Algorithmic information theory distinguishes between “structural” and “random” components of data, formalized via the Kolmogorov structure function $h_x(\alpha) = \min \{\log |S| : S \ni x,\ K(S) \leq \alpha\}$, with $S$ a model set and $K(S)$ its complexity. This directly underpins the Minimum Description Length (MDL) principle: select the model $S$ and the index of $x$ in $S$ such that $K(S) + \log|S|$ is minimized (0809.2754).
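
The two-part code idea can be made concrete with a small Python sketch. It is illustrative only: fixed-precision polynomial coefficients stand in for $K(S)$ and a Gaussian code length for the residuals stands in for $\log|S|$; among polynomial model classes, the degree minimizing the total description length is selected.

```python
import numpy as np

def mdl_score(x, y, degree, precision_bits=16):
    """Two-part code length: L(model) + L(data | model), in bits (crude proxies)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    # L(model): each coefficient encoded to fixed precision (proxy for K(S)).
    model_bits = (degree + 1) * precision_bits
    # L(data | model): Gaussian code length for the residuals (proxy for log|S|).
    sigma2 = max(np.mean(residuals ** 2), 1e-12)
    data_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * sigma2)
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 3 * x**2 - x + rng.normal(scale=0.1, size=x.size)   # true generating model is quadratic
best = min(range(1, 10), key=lambda d: mdl_score(x, y, d))
print("degree selected by the two-part code:", best)     # typically 2
```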

Further, a generalization of “length” in complexity definitions allows for the assignment of non-uniform costs to symbols, yielding so-called generalized length functions. This can connect complexity-based regularization with the entropy of non-uniform distributions and support new regularizers in ML, e.g., for probabilistic modeling where the cost of features or symbols may differ (Fraize et al., 2016).
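
A minimal sketch of this idea follows; the symbol costs are hypothetical, chosen as Shannon code lengths of a non-uniform distribution to make the entropy connection explicit.

```python
import math

def generalized_length(s: str, costs: dict) -> float:
    """Generalized length: sum of per-symbol costs instead of one bit per symbol."""
    return sum(costs[ch] for ch in s)

# Hypothetical non-uniform symbol distribution; costs are Shannon code lengths -log2 p.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
costs = {ch: -math.log2(p) for ch, p in probs.items()}

print(generalized_length("aabac", costs))  # 1 + 1 + 2 + 1 + 2 = 7.0 bits
```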

The extension to categorical structures (categories, functors, natural transformations) introduces a programming language (“Sammy”) for counting the minimal operations to build such structures, providing a measure of categorical model complexity (Yanofsky, 2013). This is conceptually aligned with modern ML frameworks that see compositionality and structure as central.

4. Practical Approximations and Applications in Machine Learning

As Kolmogorov complexity is uncomputable due to the halting problem (Vitanyi, 2020), practical machine learning relies on approximations:

  • Compression-based proxies: Empirical compressors (Lempel-Ziv, BDM, IMP2-based CTM methods) serve as proxies, with compressed lengths or coding-theorem based estimates used to rank data/model complexity (Flood et al., 2020, Leyva-Acosta et al., 30 Jul 2024).
  • Regularization: Estimated Kolmogorov complexity terms can be included as differentiable regularizers (e.g., for link prediction in graphs), guiding models toward simpler generative structures and potentially boosting generalization via Occam’s razor effects.
  • Similarity and clustering: The normalized compression distance (NCD), derived from Kolmogorov complexity, is a universal, feature-free measure for data clustering and similarity (Nasution, 2012); a minimal sketch appears after this list.
  • Learning with model selection/MDL: Many PAC-learning and MDL results tie generalization error and sample complexity to the description length (Kolmogorov complexity) of hypotheses (Pinon et al., 2021, Epstein, 2022). Functions with succinct descriptions can be learned from fewer samples; models “closer” to incompressible data require more data to generalize.
  • Algorithmic bias in learned representations: Recent investigations show that state-of-the-art neural networks—both pre-trained and even randomly initialized—tend to output or favor low Kolmogorov complexity sequences over incompressible ones, reflecting a universal simplicity bias aligned with real-world data structure (Goldblum et al., 2023).
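
As an illustration of the compression-based proxies and the NCD mentioned above, the sketch below implements the normalized compression distance with zlib standing in for an ideal compressor (an approximation; applications often use stronger compressors such as bzip2 or PPM variants).

```python
import zlib

def C(data: bytes) -> int:
    # Compressed length as a computable stand-in for Kolmogorov complexity.
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy cat " * 20
c = bytes(range(256)) * 4

print(ncd(a, b))  # small: the two texts share most of their structure
print(ncd(a, c))  # closer to 1: little shared structure
```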

5. Hierarchies, Model Expressiveness, and Complexity Control

The Kolmogorov complexity of model parameters (notably real weights in analog and evolving neural networks) has direct implications for the computational power and expressiveness of learning systems. By stratifying networks based on the Kolmogorov complexity of their weights (or stochastic sources), one obtains infinite hierarchies of function classes between classical complexity classes (e.g., P and P/poly, or BPP and BPP/log*) (Cabessa et al., 2023). The “information content” in real parameters acts as advice, with more complex weights admitting richer computational classes.

In reinforcement learning and planning, incorporating Kolmogorov complexity directly into reward objectives leads to policies that not only maximize reward but also minimize the algorithmic complexity of action sequences (Stefansson et al., 2021). This enables principled simplicity-regularized control strategies.
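
A minimal sketch of this idea, not the formulation used in the cited work: discrete actions are assumed to be encoded as small integers, a zlib-compressed action sequence serves as the complexity proxy, and the weight lam is illustrative.

```python
import zlib

def complexity_regularized_return(rewards, actions, lam=0.1):
    """Environment return minus a compression-based proxy for the algorithmic
    complexity of the executed action sequence (lam trades reward vs. simplicity)."""
    base_return = sum(rewards)
    action_bytes = bytes(actions)                      # assumes action indices in 0..255
    complexity_proxy = len(zlib.compress(action_bytes, 9))
    return base_return - lam * complexity_proxy

# With comparable raw returns, the repetitive action sequence scores higher.
print(complexity_regularized_return([1.0] * 20, [0, 1] * 10))
print(complexity_regularized_return([1.0] * 20, [0, 1, 3, 2, 0, 2, 1, 3, 3, 0,
                                                 1, 2, 0, 3, 2, 1, 0, 0, 3, 1]))
```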

6. Theoretical Limits and Extensions

Kolmogorov complexity is fundamentally incomputable. Approximations (e.g., via Coding Theorem Method) are meaningful only up to additive constants, which can be non-trivial for short strings. Experimental studies highlight that “local” agreement of complexity orderings (within fixed-length outputs) is not guaranteed across different models, whereas “global” regularities such as simplicity/incompressibility bias persist (Leyva-Acosta et al., 30 Jul 2024, Vitanyi, 2020). For specific classes (e.g., fixed models or bounded resources), polynomial-time computable lists or probabilistic variants (randomized time-bounded Kolmogorov complexity) provide rigorous relaxations and are being actively investigated with applications to average-case complexity, cryptography, and learning (Lu et al., 2022).

7. Connections to Contemporary Architectures: KANs and Scientific ML

Kolmogorov–Arnold Networks (KANs) and variants leverage the Kolmogorov–Arnold representation theorem to express high-dimensional functions with minimal complexity, using sums and compositions of univariate transformations and interpretable basis expansions (Toscano et al., 21 Dec 2024, Faroughi et al., 30 Jul 2025). The architectural structure of KANs and KKANs mirrors minimal description arguments, achieving parameter efficiency, interpretable decompositions, and robust learning across scientific ML tasks. Universal approximation properties tie directly to compressibility, with learning dynamics (monitored via information bottleneck and geometric complexity) showing links between signal-to-noise ratio (SNR), model complexity, and generalization.
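
To make the structural idea concrete, the sketch below implements a single KAN-style layer in which every edge applies a learnable univariate function and node outputs are plain sums. A Gaussian radial basis replaces the B-spline parameterization used in actual KAN implementations, and all names and shapes are illustrative.

```python
import numpy as np

def kan_layer(x, coeffs, centers, width=1.0):
    """One KAN-style layer: edge (i -> j) applies a learned univariate function
    phi_ji(x_i) (a radial-basis expansion here), and node j sums over incoming edges.
    Shapes: x (d_in,), coeffs (d_out, d_in, n_basis), centers (n_basis,)."""
    basis = np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)  # (d_in, n_basis)
    edge_out = np.einsum("jik,ik->ji", coeffs, basis)                # (d_out, d_in)
    return edge_out.sum(axis=1)                                      # (d_out,)

rng = np.random.default_rng(0)
d_in, d_out, n_basis = 3, 2, 8
x = rng.normal(size=d_in)
coeffs = 0.1 * rng.normal(size=(d_out, d_in, n_basis))
centers = np.linspace(-2.0, 2.0, n_basis)
print(kan_layer(x, coeffs, centers))
```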

Across the reviewed studies, KANs frequently outperform MLPs in accuracy, convergence, and capturing relevant functional structure, especially for complex, high-frequency, or physically structured data. Challenges remain in scalability, hyperparameter tuning, and full theoretical analysis, but the Kolmogorov framework provides unifying guidance for architecture and regularization design in future scientific ML systems.


Kolmogorov complexity offers a rigorous, unifying framework for quantifying and regularizing information content, guiding model selection, clustering, regularization, and architectural design in machine learning. Its extensions into algebraic, probabilistic, categorical, and continuous domains expand its analytical scope, while practical approximations and recent architecture designs (such as KANs) demonstrate its increasing integration into modern machine learning methodologies.
