The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning (2304.05366v3)
Abstract: No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets have high complexity, real-world problems disproportionately generate low-complexity data, and we argue that neural network models share this same preference, formalized using Kolmogorov complexity. Notably, we show that architectures designed for a particular domain, such as computer vision, can compress datasets on a variety of seemingly unrelated domains. Our experiments show that pre-trained and even randomly initialized LLMs prefer to generate low-complexity sequences. Whereas no free lunch theorems seemingly indicate that individual problems require specialized learners, we explain how tasks that often require human intervention such as picking an appropriately sized model when labeled data is scarce or plentiful can be automated into a single learning algorithm. These observations justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models.
Summary
- The paper shows that real-world data has low Kolmogorov complexity (it is highly compressible), undercutting the NFL theorems' assumption of uniformly sampled, high-complexity learning problems.
- It demonstrates that neural networks compress labeling functions effectively, linking minimized cross-entropy to a strong simplicity bias in model behavior.
- It introduces a Kolmogorov-style NFL theorem that reframes learning challenges and advocates for flexible models with soft inductive biases.
This paper, "The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning" (2304.05366), argues that while No Free Lunch (NFL) theorems suggest no single learner can solve all problems well, their underlying assumptions (like uniformly random data distributions) do not reflect real-world scenarios. Instead, the authors posit that real-world datasets possess low Kolmogorov complexity (i.e., they are highly compressible or describable by short programs), and modern machine learning models, particularly neural networks, share this inherent preference for simplicity. This alignment, they argue, explains the trend towards unifying diverse problems with a shrinking set of powerful models like transformers.
The core contributions and practical insights can be summarized as follows:
- Real-World Data is Simple (Low Kolmogorov Complexity):
The paper demonstrates that typical machine learning datasets are far from random and are highly compressible. For instance, text datasets like Amazon Review Full and audio datasets like LibriSpeech can be compressed significantly using standard tools like bzip2. This compression provides an upper bound on their Kolmogorov complexity. The probability of observing such low complexities if the data were uniformly random is astronomically small, refuting the NFL's premise of high-complexity data.
- Practical Implication: This suggests that algorithms biased towards simpler explanations are well-suited for real-world tasks because the data itself is structured and simple.
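As a sanity check on the compression claim above, here is a minimal sketch (illustrative data, not the paper's experiments) of using bzip2 to upper-bound Kolmogorov complexity:

```python
import bz2
import numpy as np

# A compressed size (plus the fixed decompressor length) upper-bounds
# Kolmogorov complexity. Structured data compresses; uniform noise does not.
structured = b"the product arrived quickly and works great. " * 2000
random_bytes = np.random.default_rng(0).integers(
    0, 256, size=len(structured), dtype=np.uint8).tobytes()

for name, data in [("structured text", structured), ("uniform random", random_bytes)]:
    ratio = len(bz2.compress(data)) / len(data)
    print(f"{name}: compressed to {ratio:.1%} of original size")
# Structured text shrinks dramatically; uniform random bytes do not
# (bzip2 typically makes them slightly larger).
```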
- Neural Networks as Compressors of Labeling Functions: The authors show that neural networks can act as effective compressors for dataset labels Y given inputs X. The negative log-likelihood minimized during training is directly related to the code length required to describe the labels using the model, which the paper formalizes as

$$\frac{1}{n} K(Y \mid X) \le \mathrm{CE} \cdot \ln 2 + \frac{1}{n}\left( K(p) + 2\log_2 K(p) + c \right)$$

where $K(Y \mid X)$ is the Kolmogorov complexity of the labels given the inputs, CE is the cross-entropy, $n$ is the dataset size, $K(p)$ is the complexity of the model $p$, and $c$ is a constant.
- Implementation: This concept is demonstrated by compressing labels of tabular datasets using MLPs and image datasets (CIFAR-10, CIFAR-100) using CNNs. The compressed sizes are significantly smaller than naive encodings, indicating the models capture underlying structure.
- Supporting figure (left and middle panels of the paper's PAC-bounds figure): compressed label sizes for tabular and image data show that the models compress labels far more effectively than naive encoding schemes.
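To make the bound concrete, the arithmetic below plugs hypothetical numbers into a two-part code of this kind; the dataset size, per-example cross-entropy, and compressed model size are all assumptions for illustration, not the paper's measurements:

```python
import math

n = 50_000      # e.g., CIFAR-10 training labels
ce_nats = 0.15  # assumed per-example training cross-entropy (in nats)
K_p = 50_000    # assumed compressed description length of the model, in bits

# Two-part code: per-label code length under the model plus the model itself.
label_bits = n * ce_nats / math.log(2) + K_p + 2 * math.log2(K_p)  # nats -> bits
naive_bits = n * math.log2(10)  # uniform code over 10 classes

print(f"bound: {label_bits:,.0f} bits vs naive: {naive_bits:,.0f} bits")
# ~61k bits vs ~166k bits under these assumed numbers: the trained model
# describes the labels in far fewer bits than a class-uniform code.
```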
- A Kolmogorov-Style No Free Lunch Theorem:
A new NFL theorem is derived (Theorem 1) which directly links the impossibility of learning on uniformly random data to its incompressibility (high Kolmogorov complexity). It states that, with high probability, any classifier p of bounded complexity K(p) will have cross-entropy CE(p) close to that of random guessing on a large, uniformly sampled dataset.
- Practical Implication: This reframes NFL: learning is hard on random, incompressible data, but real data isn't like that. Learning is possible on compressible, low-complexity data.
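A quick simulation conveys the intuition behind the theorem (a sketch of the phenomenon, not the paper's proof): against uniformly random labels, any fixed classifier's accuracy concentrates tightly at chance level.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 10, 10_000, 1_000

# Fix any classifier (here: always predict class 0). On uniformly random
# labels it is correct with probability exactly 1/k per example.
accuracies = [(rng.integers(0, k, size=n) == 0).mean() for _ in range(trials)]
print(f"accuracy: {np.mean(accuracies):.4f} +/- {np.std(accuracies):.4f}")
# ~0.1000 +/- 0.0030 for k=10: chance level, regardless of the classifier.
```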
- Neural Networks Exhibit a Low-Complexity Bias:
The paper presents several experiments to show that neural networks inherently prefer simpler solutions.
- Cross-Domain Generalization via Simplicity: CNNs, designed for vision, were trained on tabular datasets (features reshaped into images). They generalized well, and PAC-Bayes generalization bounds based on model compressibility could explain this performance (right panel of the paper's PAC-bounds figure). This suggests CNNs have a generic simplicity bias applicable even to data without spatial structure.
- Practical Implication: The inductive biases of models like CNNs might be more general than their original design domain suggests, driven by a fundamental preference for low complexity.
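To ground the setup of the cross-domain experiment above, here is a minimal sketch of feeding tabular features to a CNN; the zero-pad-and-reshape recipe is an assumption for illustration and may differ from the paper's exact preprocessing:

```python
import numpy as np

def tabular_to_image(features: np.ndarray, side: int = 8) -> np.ndarray:
    """Zero-pad a feature vector and reshape it into a one-channel square
    'image' so that a standard CNN can consume tabular data."""
    assert features.size <= side * side
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: features.size] = features
    return padded.reshape(1, side, side)  # (channels, height, width)

x = np.random.default_rng(0).standard_normal(30)  # 30 tabular features
print(tabular_to_image(x).shape)  # (1, 8, 8)
```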
- GPT-3 Prefers Simpler Sequences: GPT-3 models were evaluated on their ability to assign probabilities to integer sequences generated by expression trees of varying complexity (number of operators). The results show that GPT-3 variants assign exponentially higher probabilities to sequences generated by simpler trees, and larger, more powerful GPT-3 models (e.g., Davinci) exhibit this bias more strongly.
```
// Pseudocode for generating sequences and testing GPT-3
function generate_sequences(max_complexity):
    sequences_by_complexity = map()
    for complexity from 0 to max_complexity:
        trees = enumerate_expression_trees(complexity)  // e.g., using i, 2, +, *, //
        for tree in trees:
            sequence = evaluate_tree(tree)
            add sequence to sequences_by_complexity[complexity]
    return sequences_by_complexity

function test_gpt3_preference(sequences_by_complexity):
    for complexity in sequences_by_complexity:
        for sequence in sequences_by_complexity[complexity]:
            tokenized_sequence = tokenize(sequence)  // e.g., "1, 2, 3" -> [" 1", ",", " 2", ",", " 3"]
            log_prob = gpt3_model.get_log_probability(tokenized_sequence)
            record(complexity, log_prob)
    plot average log_prob vs. complexity
```
- Randomly Initialized LLMs Also Prefer Low Complexity: Even randomly initialized GPT-2 models, before any training, show a preference for generating low-complexity binary sequences (defined by the length of the shortest repeating bitstring). Pre-trained models show an even stronger bias. This suggests the simplicity bias can be partly architectural.
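The complexity measure used here, the length of the shortest repeating bitstring, can be computed directly; below is a minimal sketch (the function name is ours, and this is one natural reading of the measure):

```python
def shortest_generator(bits: str) -> int:
    """Length of the shortest bitstring that reproduces `bits` when
    repeated (and truncated to length len(bits))."""
    for period in range(1, len(bits) + 1):
        if all(bits[i] == bits[i % period] for i in range(len(bits))):
            return period
    return len(bits)

print(shortest_generator("01010101"))  # 2 ("01" repeated)
print(shortest_generator("00100010"))  # 4 ("0010" repeated)
```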
- Automating Model Selection with a Simplicity Bias:
The paper argues against the common notion (often attributed to NFL) that practitioners must manually select specialized learners for each task.
- Generalization of Meta-Learners: Selecting among a large pool of models (even millions) using cross-validation can still generalize well. Finite hypothesis bounds show the generalization gap depends on the logarithm of the number of hypotheses, making selection feasible with moderately sized validation sets.
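The arithmetic is the standard Hoeffding-plus-union-bound result for finite hypothesis classes (a textbook sketch, not necessarily the paper's exact statement): with probability at least $1-\delta$ over $n$ i.i.d. validation examples, every one of $M$ candidate models $h_1, \dots, h_M$ satisfies

$$\left| \widehat{\mathrm{err}}(h_j) - \mathrm{err}(h_j) \right| \le \sqrt{\frac{\ln(2M/\delta)}{2n}}.$$

For instance, $M = 10^6$ candidates validated on $n = 10^4$ examples with $\delta = 0.05$ give a worst-case gap of only about $0.03$.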
- One Model for Varied Data Sizes: A single, flexible model incorporating a "simplicity bias" can perform well across different dataset sizes.
- Example 1 (Polynomial Regression): A high-degree polynomial with Tikhonov regularization (penalizing higher-degree coefficients more) performs comparably to low-degree polynomials on small datasets and high-degree polynomials on large datasets.
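A minimal numpy sketch of this regularization scheme; the degree, penalty schedule, and data are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_poly(x, y, degree=30, lam=1e-3):
    """Polynomial least squares with a Tikhonov penalty that grows with
    each coefficient's degree, so high-degree terms are used only when
    the data demand them."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    penalties = lam * np.arange(degree + 1) ** 2   # heavier penalty at higher degree
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)

x = rng.uniform(-1, 1, 20)  # small dataset
y = np.sin(3 * x) + 0.1 * rng.standard_normal(20)
w = fit_poly(x, y)
# With 20 points the penalty keeps high-degree coefficients near zero (an
# effectively low-degree fit); with thousands of points the same model
# recovers higher-degree structure.
```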
- Example 2 (Neural Networks): A combined model using logits from a small network (GoogLeNet) and a large network (ViT-B/16 or Swin-B), with an ℓ2 penalty encouraging the use of the smaller network, achieves strong performance on both small (CIFAR) and large (ImageNet) datasets, as the paper's combined-model figure shows.
```python
# Pseudocode for the combined model
logits_small_model = googlenet(input)
logits_large_model = vit(input)
c = learnable_parameter_initialized_near_zero()  # penalty below keeps 1 - c close to 1
combined_logits = c * logits_large_model + (1 - c) * logits_small_model
loss = cross_entropy_loss(combined_logits, labels) + lam * c**2  # lam: penalty strength
```
- Practical Implication: Instead of meticulously choosing model sizes, practitioners could use larger, flexible models with built-in mechanisms (like regularization) that favor simpler solutions when data is scarce, but allow complexity when data is abundant.
Key Conceptual Insights and Takeaways:
- NFL Theorems are Misapplied: The NFL theorems' assumptions about data distributions (e.g., uniform over all possibilities) are unrealistic. Real-world problems are highly structured (low Kolmogorov complexity). Therefore, NFL should not be used to argue against general-purpose learners or the need for highly specialized inductive biases for every problem.
- Embrace Flexibility with Soft Biases: Good general-purpose models should have a flexible hypothesis space combined with "soft" inductive biases (encouraging, not restricting) towards common structures (like symmetries or simplicity). This is preferable to hard architectural constraints.
- Neural Networks' Inherent Simplicity Bias: NNs, including LLMs, naturally prefer low-complexity solutions. This bias, combined with their flexibility, contributes to their success as general-purpose problem solvers.
- Single Model Principle: In principle, the same flexible model with a simplicity bias can be effective for both small and large datasets, challenging the conventional wisdom of needing different model capacities for different data regimes.
- Architecture Over Optimizer Implicit Bias: The paper suggests that architectural design, which makes generalizable solutions more accessible in the loss landscape, is more critical for generalization than the implicit biases of stochastic optimizers.
In essence, the paper provides a compelling argument, supported by theoretical derivations and empirical evidence, that the success of modern ML, especially the rise of general-purpose models like transformers, is deeply connected to the low-complexity nature of real-world data and the inherent simplicity bias of these models. This perspective shifts the focus from the limitations suggested by NFL theorems to the opportunities presented by the shared structure between problems and models.