Information Complexity in Concept Learning
- Information complexity in concept learning is defined via measures such as mutual information and Shannon entropy to quantify the minimal information about the training data that a learning algorithm must encode in its output.
- It bridges classical sample complexity with information leakage, establishing theoretical bounds and practical implications across symbolic, statistical, and deep learning models.
- The framework unifies cognitive and statistical perspectives, predicting human learning difficulty and guiding efficient algorithm design in both conventional and neural settings.
Information complexity in concept learning quantifies, in rigorous information-theoretic terms, the minimal amount of information about the training data that a learning algorithm must encode in its output hypothesis in order to correctly learn a target concept. This notion extends classical complexity measures by capturing not just sample complexity but the informational burden of learning, with substantial implications for generalization, sample efficiency, and the behavior of learning algorithms—both symbolic and deep—across domains. Information complexity has also emerged as a predictive metric for both machine and human concept learning difficulty, drawing formal connections between statistical learning theory, information theory, cognitive science, and the empirical performance of modern learning systems.
1. Formal Definitions and Metrics
Information complexity is primarily operationalized via the mutual information between the sample $S$ and the hypothesis $A(S)$ output by a learner $A$:
$$I(S; A(S)) \;=\; H(A(S)) \;-\; H(A(S) \mid S),$$
where $S$ is a realizable training set and $H$ denotes Shannon entropy. This measures the number of bits "leaked" about $S$ in the learner's output, and thus lower-bounds any lossy compression or code-length scheme for the hypothesis (Nachum et al., 2018, Bassily et al., 2017). Alternate definitions, such as conditional mutual information (CMI) and Kolmogorov-style task complexity, appear in specialized contexts to accommodate distributional or task-specific structure (Haghifam et al., 2020, Achille et al., 2019).
For a concept class $\mathcal{C}$ and sample size $m$, the information complexity is
$$\mathrm{IC}(\mathcal{C}, m) \;=\; \inf_{A}\,\sup_{D}\; I(S; A(S)),$$
where the infimum is over all (proper, consistent) learning algorithms $A$ using $m$ examples, and the supremum is over all realizable distributions $D$ (Nachum et al., 2018).
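For intuition, the definition can be evaluated exactly in toy cases. The sketch below makes several purely illustrative assumptions that do not come from the cited papers: a domain of eight points, the class of one-dimensional thresholds, a fixed target concept, a uniform marginal, a sample of four examples, and a particular deterministic, proper, consistent ERM rule. It enumerates every equally likely sample and computes $I(S; A(S))$, which for a deterministic learner equals $H(A(S))$.

```python
import itertools
import math
from collections import Counter

# Toy setup (illustrative assumptions, not constructions from the cited papers):
# domain X = {0,...,7}, concept class = thresholds h_t(x) = [x >= t],
# fixed target threshold, uniform marginal over X, sample size M = 4.
DOMAIN = range(8)
TARGET_T = 3
M = 4

def label(x):
    return int(x >= TARGET_T)

def learner(sample):
    """Deterministic, proper, consistent ERM: output the smallest threshold
    consistent with the labeled sample (the left edge of the observed positives),
    or the all-negative hypothesis if no positive example was seen."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else max(DOMAIN) + 1

def entropy_bits(counter, total):
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

# Enumerate every (equally likely) sample S of size M and record the output A(S).
# Because the learner is deterministic, H(A(S) | S) = 0, so I(S; A(S)) = H(A(S)).
outputs = Counter()
n_samples = 0
for xs in itertools.product(DOMAIN, repeat=M):
    sample = tuple((x, label(x)) for x in xs)
    outputs[learner(sample)] += 1
    n_samples += 1

print(f"I(S; A(S)) = H(A(S)) = {entropy_bits(outputs, n_samples):.3f} bits")
print("distribution over output thresholds:", dict(outputs))
```

Sweeping the target concept or the marginal in this sketch shows how the leaked bits depend on the distribution, which is exactly the dependence the worst-case definition above removes by taking a supremum.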
Information complexity is also instantiated as "Boolean complexity" in human and in-context machine concept learning, counting the minimal number of logical operators required in a symbolic rule consistent with the data (Wang et al., 2024):
$$\mathrm{BC}(c) \;=\; \min_{\phi \,\equiv\, c} \#\mathrm{ops}(\phi),$$
where $\phi$ ranges over well-formed formulas in a fixed grammar that are equivalent to the concept $c$, and $\#\mathrm{ops}(\phi)$ counts the logical operators in $\phi$.
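For small concepts this measure can be computed by brute force. The sketch below assumes an illustrative grammar (NOT, AND, OR over three Boolean variables, with every operator counted equally); specific studies may count literals instead or use different primitives. It enumerates all functions expressible with a given number of operators and reports the minimum needed for a target concept.

```python
from itertools import product

# Illustrative brute-force computation of Boolean complexity: the minimum number of
# NOT/AND/OR operators in a formula over K variables realizing a target concept.
# The grammar and the equal-cost operator count are assumptions for illustration.
K = 3
ASSIGNMENTS = list(product([0, 1], repeat=K))

def table(fn):
    """Encode a Boolean function as a bitmask over all 2**K assignments."""
    return sum(fn(a) << i for i, a in enumerate(ASSIGNMENTS))

def boolean_complexity(target, max_ops=8):
    full = (1 << len(ASSIGNMENTS)) - 1
    # by_ops[n] holds every function expressible by some formula with exactly n operators
    by_ops = [{table(lambda a, j=j: a[j]) for j in range(K)}]
    for n in range(max_ops + 1):
        if target in by_ops[n]:
            return n
        new = {full ^ f for f in by_ops[n]}        # NOT applied to n-operator formulas
        for i in range(n + 1):                     # AND/OR over (i, n-i)-operator splits
            for f in by_ops[i]:
                for g in by_ops[n - i]:
                    new.add(f & g)
                    new.add(f | g)
        by_ops.append(new)
    return None  # not expressible within max_ops operators

conjunction = table(lambda a: a[0] & a[1])   # "x0 AND x1"
exclusive_or = table(lambda a: a[0] ^ a[1])  # "x0 XOR x1" (XOR is not a primitive here)
print("complexity(x0 AND x1) =", boolean_complexity(conjunction))   # 1 operator
print("complexity(x0 XOR x1) =", boolean_complexity(exclusive_or))  # 4 operators in this grammar
```

The gap between the conjunction and the exclusive-or mirrors the general finding that logically incompressible concepts are harder to learn.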
"Information complexity" in cognitive science is further defined via conditional entropy reductions when subsets of stimulus dimensions are specified (Pape et al., 2014), leading to two canonical aggregations: minimal and average uncertainty.
2. Information Complexity in Symbolic and Statistical Concept Learning
In supervised learning theory, information complexity provides a sharp characterization of the "compression" required by any proper, consistent learner for a hypothesis class of given combinatorial dimension. Fundamental results show that for classes of VC dimension $d$ over a large finite domain $X$, the information complexity can scale as $\Omega(d \log \log |X|)$ bits (Nachum et al., 2018). Practical implications include:
- No proper, consistent learner, however sample-efficient, can output hypotheses carrying less than this mutual information about the data (a short derivation connecting mutual information to output description length follows this list).
- Empirical risk minimization (ERM) and sample compression schemes are tightly linked to information complexity, yet there exist classes exhibiting a strict separation between compression length and information complexity (Bassily et al., 2017).
- For VC classes with infinite "hollow-star number," the information leakage (as measured by CMI) of any proper learner must grow with the number of examples; consequently, it is impossible to uniformly bound information leakage by a function of the VC dimension alone across all proper learners (Haghifam et al., 2020).
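A short chain of inequalities makes the "compression" reading of these bounds concrete (as referenced in the first bullet above): for any learner $A$ whose output ranges over at most $2^b$ distinct hypotheses,
$$I(S; A(S)) \;\le\; H(A(S)) \;\le\; \log_2 \big|\mathrm{range}(A)\big| \;\le\; b,$$
so a lower bound on the mutual information is simultaneously a lower bound on the number of bits needed to describe the output hypothesis. The first inequality, $I(S; A(S)) \le H(A(S))$, holds for randomized learners as well.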
Distribution-dependent settings admit sharper bounds: for fixed marginals, one can often construct deterministic, proper, and consistent learners whose information leakage remains small, independent of the domain size, for most concepts (in an average-case sense) (Bassily et al., 2017, Nachum et al., 2018).
3. Information Complexity in Cognitive and Human Learning
The notion of information complexity as a predictor of human concept learning difficulty has a longstanding tradition in cognitive science. In classic rule-based frameworks, Boolean complexity, the minimal number of logical operators needed to specify a concept, predicts adult human learning curves for separable (factorial) category types (Wang et al., 2024, Pape et al., 2014). However, this approach fails in more general learning contexts.
"Information complexity," as formalized via Shannon entropy and conditional entropy reductions, generalizes predictive power across paradigms. Specifically, by considering the lowering of output uncertainty when subsets of relevant features are fixed, two aggregation metrics emerge:
- Minimal uncertainty: Aggregates the minimal conditional entropies, matching the paradigm-specific learning-difficulty ordering for adults and symbolic learners.
- Average uncertainty: Aggregates the average conditional entropies, capturing the general ordering for children, monkeys, and integral stimuli.
This framework accurately predicts concept acquisition difficulty in the classical Shepard-Hovland-Jenkins (SHJ) tasks and extends to higher-dimensional logical categories (Pape et al., 2014). Boolean complexity and other logical metrics (e.g., GIST) are generally predictive only for the former (paradigm-specific) orderings.
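One plausible way to operationalize these two metrics can be written down directly. The sketch below assumes a uniform stimulus distribution, an XOR-style ("Type II") concept over three binary dimensions, and a per-subset-size aggregation (minimum versus mean of the conditional label entropies, summed over subset sizes); this is an illustrative reading, not necessarily the exact formulation of Pape et al. (2014).

```python
from itertools import product, combinations
from math import log2

# Illustrative conditional-entropy "uncertainty" metrics for a Boolean concept over
# D binary stimulus dimensions. The aggregation (per-size min vs. mean of
# H(label | fixed dimensions), summed over sizes) is an assumed reading of the idea.
D = 3
STIMULI = list(product([0, 1], repeat=D))

def concept(x):
    # XOR of the first two dimensions (an SHJ "Type II"-style concept)
    return x[0] ^ x[1]

def cond_entropy(fixed_dims):
    """H(label | stimulus dimensions in fixed_dims) under a uniform stimulus distribution."""
    groups = {}
    for x in STIMULI:
        groups.setdefault(tuple(x[i] for i in fixed_dims), []).append(concept(x))
    h = 0.0
    for labels in groups.values():
        p1 = sum(labels) / len(labels)
        h_group = 0.0 if p1 in (0.0, 1.0) else -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))
        h += (len(labels) / len(STIMULI)) * h_group
    return h

minimal_uncertainty, average_uncertainty = 0.0, 0.0
for k in range(1, D + 1):
    entropies = [cond_entropy(dims) for dims in combinations(range(D), k)]
    minimal_uncertainty += min(entropies)
    average_uncertainty += sum(entropies) / len(entropies)

print(f"minimal uncertainty = {minimal_uncertainty:.3f} bits")
print(f"average uncertainty = {average_uncertainty:.3f} bits")
```

Swapping in other concepts (e.g., a single-feature rule versus a three-way exclusive-or) lets one compare how the two aggregations rank concepts of different logical structure.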
4. Information Complexity in Deep Learning, In-Context Learning, and Task Transfer
Recent work applies information complexity measurements to both deep learning and in-context learning in LLMs:
- LLMs tested with synthetic concept learning tasks demonstrate a steep, statistically significant decline in in-context learning accuracy as Boolean complexity (minimal operator count) increases, with strongly negative complexity-accuracy correlations across multiple LLM architectures and scales (Wang et al., 2024); a schematic version of such an evaluation is sketched after this list.
- LLMs thus reliably "prefer" simpler concepts, evidencing a bias toward low-complexity formulas paralleling the "simplicity bias" in humans and trained neural networks (Wang et al., 2024).
- This suggests that the ease of in-context concept learning in such models is modulated by formally quantifiable rule complexity, and that formal minimal-description-length metrics developed for cognition provide robust, cross-modal predictors of machine learning performance.
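A schematic version of such an evaluation is easy to write down. In the sketch below, `query_llm` is a hypothetical stand-in for an actual model call (replaced here by a trivial majority-vote baseline so the script runs), and the concept set, prompt format, and operator counts are illustrative choices rather than those of Wang et al. (2024).

```python
import random
from statistics import correlation  # Pearson r (Python 3.10+)

# Sketch: few-shot prompts for concepts of increasing Boolean complexity, correlated
# against in-context accuracy. All concepts, prompt formats, and the model stub are
# illustrative assumptions, not the materials of the cited study.
CONCEPTS = {  # rule text -> (predicate, operator count)
    "x0":                             (lambda x: x[0],              0),
    "x0 AND x1":                      (lambda x: x[0] & x[1],       1),
    "x0 AND NOT x1":                  (lambda x: x[0] & (1 - x[1]), 2),
    "(x0 OR x1) AND NOT (x0 AND x1)": (lambda x: x[0] ^ x[1],       4),
}

def make_prompt(predicate, n_shots=8, n_dims=3):
    lines = []
    for _ in range(n_shots):
        x = [random.randint(0, 1) for _ in range(n_dims)]
        lines.append(f"input: {x} -> label: {predicate(x)}")
    query = [random.randint(0, 1) for _ in range(n_dims)]
    lines.append(f"input: {query} -> label:")
    return "\n".join(lines), predicate(query)

def query_llm(prompt):
    """Hypothetical model call. Replace with a real LLM API; this stand-in answers
    with the majority label among the in-context examples so the script is runnable."""
    labels = [int(line.rsplit(":", 1)[1]) for line in prompt.splitlines()[:-1]]
    return int(sum(labels) >= len(labels) / 2)

def accuracy(predicate, n_trials=200):
    hits = 0
    for _ in range(n_trials):
        prompt, target = make_prompt(predicate)
        hits += int(query_llm(prompt) == target)
    return hits / n_trials

complexities = [ops for _, ops in CONCEPTS.values()]
accuracies = [accuracy(pred) for pred, _ in CONCEPTS.values()]
print("Pearson r(complexity, accuracy):", round(correlation(complexities, accuracies), 3))
```

With a real model call substituted for the stub, the same loop yields the complexity-accuracy correlation discussed above.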
In the context of deep neural networks and transfer learning, Achille and Soatto (Achille et al., 2019) develop a Kolmogorov-style information complexity framework of the form
$$C(\mathcal{D}) \;=\; \min_{p}\,\big[\, L_{\mathcal{D}}(p) + \beta\, K(p)\,\big],$$
where $L_{\mathcal{D}}(p)$ is the empirical cross-entropy of the predictive model $p$ on the dataset $\mathcal{D}$, $K(p)$ is the Kolmogorov complexity of the model, and $\beta$ trades off data fit against description length. This approach formalizes the learning-memorization trade-off and enables a measure of task distance via asymmetric Kolmogorov complexity, shedding light on the cost of task transfer and the structure of task-similarity space.
Classical and Bayesian forms of information complexity in deep learning relate to compression (minimum description length), mutual information between data and weights (Shannon), and the trace of the Fisher information matrix (Achille et al., 2019).
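As a self-contained illustration of the last of these quantities, the sketch below estimates the trace of the Fisher information matrix for a small logistic-regression model on synthetic data. Treating this scalar as a complexity proxy is a deliberate simplification of the deep-network analyses in the cited work, and the data, dimensionality, and weights are arbitrary choices.

```python
import numpy as np

# Illustrative complexity proxy: trace of the Fisher information matrix of a
# logistic-regression model, estimated on synthetic data. For p(y=1|x) = sigmoid(w.x),
# the Fisher matrix w.r.t. w is F = E_x[ s(x)(1 - s(x)) x x^T ] with s(x) = sigmoid(w.x),
# so tr(F) = E_x[ s(x)(1 - s(x)) ||x||^2 ].
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))   # synthetic inputs (arbitrary)
w = rng.normal(size=d)        # model weights (arbitrary stand-in for trained parameters)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s = sigmoid(X @ w)
fisher_trace = np.mean(s * (1.0 - s) * np.sum(X ** 2, axis=1))
print(f"estimated tr(F) = {fisher_trace:.3f}")
```

In the referenced framework, the analogous quantity is computed for the weights of a deep network and tracked over training, rather than for a fixed linear model.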
5. Combinatorial Parameters and Foundations
Combinatorial dimensions such as VC dimension and Littlestone dimension play a central role in determining the information complexity of learning:
- For proper, consistent PAC learners of classes with VC dimension $d$, the information complexity can necessitate leakage of $\Omega(d \log \log |X|)$ bits over a large finite domain $X$; no proper learner can escape this lower bound for worst-case VC classes (Nachum et al., 2018, Haghifam et al., 2020).
- Average-case information complexity, however, remains bounded independently of the domain size for most concepts, contrasting with the worst-case scenario and supporting the learning-as-compression intuition for randomly chosen targets (Nachum et al., 2018).
- Finite Littlestone dimension is both necessary and sufficient for a class to have finite information complexity in online learning and realizable PAC settings. There exist explicit globally stable algorithms certified to "forget" all but a bounded number of bits about the input data (a quantity controlled by the Littlestone dimension $d$), matching the optimal online mistake bound (Pradeep et al., 2022).
- For canonical cases (e.g., indicators of affine subspaces), the information complexity admits upper bounds that grow only logarithmically in the relevant dimension (Pradeep et al., 2022). A brute-force computation of both combinatorial dimensions for a toy class is sketched below.
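Both combinatorial dimensions can be computed by brute force for explicitly listed classes, which makes the quantities above concrete. The sketch below uses the class of thresholds over six points as an illustrative example.

```python
from itertools import combinations

# Brute-force VC and Littlestone dimensions of a tiny, explicitly listed concept class.
# Concepts are tuples of labels over a small finite domain; thresholds over six points
# are an illustrative choice.
DOMAIN = range(6)
CLASS = frozenset(tuple(int(x >= t) for x in DOMAIN) for t in range(len(DOMAIN) + 1))

def vc_dimension(concepts):
    d = 0
    for k in range(1, len(DOMAIN) + 1):
        for pts in combinations(DOMAIN, k):
            patterns = {tuple(c[x] for x in pts) for c in concepts}
            if len(patterns) == 2 ** k:   # this point set is shattered
                d = k
                break
        else:
            return d                      # no set of size k is shattered
    return d

def littlestone_dimension(concepts):
    concepts = frozenset(concepts)
    if len(concepts) <= 1:
        return 0 if concepts else -1      # convention for the empty class
    best = 0
    for x in DOMAIN:
        zeros = frozenset(c for c in concepts if c[x] == 0)
        ones = frozenset(c for c in concepts if c[x] == 1)
        if zeros and ones:                # a mistake can be forced at x
            best = max(best, 1 + min(littlestone_dimension(zeros),
                                     littlestone_dimension(ones)))
    return best

print("VC dimension         :", vc_dimension(CLASS))           # 1 for thresholds
print("Littlestone dimension:", littlestone_dimension(CLASS))  # 2 for this class
```

The gap between the two values for thresholds (constant VC dimension, Littlestone dimension growing with the domain size) reflects the same gap that separates distribution-free sample complexity from the online and information-complexity bounds discussed above.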
6. Practical Implications and Theoretical Consequences
Information complexity governs fundamental sample efficiency and generalization bounds. Specific results include:
- Generalization error for a $k$-bit information learner scales as $O(\sqrt{k/m})$, where $m$ is the sample size (Bassily et al., 2017); the bound is spelled out after this list. To guarantee that the empirical error approximates the population error to within $\epsilon$ with confidence $1-\delta$, a sample size on the order of $k/\epsilon^2$ (with additional dependence on $\delta$) suffices.
- There are lower bounds (e.g., for thresholds on large domains) showing that empirical risk minimizers must reveal $\Omega(\log \log |X|)$ bits, with $|X|$ the domain size, regardless of compression size (Bassily et al., 2017).
- For nontrivial concept classes, average-case information leakage can be tightly controlled by Haussler nets and minimax theorems, although the worst-case remains large (Nachum et al., 2018).
- The Kolmogorov-style structure function and Lagrangian frameworks distinguish genuine learning from memorization, with structure-function curves empirically correlating to task difficulty and generalization in real and randomized datasets (Achille et al., 2019).
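The scaling in the first bullet can be read off from the standard input-output mutual-information bound, stated here in its expected-gap form (in the style popularized by Xu and Raginsky, and closely related to the high-probability bounds of Bassily et al., 2017): for losses that are $\sigma$-subgaussian (e.g., $\sigma = 1/2$ for losses bounded in $[0,1]$),
$$\Big|\,\mathbb{E}\big[\mathrm{err}_{\mathcal{D}}(A(S)) - \mathrm{err}_S(A(S))\big]\,\Big| \;\le\; \sqrt{\frac{2\sigma^2\, I(S; A(S))}{m}},$$
where $\mathrm{err}_{\mathcal{D}}$ and $\mathrm{err}_S$ denote population and empirical error and the mutual information is measured in nats. A $k$-bit learner therefore has expected generalization gap $O(\sqrt{k/m})$; high-probability analogues, whose dependence on the confidence parameter is polynomial rather than logarithmic, underlie the sample-size statement above.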
In the context of in-context learning for LLMs, information complexity directly quantifies the "probabilistic grammars" underlying few-shot learning, suggesting both interpretability and a route to bias control via the selection of grammars and primitives (Wang et al., 2024).
7. Open Problems and Future Directions
Several open questions persist:
- For general hypothesis classes, can the exponential dependence of information complexity on the Littlestone dimension $d$ be improved to polynomial or linear in $d$? Most known upper bounds are tight only in extremal or specifically constructed cases (Pradeep et al., 2022).
- Is it possible to close the gap between average-case and worst-case information complexity in proper learning of VC classes? In practice, average-case compression is observed, but worst-case hardness persists (Nachum et al., 2018, Haghifam et al., 2020).
- Can information complexity bounds for improper learners be universally reduced to depend only on the combinatorial dimension of the class, or do class pathologies prevent unconditional improvements?
- How do specific choices of logical or algebraic primitives in rule-based settings influence the realized information complexity bias in neural systems and LLMs (Wang et al., 2024)?
- For deep learning, can continuous relaxations or differentiable approximations of information complexity be made computationally tractable for regularization or interpretability?
- How do optimization procedures (e.g., SGD, learning rate schedules) alter dynamic vs static information complexity, and are there algorithmic designs that provably minimize information leak while preserving generalization (Achille et al., 2019)?
The study of information complexity across symbolic, statistical, cognitive, and neural learning settings continues to refine the understanding of the intrinsic and practical limits of generalization, learning, and concept acquisition.