
A Minimum Description Length Approach to Regularization in Neural Networks (2505.13398v1)

Published 19 May 2025 in cs.LG and cs.CL

Abstract: State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.

Summary

  • The paper introduces a Minimum Description Length (MDL) regularization approach that balances data fit with model complexity to steer models toward perfect rule-based solutions.
  • Experiments using free-form RNNs and genetic algorithms on formal language tasks show MDL’s effectiveness in reducing cross-entropy loss deviations compared to L1 and L2 penalties.
  • The study highlights practical benefits and challenges of non-differentiable optimization, suggesting MDL could enhance systematic generalization in complex neural architectures.

This paper investigates why state-of-the-art neural networks, despite their capacity to represent perfect solutions for many problems (especially those involving formal rules), often converge to mere approximations. The authors argue that standard regularization techniques like $L_1$ and $L_2$ norm penalties, or even the absence of regularization, can actively push models away from ideal solutions. They propose Minimum Description Length (MDL) regularization as a more principled approach that successfully guides models towards these perfect solutions.

The core idea of MDL is to balance the fit to the data with the complexity of the model. The MDL objective is to minimize $|H| + |D:H|$, where:

  • $|H|$ is the encoding length (complexity) of the model (hypothesis).
  • $|D:H|$ is the description length of the data given the model, which for neural networks is equivalent to the cross-entropy (CE) loss: $-\sum_{i=1}^{n} \sum_{t=1}^{m} \log q(c_{it})$.

Unlike $L_1$/$L_2$ regularization, which only penalizes weight magnitudes, MDL considers the full information content of the network, including its architecture and the precision of its parameters. For instance, weights and biases are encoded as signed fractions. A simple fraction like $1/10$ (representing $0.1$) has a shorter description length (and is thus preferred by MDL) than a more complex fraction like $1117/50000$ (representing $0.02234$), even if the latter has a smaller magnitude. This penalizes "information smuggling" through high-precision weights.
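This preference can be made concrete with a short sketch. Assumptions: the paper follows Li & Vitanyi's integer encoding, but Elias gamma code lengths are used here as a stand-in, and `fraction_bits` is an illustrative helper, not the paper's code:

```python
import math

def elias_gamma_length(n: int) -> int:
    """Bits needed to encode a positive integer n with an Elias gamma code."""
    return 2 * int(math.log2(n)) + 1

def fraction_bits(numerator: int, denominator: int) -> int:
    """Description length of a signed fraction: sign bit plus codes for both integers."""
    return 1 + elias_gamma_length(abs(numerator)) + elias_gamma_length(denominator)

print(fraction_bits(1, 10))        # 1/10 (= 0.1): 9 bits
print(fraction_bits(1117, 50000))  # 1117/50000 (= 0.02234): 53 bits
```

Even though $0.02234$ has the smaller magnitude, its fraction costs far more bits, so MDL prefers $0.1$, whereas an $L_1$ or $L_2$ penalty would prefer $0.02234$.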

Methodology

The researchers conducted experiments using "free-form RNNs," which are general directed graphs of units with biases and weighted connections, capable of representing various tasks. They focused on next-token prediction tasks based on formal languages:

  • $a^n b^n$: Requires counting.
  • $a^n b^n c^n$: Requires two counters.
  • Dyck-1: Well-matched parentheses (counting).
  • Dyck-2: Well-matched parentheses and brackets (stack).
  • Arithmetic Syntax: Nested addition formulas (stack).
  • Toy-English: A minimal English fragment with relative clauses.

For each task, a "golden network"—an RNN manually constructed or previously discovered to perfectly solve the task by matching the true data-generating grammar—was used. These golden networks served as initializations or benchmarks.

Three experimental setups were used (illustrated in Figure 1):

  1. Experiment 1: Genetic Architecture Search: A genetic algorithm (GA) using an Island Model evolved both the architecture and parameters of RNNs. Networks were initialized with the golden network.
  2. Experiment 2: GA in Weight-Training Setting: The GA only mutated weights and biases, keeping the golden network's architecture fixed.
  3. Experiment 3: Gradient Descent (GD): Golden networks (using only differentiable activations) were trained using backpropagation with $L_1$, $L_2$, or no regularization. MDL was excluded here due to its non-differentiability.

Regularization Methods Compared:

  • MDL: $|H|$ calculated by encoding the network structure (unit count, types, activations, biases) and parameters (weights/biases as fractions using prefix-free coding).
  • $L_1$: $\lambda \sum |w_i|$
  • $L_2$: $\lambda \sum w_i^2$
  • None (with $|H|$ limit): CE loss minimization with a cap on model complexity (3x the golden network's $|H|$) to prevent uncontrolled growth in GA experiments.

Evaluation:

Performance was measured by the CE loss ($|D:H|$) on an exhaustive test set (all strings up to a length threshold, weighted by true probabilities). Results were reported as the relative gap to the analytically computed optimal score: $\Delta(\%) = \frac{|D:H| - \text{Optimal}}{\text{Optimal}} \times 100$. Smoothing ($10^{-10}$) was added to zero probabilities to avoid infinite scores.
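A minimal sketch of this metric, under two assumptions that may differ from the paper's exact scheme: description lengths are in bits (base-2 logs), and smoothing is implemented by flooring each probability at $10^{-10}$:

```python
import math

def smoothed_ce(next_symbol_probs, eps=1e-10):
    """Cross-entropy / description length |D:H| in bits, with zero
    probabilities floored at eps to avoid infinite scores."""
    return -sum(math.log2(max(p, eps)) for p in next_symbol_probs)

def relative_gap_percent(score, optimal):
    """Relative gap Delta(%) between a model's |D:H| and the analytic optimum."""
    return (score - optimal) / optimal * 100
```

For example, a model that assigns probability 0.5 where the grammar deterministically predicts the next symbol pays one extra bit for that prediction.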

Experiments and Results

Experiment 1: Genetic Architecture Search (Table \ref{tab:ga-search-results}, Figure \ref{fig:exp-1-target-plot})

  • MDL: Consistently yielded the smallest test set $|D:H|$ deviations from the optimal score across all tasks. MDL either preserved the golden network or, for manually constructed (potentially oversized) golden nets, compressed them significantly while maintaining near-optimal performance.
  • $L_1$, $L_2$, None: These methods resulted in networks with higher test set $|D:H|$ scores, often deviating significantly from the optimum. They tended to increase network complexity ($|H|$) and frequently produced "infinite" scores (before smoothing) by assigning zero probability to correct next symbols.

Experiment 2: GA in Weight-Training Setting (Table \ref{tab:ga-weight-results})

  • Even with a fixed architecture, MDL regularization outperformed $L_1$ and $L_2$. It maintained or simplified the golden network and achieved test scores closest to the optimum.
  • $L_1$, $L_2$, and no regularization often led to degraded performance on the test set compared to the initial golden network.

Experiment 3: Gradient Descent (Table \ref{tab:backprop_results})

  • For tasks with differentiable golden networks ($a^n b^n$, $a^n b^n c^n$, Dyck-1), training with GD using $L_1$, $L_2$, or no regularization consistently caused the models to drift away from the perfect golden solution, increasing the test set $|D:H|$.
  • This suggests the problem is not just the optimization algorithm but the objective function itself. Standard regularizers do not make perfect solutions local minima.

Discussion and Practical Implications

The key takeaway is that only MDL-based regularization consistently favored perfect or near-perfect solutions. Standard $L_1$/$L_2$ regularization, by focusing solely on weight magnitudes, fails to prevent overfitting or to guide the search towards truly generalizable solutions for these rule-based tasks; it ignores other aspects of complexity, such as network structure and parameter precision.

Practical Implications for Implementation:

  1. Choosing Regularization: For tasks requiring precise, rule-based generalization, MDL offers a more robust alternative to $L_1$/$L_2$. If a system fails to learn the underlying rules despite having the architectural capacity, the regularization method may be a key factor.
  2. MDL Implementation:
    • Model Complexity ($|H|$): Requires defining an encoding scheme for the network. This involves:
      • Encoding the number of units.
      • For each unit: its type, activation function, and bias (as a fraction).
      • Encoding connections (source, target, recurrent/forward).
      • Encoding weights as signed fractions (sign bit, numerator, denominator using prefix-free codes like Elias gamma coding). The paper uses the scheme from Li & Vitanyi (2008) for encoding integers in fractions.
        // Pseudocode for fractional weight encoding length
        function get_fraction_encoding_length(numerator, denominator):
          // Using a prefix-free code for integers (e.g., Elias gamma)
          length_num = prefix_free_encode_length(abs(numerator))
          length_den = prefix_free_encode_length(denominator)
          length_sign = 1 // for the sign bit
          return length_sign + length_num + length_den
    • Data Fit ($|D:H|$): This is the standard cross-entropy loss.
    • Objective Function: $L_{MDL} = \lambda_{H} \cdot |H| + \text{CE loss}$. The paper implicitly uses $\lambda_{H} = 1$, as per the formal definition $|H| + |D:H|$.
  3. Optimization:
    • The non-differentiable nature of the $|H|$ term (due to discrete architectural choices and the specific fractional encoding) makes MDL unsuitable for direct optimization with gradient descent.
    • Genetic algorithms or other evolutionary strategies are more appropriate, as demonstrated. The paper uses an Island Model GA.
    • If using GA:
      • Define mutation operations: adding/removing units/connections, modifying weights/biases (e.g., by small increments or re-sampling), changing activation functions.
      • The fitness function is the MDL objective.
  4. Evaluation: Focus on generalization to unseen data that tests the learned rules, not just accuracy on i.i.d. test samples. Exhaustive test sets for formal languages are ideal. Comparing against a theoretical optimum, if derivable, provides a strong benchmark.
  5. "Free-form RNNs": This architectural flexibility allows the GA to explore diverse solutions. When implementing, this means not being restricted to standard RNN/LSTM/GRU cells if the search space allows.
  6. Cognitive Plausibility: MDL aligns with human learning principles (simplicity preference), suggesting it might help build models that generalize more like humans, especially from limited data or for tasks requiring "System 2" (controlled, accurate) processing.
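The evolutionary recipe above can be sketched as a toy hill-climbing loop with the MDL objective as fitness. This is a hypothetical stand-in over an abstract model representation; the paper's Island Model GA and its mutation operators are far richer:

```python
import random

def mdl_fitness(complexity_bits, ce_loss_bits):
    # MDL objective |H| + |D:H|: lower is better.
    return complexity_bits + ce_loss_bits

def evolve(population, complexity_fn, ce_fn, mutate, generations=100):
    """Keep the fitter half each generation, refill with mutated copies.
    complexity_fn, ce_fn, and mutate are task-specific stand-ins."""
    key = lambda m: mdl_fitness(complexity_fn(m), ce_fn(m))
    for _ in range(generations):
        parents = sorted(population, key=key)[: len(population) // 2]
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return min(population, key=key)
```

In the paper's setting, each candidate is a free-form RNN, `complexity_fn` is its encoding length $|H|$, `ce_fn` is the CE loss on the training data, and mutations add or remove units and connections, perturb fractional weights, or swap activation functions.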

The authors suggest that the failures of current LLMs on tasks requiring systematic generalization might also stem from inadequate regularization. While MDL regularization has not yet been tested empirically on Transformers, it could offer similar benefits there.

Limitations and Future Work

  • Non-differentiability: The primary limitation is the difficulty of optimizing MDL with gradient-based methods, for which current hardware and software are highly optimized. Future work could explore:
    • Hardware/software for efficient non-differentiable optimization.
    • Differentiable approximations or surrogate losses for MDL.
  • Scale: Experiments were on relatively small-scale formal language tasks. Applying and evaluating MDL on larger, more complex tasks and architectures (like Transformers) is a key next step.

All code and experimental data are available at: https://github.com/taucompling/mdl-reg-approach (2505.13398).