Universal pre-training by iterated random computation (2506.20057v1)
Abstract: We investigate the use of randomly generated data for the sake of pre-training a model. We justify this approach theoretically from the perspective of algorithmic complexity, building on recent research that shows that sequence models can be trained to approximate Solomonoff induction. We derive similar, but complementary theoretical results. We show empirically that synthetically generated data can be used to pre-train a model before the data is seen. We replicate earlier results that models trained this way show zero-shot in-context learning across a variety of datasets, and that this performance improves with scale. We extend earlier results to real-world data, and show that finetuning a model after pre-training offers faster convergence and better generalization.
Summary
- The paper shows that data generated by iterated random computation can be used to pre-train models which, in expectation, perform nearly as well as the best model in the generating class, and which transfer to a variety of downstream tasks.
- It approximates class-universal distributions by sampling random LSTMs and using them to autoregressively generate synthetic sequences on which transformers are trained.
- Empirical results show non-trivial zero-shot performance that improves with scale, as well as finetuning speedups, with pre-trained models converging roughly 1M instances faster than randomly initialized baselines.
Universal Pre-Training by Iterated Random Computation: A Technical Overview
This paper presents a rigorous theoretical and empirical investigation into universal pre-training via iterated random computation. The central thesis is that models can be pre-trained on data generated by random computational processes—specifically, by passing random noise through randomly initialized neural networks—yielding representations that generalize across a wide range of downstream tasks. The work is grounded in algorithmic complexity theory, particularly class-bounded prefix-free complexity, and extends prior results on Solomonoff induction and universal sequence modeling.
Theoretical Foundations
The paper formalizes the notion of universal pre-training by leveraging the concept of class-universal distributions. Given a model class C (e.g., Turing machines, LSTMs), the class-universal distribution mC is constructed by sampling a model c∈C according to a prior p(c) and then sampling data from c using random bits. The key property is that mC dominates any individual c∈C up to a multiplicative constant, ensuring that, in expectation, a model trained to approximate mC will perform nearly as well as the best model in C for any data generated by C.
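In symbols (with the notation m_C and p_c chosen here for illustration rather than copied from the paper), the mixture construction and the domination property it implies can be written as:

```latex
% Mixture construction: sample c ~ p(c), then sample data x from c.
m_C(x) \;=\; \sum_{c \in C} p(c)\, p_c(x)

% Domination up to a multiplicative constant: for every c in C and every x,
m_C(x) \;\ge\; p(c)\, p_c(x)

% Consequence for log-loss: the excess loss of m_C against any single c
% is bounded by a constant that does not depend on the data.
-\log m_C(x) \;\le\; -\log p_c(x) \;-\; \log p(c)
```

The last line is the sense in which a model approximating m_C "performs nearly as well as the best model in C": its log-loss exceeds that of any fixed c by at most the constant -log p(c).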
The paper extends this framework by considering the iterative application of random computation: random noise is passed through a randomly sampled model from C, and this process is repeated, forming a hierarchy of increasingly complex distributions. Theoretically, under certain conditions (notably when C is sufficiently expressive, such as the class of LSTMs), this iterative process approaches the universal distribution in the limit, thus justifying the term "universal pre-training."
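One schematic way to write the resulting hierarchy (the notation below is assumed for illustration and is not taken from the paper) is as a sequence of distributions, each obtained by re-encoding the previous stage's output through a freshly sampled model:

```latex
% Stage 0: plain noise; stage k+1: the previous stage's output z, passed
% through a randomly sampled model c and read out autoregressively.
m^{(0)}(x)   \;=\; \text{uniform noise over sequences}
m^{(k+1)}(x) \;=\; \sum_{c \in C} p(c) \sum_{z} m^{(k)}(z)\, p_c(x \mid z)
```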
A significant theoretical contribution is the demonstration that, for practical model classes like LSTMs, the iterative process can approximate the universal distribution to any desired degree, subject to resource constraints. The analysis includes proofs of domination, convergence properties, and the compatibility of the prefix-free framework with sequential prediction tasks.
Practical Implementation
The practical instantiation involves generating synthetic data by sampling random LSTMs, conditioning on random sequences, and sampling outputs autoregressively. This synthetic data is then used to pre-train a standard autoregressive transformer. The implementation details are as follows:
- Data Generation: For each training instance, a new LSTM is randomly initialized. A random seed and a conditioning sequence are provided, and the LSTM generates a sequence of tokens (from a 256-character vocabulary) via autoregressive sampling (a simplified sketch of this loop follows the list).
- Buffering Mechanism: To approximate independent sampling from the class-universal distribution, a buffer of sequences is maintained. At each iteration, a subset of the buffer is replaced with new samples from the current LSTM, and training batches are drawn from the buffer.
- Model Architecture: The target model is a transformer with width and depth scaled according to established scaling laws. The source LSTM's width is scaled proportionally to the target model.
- Training Objective: The transformer is trained with next-token prediction (negative log-likelihood), aligning its output distribution with the synthetic data distribution.
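To make the pipeline concrete, the sketch below (in PyTorch; all sizes, the buffer-replacement fraction, and the helper names are illustrative assumptions, not values taken from the paper or its codebase) samples a freshly initialized LSTM, generates byte-level sequences autoregressively, and mixes them into a buffer from which training batches are drawn:

```python
import torch
import torch.nn as nn

VOCAB = 256          # byte-level vocabulary
SEQ_LEN = 128        # generated sequence length (illustrative)
BUFFER_SIZE = 4096   # sequences kept in the buffer (illustrative)
REPLACE_FRAC = 0.1   # fraction of the buffer refreshed per iteration (illustrative)

class RandomSource(nn.Module):
    """A randomly initialized LSTM used only to generate data, never trained."""
    def __init__(self, width=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, width)
        self.lstm = nn.LSTM(width, width, batch_first=True)
        self.head = nn.Linear(width, VOCAB)

    @torch.no_grad()
    def sample(self, batch, length=SEQ_LEN, temperature=1.0):
        # Start from random seed tokens and sample the rest autoregressively.
        tokens = torch.randint(VOCAB, (batch, 1))
        hidden = None
        out = [tokens]
        for _ in range(length - 1):
            h, hidden = self.lstm(self.embed(out[-1]), hidden)
            logits = self.head(h[:, -1]) / temperature
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            out.append(nxt)
        return torch.cat(out, dim=1)                 # (batch, length) token ids

# Buffer of synthetic sequences, partially refreshed with samples from a *new*
# random LSTM each iteration so batches approximate draws from the mixture.
buffer = torch.randint(VOCAB, (BUFFER_SIZE, SEQ_LEN))

def refresh_and_batch(batch_size=64):
    source = RandomSource()                          # fresh random weights
    n_new = int(REPLACE_FRAC * BUFFER_SIZE)
    idx = torch.randperm(BUFFER_SIZE)[:n_new]
    buffer[idx] = source.sample(n_new)               # overwrite part of the buffer
    batch_idx = torch.randint(BUFFER_SIZE, (batch_size,))
    return buffer[batch_idx]                         # training batch for the transformer
```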
The codebase is available at https://github.com/pbloem/up, facilitating reproducibility and further experimentation.
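Tying the pieces together, a minimal next-token training loop over the buffered synthetic data might look as follows; it reuses the hypothetical `refresh_and_batch` helper from the sketch above, and `model` stands for any autoregressive transformer mapping token ids to per-position logits (its exact architecture, which follows the scaling-law setup described above, is not specified here):

```python
import torch
import torch.nn.functional as F

def train_steps(model, steps=1000, lr=3e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        batch = refresh_and_batch()                  # synthetic byte sequences
        logits = model(batch[:, :-1])                # predict token t+1 from its prefix
        loss = F.cross_entropy(                      # next-token negative log-likelihood
            logits.reshape(-1, logits.size(-1)),
            batch[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
```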
Empirical Results
The empirical evaluation is comprehensive, spanning both synthetic and real-world datasets:
- Zero-Shot Generalization: Models pre-trained on synthetic data exhibit non-trivial zero-shot performance on a variety of downstream tasks, including natural language, code, and structured synthetic data. Notably, performance improves with model scale, and in some cases the pre-trained model outperforms in-context Markov models (a rough sketch of such a baseline follows this list).
- Finetuning: Pre-trained models, when finetuned on real-world data (e.g., Wikipedia, Linux kernel code), converge faster and generalize better to out-of-domain data compared to models trained from scratch. The pre-training cost can be amortized over multiple downstream tasks.
- Ablation Studies: The paper systematically ablates components such as the buffering mechanism, the depth of iterative computation, and the choice of random data generator (LSTM, transformer, automaton, pointwise random). Results indicate that both the structure of the random computation and the iterative enrichment are critical for effective pre-training.
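For context, the "in-context Markov model" comparison can be read as an order-n Markov (n-gram) predictor fit on the fly to the observed prefix of each sequence. The sketch below, whose order and add-alpha smoothing are assumptions rather than the paper's exact settings, scores a byte sequence in bits per character under such a baseline:

```python
import math
import random
from collections import defaultdict

def markov_bits_per_char(seq, order=2, vocab=256, alpha=1.0):
    """Online order-`order` Markov baseline: predict each byte from the
    preceding `order` bytes with add-alpha smoothing, updating counts as
    the sequence is read, so every prediction uses only the prefix."""
    context_counts = defaultdict(lambda: defaultdict(int))
    context_totals = defaultdict(int)
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        ctx = tuple(seq[max(0, i - order):i])
        count = context_counts[ctx][symbol]
        total = context_totals[ctx]
        prob = (count + alpha) / (total + alpha * vocab)   # smoothed estimate
        total_bits += -math.log2(prob)
        context_counts[ctx][symbol] += 1                   # update after predicting
        context_totals[ctx] += 1
    return total_bits / len(seq)

# Example: purely random bytes should score close to the 8 bits/char chance level.
print(markov_bits_per_char([random.randrange(256) for _ in range(10_000)]))
```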
Numerical Results and Claims
- Zero-shot performance: The pre-trained models consistently achieve lower bits-per-character than chance (8 bits/char; see the short derivation after this list) and, on real-world data, outperform Markov baselines.
- Scaling: Performance improves monotonically with model width and depth, suggesting the existence of a scaling law for universal pre-training.
- Finetuning speedup: Pre-trained models reach a given level of downstream performance roughly 1M training instances sooner than randomly initialized baselines of the same size.
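The 8 bits/char chance level follows directly from the 256-token vocabulary: a predictor that assigns uniform probability to every next byte pays

```latex
-\log_2 \tfrac{1}{256} \;=\; \log_2 256 \;=\; 8 \ \text{bits per character}
```

so any score below 8 bits/char indicates that the model has extracted genuine structure from data it has never seen.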
Implications and Limitations
Practical Implications
- Data-Compute Tradeoff: The results demonstrate that computational resources can substitute for real-world data in pre-training, potentially alleviating data scarcity and privacy concerns.
- Universality: While the practical implementation does not achieve true universality (as sampling from the universal distribution is intractable), the approach is broadly applicable across domains and tasks, provided the data-generating process is computational.
- Deployment: Pre-trained models can be distributed without concerns about proprietary or sensitive data, enabling open-source and privacy-preserving AI systems.
Theoretical Implications
- Algorithmic Complexity: The work bridges algorithmic information theory and practical machine learning, showing that class-universal distributions can be approximated with neural networks.
- No-Free-Lunch Theorem: By restricting attention to computational data-generating processes, the approach sidesteps the no-free-lunch theorem, enabling universal patterns to be learned and transferred.
Limitations and Future Directions
- Approximate Universality: The practical method samples from a restricted class (finite LSTMs, fixed sequence length), and the probability of sampling highly complex structures decays exponentially with depth.
- Domain Generalization: Experiments are limited to token sequences; extension to vision, audio, and multimodal data remains an open challenge.
- Resource Efficiency: The computational cost of generating and training on synthetic data is substantial, and the tradeoff may not always be favorable compared to collecting real data.
Speculation on Future Developments
- Scaling Laws: Further work may establish precise scaling laws for universal pre-training, analogous to those observed in LLMs.
- Unified Pre-Training: The approach could be extended to unified architectures capable of handling multiple modalities, leveraging universal computational structure.
- Open-Source Foundation Models: Universal pre-training may enable the creation of large, openly available foundation models that are free from data licensing and privacy constraints.
Conclusion
This paper provides a rigorous theoretical and empirical foundation for universal pre-training via iterated random computation. By grounding the approach in algorithmic complexity and demonstrating practical benefits in zero-shot generalization and finetuning, it opens a promising avenue for data-efficient, broadly applicable pre-training strategies. The work highlights a fundamental data-compute tradeoff and suggests that, with sufficient computational investment, models can acquire generalizable structure from synthetic data alone. Future research will determine the scalability, efficiency, and universality of this paradigm across domains and tasks.