Principled sampling of synthetic generators to match target domains

Develop principled methods to guide sampling from neural cellular automata–based synthetic data generators so that the resulting structures match the computational and statistical characteristics of specified downstream domains, rather than relying solely on coarse complexity measures such as gzip compressibility and alphabet size.
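
One hypothetical way to operationalize "matching statistical characteristics" is to compare summary statistics of synthetic samples against a target-domain corpus and minimize the gap. The sketch below is not from the paper; the feature choice (gzip compression ratio and alphabet size, exactly the coarse measures the problem asks to move beyond) is illustrative only, and a principled method would presumably use richer features.

```python
import gzip

def features(text: str) -> tuple[float, float]:
    """Two illustrative statistics: gzip compression ratio and alphabet size."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw), float(len(set(text)))

def mismatch(synthetic: list[str], target: list[str]) -> float:
    """L2 distance between mean feature vectors; lower means a closer match.

    In practice the two features would need rescaling to comparable ranges.
    """
    def mean_feats(texts: list[str]) -> list[float]:
        cols = list(zip(*(features(t) for t in texts)))
        return [sum(c) / len(c) for c in cols]
    s, t = mean_feats(synthetic), mean_feats(target)
    return sum((a - b) ** 2 for a, b in zip(s, t)) ** 0.5

# Example: structured synthetic samples vs. a (tiny) "code" target corpus.
print(mismatch(["abab" * 64], ["def f(x):\n    return x * x\n"]))
```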

Background

The paper shows that pre-pre-training LLMs on neural cellular automata (NCA) improves downstream language modeling and reasoning, and that transfer effectiveness depends strongly on the synthetic data’s structural complexity. Empirically, the optimal complexity band (measured by gzip compressibility) differs by domain: web text and math favor higher complexity, while code performs best at intermediate complexity.
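
The gzip proxy is cheap to compute. A minimal sketch of the measure, assuming UTF-8/byte input (the exact preprocessing in the paper may differ):

```python
import gzip
import random
import string

def gzip_ratio(data: bytes) -> float:
    """Compressed size over raw size; lower means more structured/redundant."""
    return len(gzip.compress(data)) / len(data)

regular = ("ab" * 500).encode()                                     # repetitive
noisy = "".join(random.choices(string.printable, k=1000)).encode()  # near-random
print(gzip_ratio(regular), gzip_ratio(noisy))  # roughly ~0.04 vs ~0.8
```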

While gzip compressibility and alphabet size provide useful but coarse controls over NCA complexity, the authors argue that complexity is multifaceted and that identifying the right axes (e.g., NCA network size, grid size, or measures like epiplexity) for each target domain is necessary to reliably match synthetic data structure to that of downstream tasks. A principled method for guiding synthetic sampling along the right complexity dimensions is therefore needed to enable scalable, domain-targeted synthetic pre-training; a toy version of such a search is sketched below.
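
The paper does not specify a selection procedure; the following is a hypothetical baseline that grid-searches generator knobs and keeps settings whose samples land in a target compressibility band. For brevity, an elementary 1-D cellular automaton stands in for a learned NCA, with the rule number and grid width as stand-ins for NCA network size and grid size; `TARGET_BAND` is an assumed value, not a number from the paper.

```python
import gzip
import random

def gzip_ratio(data: bytes) -> float:
    return len(gzip.compress(data)) / len(data)

def ca_sample(rule: int, width: int, steps: int, seed: int = 0) -> bytes:
    """Space-time diagram of an elementary 1-D cellular automaton, as bytes."""
    rng = random.Random(seed)
    row = [rng.randint(0, 1) for _ in range(width)]
    rows = []
    for _ in range(steps):
        rows.append("".join("01"[c] for c in row))
        # Wolfram encoding: rule bit indexed by the 3-cell neighborhood value.
        row = [(rule >> (4 * row[i - 1] + 2 * row[i] + row[(i + 1) % width])) & 1
               for i in range(width)]
    return "\n".join(rows).encode()

# Hypothetical target band (e.g., for a code-like domain); a real band would
# be estimated from the downstream corpus, not chosen by hand.
TARGET_BAND = (0.15, 0.40)
matches = [(rule, width)
           for rule in (30, 90, 110, 184)  # generator "program" knob
           for width in (32, 64, 128)      # grid-size knob
           if TARGET_BAND[0] <= gzip_ratio(ca_sample(rule, width, 256)) <= TARGET_BAND[1]]
print(matches)  # parameter settings whose samples land in the target band
```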

References

This points to a key open problem for future work: developing principled methods to guide synthetic generators to sample structures that match those of target domains.

Training Language Models via Neural Cellular Automata (2603.10055 - Lee et al., 9 Mar 2026), Discussion, "Limitations and open problems" subsection