Synthetic Pre-training on SCMs
- Synthetic pre-training on SCMs is a framework that generates diverse tabular tasks using randomly constructed structural causal models.
- It employs token-efficient serialization and teacher-guided distillation from tree-based models to enhance large language models' in-context learning.
- Empirical scaling laws indicate that increasing the number of in-context examples steadily improves accuracy, rivaling classical machine learning models.
Synthetic pre-training on Structural Causal Models (SCMs) refers to a framework for continued pretraining of LLMs using a vast and diverse corpus of tabular prediction tasks synthesized directly from randomly generated SCMs. This approach, as operationalized in MachineLearningLM, aims to enhance the in-context learning (ICL) capability of LLMs, particularly their ability to learn from many-shot tabular classification demonstrations, while explicitly preserving the underlying model's general knowledge and reasoning faculties. The central methodology serializes millions of artificial tabular classification problems into token-efficient prompts, leverages decision-level distillation from tree-based models such as random forests (RFs), and utilizes standard next-token language modeling objectives in pretraining. Empirically, this yields strong, monotonic scaling of accuracy with the number of in-context examples, outperforming existing LLM baselines and matching the performance of robust classical ML models across several domains.
1. Structural Causal Model Formalism and Task Generation
In synthetic pre-training, each data-generating process for a tabular task is formalized as a structural causal model $\mathcal{M} = (U, V, F, P_U)$, where $U$ denotes the exogenous noise variables, $V$ the endogenous variables (features), $F=\{f_v : x_{\mathrm{Pa}(v)} \mapsto x_v\}_{v \in V}$ a collection of node-specific structural functions, and $P_U$ the product distribution over $U$, typically taken to be i.i.d. standard normal.
SCMs are instantiated via:
- Sampling a directed acyclic graph (DAG) with a layer structure akin to a fully-connected multilayer perceptron, with randomly sampled depth and width.
- Imposing edges between all nodes in adjacent layers, with a random topological ordering.
- Defining each structural function $f_v$ as a randomly chosen nonlinear activation (e.g., ReLU, SiLU, or another common nonlinearity) with 70% probability, or as a gradient-boosted tree regressor (GBR) fit to Gaussian noise with 30% probability.
- Exogenous noise terms drawn i.i.d. from $\mathcal{N}(0,1)$ (alternatives such as uniform or Student's t are supported but empirically unimportant).
Sampling from such an SCM involves a topological pass through the DAG: $x_v = f_v(x_{\mathrm{Pa}(v)}) + \varepsilon_v$, with $\varepsilon_v \sim \mathcal{N}(0,1)$. This framework yields richly structured and statistically diverse synthetic tabular datasets.
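For intuition, the following is a minimal NumPy sketch of ancestral sampling through a small layered SCM. It assumes each structural function is a weighted sum of the parent layer passed through a randomly chosen activation, plus unit Gaussian noise; the helper names (`make_layered_scm`, `ancestral_sample`) and the default depth/width are illustrative, and the tree-based structural functions used 30% of the time are omitted for brevity.

```python
import numpy as np

def make_layered_scm(rng, depth=3, width=4):
    """Build a toy layered SCM: each node in layer l depends on all nodes in layer l-1.

    Returns per-node weight vectors and activation functions. A full pipeline would
    also mix in tree-based structural functions; here we use simple activations only.
    """
    activations = [np.tanh, lambda z: np.maximum(z, 0.0)]  # e.g. tanh, ReLU
    layers = []
    for _ in range(depth - 1):
        layer = []
        for _ in range(width):
            w = rng.normal(size=width)                      # edge weights from previous layer
            act = activations[rng.integers(len(activations))]
            layer.append((w, act))
        layers.append(layer)
    return layers

def ancestral_sample(layers, rng, n_samples, width=4):
    """Topological (layer-by-layer) pass: x_v = f_v(x_Pa(v)) + eps_v, eps_v ~ N(0, 1)."""
    values = [rng.normal(size=(n_samples, width))]          # root layer = exogenous noise
    for layer in layers:
        prev = values[-1]
        cols = [act(prev @ w) + rng.normal(size=n_samples) for w, act in layer]
        values.append(np.stack(cols, axis=1))
    return np.concatenate(values, axis=1)                   # all variables as features

rng = np.random.default_rng(0)
scm = make_layered_scm(rng)
data = ancestral_sample(scm, rng, n_samples=1000)
print(data.shape)                                           # (1000, 12) for depth=3, width=4
```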
2. Construction of Synthetic Tabular Prediction Tasks
Once an SCM is sampled, tabular classification tasks are constructed as follows:
- Generate a large set of i.i.d. samples by ancestral sampling on the SCM.
- Designate one coordinate of each sample as a latent regression "score."
- Select a classification arity (number of classes), with quantile boundaries defined on the score's empirical distribution.
- Discretize the score into labels by assigning class indices according to the quantile bins, then randomly shuffle the class IDs.
- Subsample "demonstration" (feature, label) pairs, query inputs, and the feature dimensionality (a construction sketch follows this list). Repeating this procedure yields millions of tasks with varied feature types, dimensionality, class distributions, and underlying generative dependencies.
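The snippet below sketches this task construction, assuming the score column is chosen uniformly at random and that demonstration and query sets are disjoint subsamples; the function name `make_classification_task` and the default sizes are illustrative.

```python
import numpy as np

def make_classification_task(X, rng, n_classes=3, n_demos=64, n_queries=16):
    """Turn raw SCM samples X into a classification task (illustrative sketch).

    One column acts as the latent 'score'; labels come from its quantile bins,
    with class IDs randomly permuted so the label order carries no signal.
    """
    score_col = rng.integers(X.shape[1])
    score = X[:, score_col]
    features = np.delete(X, score_col, axis=1)

    # Quantile boundaries on the empirical score distribution.
    edges = np.quantile(score, np.linspace(0, 1, n_classes + 1)[1:-1])
    labels = np.digitize(score, edges)               # class indices 0 .. n_classes-1
    labels = rng.permutation(n_classes)[labels]      # shuffle class IDs

    idx = rng.permutation(len(score))
    demo, query = idx[:n_demos], idx[n_demos:n_demos + n_queries]
    return (features[demo], labels[demo]), (features[query], labels[query])

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 12))                      # stand-in for SCM samples
(demo_X, demo_y), (query_X, query_y) = make_classification_task(X, rng)
print(demo_X.shape, demo_y[:5])
```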
3. Token-Efficient Serialization and Prompt Construction
To maximize context utilization, MachineLearningLM employs a serialization strategy based on comma-delimited tabular rows without natural language filler. Each synthetic prompt consists of:
- An instruction header.
- A demonstration block: one comma-delimited row per demonstration, consisting of the quantized feature integers followed by the label.
- A query block: one comma-delimited row per query, consisting of the quantized feature integers only. All features are z-normalized and quantized to a fixed integer range.
The output target is a single JSON array containing the predicted labels for all queries.
This design delivers a 3x–6x increase in task-per-context window efficiency and enables up to 50x throughput improvement via batch inference.
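A minimal sketch of such a serialization pipeline is shown below. The instruction wording, the clipping to roughly ±3 standard deviations, the 0–99 integer range, and the `### queries` separator are all illustrative assumptions rather than the exact format used in MachineLearningLM.

```python
import json
import numpy as np

def quantize(X, low=0, high=99):
    """Z-normalize each feature, clip to +/-3 sigma, and map to integers in [low, high].
    The clipping and integer ranges are illustrative assumptions."""
    z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    z = np.clip(z, -3.0, 3.0)
    return np.round((z + 3.0) / 6.0 * (high - low) + low).astype(int)

def serialize_prompt(demo_X, demo_y, query_X):
    """Comma-delimited rows with no natural-language filler between examples.
    (A real pipeline would share normalization statistics between demos and queries.)"""
    header = "Predict the class of each query row. Answer with a single JSON array of labels."
    demo_rows = [",".join(map(str, row)) + f",{y}"
                 for row, y in zip(quantize(demo_X), demo_y)]
    query_rows = [",".join(map(str, row)) for row in quantize(query_X)]
    return "\n".join([header, *demo_rows, "### queries", *query_rows])

def parse_output(text, n_queries):
    """The model's answer is expected to be one JSON array with one label per query."""
    preds = json.loads(text)
    assert len(preds) == n_queries, "model must emit exactly one label per query"
    return preds

rng = np.random.default_rng(2)
print(serialize_prompt(rng.normal(size=(4, 3)), [0, 1, 1, 0], rng.normal(size=(2, 3))))
print(parse_output("[1, 0]", n_queries=2))
```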
4. Teacher-Guided Distillation and Pretraining Objective
To stabilize early-stage learning on pure synthetic distributions, the framework employs a random-forest "warm-start" distillation process:
- Each synthetic classification task is first evaluated by a small random forest (RF) trained on the demonstration set.
- Tasks where the RF fails to exceed a conservative label-distribution baseline are discarded; the check combines a binomial significance test, a Cohen's effect-size criterion, and balanced-accuracy thresholds.
- During the warm-up (1 million tasks), only query instances that pass an RF-based filtering criterion are retained, and the LLM is trained to reproduce the RF labels using the standard left-to-right next-token negative log-likelihood, $\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(y_t \mid y_{<t})$, computed over the target tokens. After the warm-up, distillation is discontinued, and pretraining continues directly on the synthetic ground-truth labels.
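A sketch of such an RF quality gate is given below, assuming scikit-learn and SciPy. The majority-class baseline, the 100-tree forest, and the 0.05 significance level are illustrative choices, and the Cohen's effect-size and balanced-accuracy checks are omitted for brevity.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.ensemble import RandomForestClassifier

def rf_quality_gate(demo_X, demo_y, query_X, query_y, alpha=0.05):
    """Train a small RF on the demonstrations and keep the task only if the RF
    significantly beats a conservative baseline; thresholds are illustrative."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(demo_X, demo_y)
    preds = rf.predict(query_X)

    # Baseline success rate: always predicting the most frequent demonstration label.
    baseline_rate = np.bincount(demo_y).max() / len(demo_y)
    n_correct = int((preds == query_y).sum())

    # One-sided binomial test: is RF accuracy significantly above the baseline rate?
    p_value = binomtest(n_correct, len(query_y), baseline_rate, alternative="greater").pvalue
    return p_value < alpha, preds

rng = np.random.default_rng(3)
demo_X, query_X = rng.normal(size=(64, 8)), rng.normal(size=(32, 8))
demo_y, query_y = rng.integers(3, size=64), rng.integers(3, size=32)
keep_task, rf_labels = rf_quality_gate(demo_X, demo_y, query_X, query_y)
print(keep_task)  # on pure noise the gate should usually reject the task
```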
5. Empirical Scaling Laws and Generalization
MachineLearningLM reveals a pronounced, monotonic increase in ICL accuracy as the number of demonstrations grows from $8$ to $1024$, distinguishing itself from vanilla LLM baselines that plateau or degrade at high shot counts. On the 32-dataset TALENT subset, average test accuracy is as follows:
| # shots | 8 | 16 | 32 | 64 | 128 | 512 |
|---|---|---|---|---|---|---|
| MachineLearningLM | 58.4% | 63.1% | 66.7% | 70.0% | 72.0% | 75.3% |
A simple log-linear law describes this trend: accuracy grows approximately linearly in $\log_2$ of the shot count, with a gain of roughly three percentage points per doubling over this range.
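As a quick check, the least-squares fit below (using only the table values above) recovers a slope of roughly 2.8 percentage points per doubling; the fit itself is illustrative and not part of the original analysis.

```python
import numpy as np

shots = np.array([8, 16, 32, 64, 128, 512])
acc = np.array([58.4, 63.1, 66.7, 70.0, 72.0, 75.3])   # values from the table above

# Least-squares fit of acc ≈ alpha + beta * log2(shots); beta is the per-doubling gain.
beta, alpha = np.polyfit(np.log2(shots), acc, deg=1)
print(f"intercept ≈ {alpha:.1f}%, gain per doubling ≈ {beta:.1f} points")
# intercept ≈ 51.8%, gain per doubling ≈ 2.8 points
```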
Across 200 real tabular tasks from finance, biology, physics, and healthcare, MachineLearningLM outperforms other LLM-based systems (e.g., GPT-5-mini, o3-mini, Qwen-Instruct) by approximately 15 percentage points in 512-shot ICL, and matches Random Forests within 2 points at high shot counts. MMLU benchmark results confirm preservation of general chat and reasoning ability (0-shot 73.2%, 50-shot 75.4%).
6. Architectural and Practical Considerations
MachineLearningLM leverages a Qwen-2.5-7B-Instruct backbone with LoRA rank 8 adapters. The integer-based, tabular prompt format succinctly fits more data in each context window, facilitating high-throughput batch prediction and broader coverage of many-shot scenarios. No architectural modifications or auxiliary output heads are required beyond standard next-token autoregressive training. No task-specific fine-tuning is performed; all general-purpose knowledge and reasoning skills are preserved.
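For concreteness, a minimal sketch of attaching rank-8 LoRA adapters to the stated backbone with Hugging Face `transformers` and `peft` follows; the target modules, scaling factor, and dropout are assumptions, since the source specifies only the backbone and the adapter rank.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Backbone and rank come from the text above; other hyperparameters are assumptions.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_config = LoraConfig(
    r=8,                                   # rank-8 adapters, as stated above
    lora_alpha=16,                         # assumption: common default scaling
    lora_dropout=0.05,                     # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```

Training then proceeds with the standard causal language modeling loss over the serialized prompts, with no auxiliary heads or architectural changes.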
The two-stage training—teacher-guided, then ground-truth-driven—ensures both robustness in numerical modeling and sample efficiency. Token-efficient serialization supports scale-out to millions of synthetic tasks per pretraining epoch, greatly increasing the diversity and coverage of functional relationships encountered by the LLM. A plausible implication is that similar frameworks could be extended to other domains where high-quality synthetic simulators exist.
7. Significance and Outlook
Synthetic pre-training on SCMs enables general-purpose LLMs to reach or exceed specialist ML models' performance on tabular prediction under in-context learning, with scaling laws indicating continued gains up to and beyond 1,024-shot settings. This approach provides a fully self-supervised recipe for instilling amortized learning-to-learn behaviors, leveraging highly structured, richly varied synthetic problems generated en masse. Given the demonstrated retention of core LLM reasoning and knowledge abilities, further exploration of SCM-based synthetic pretraining may catalyze advances in other domains where ICL and sample-efficiency are critical, and represents a principled avenue for bridging classical ML decision strategies with deep foundation models.