Synthetic Pre-training on SCMs
- Synthetic pre-training on SCMs is a framework that generates diverse tabular tasks using randomly constructed structural causal models.
- It employs token-efficient serialization and teacher-guided distillation from tree-based models to enhance large language models' in-context learning.
- Empirical scaling laws indicate that increasing the number of in-context examples steadily improves accuracy, rivaling classical machine learning models.
Synthetic pre-training on Structural Causal Models (SCMs) refers to a framework for continued pretraining of LLMs using a vast and diverse corpus of tabular prediction tasks synthesized directly from randomly generated SCMs. This approach, as operationalized in MachineLearningLM, aims to enhance the in-context learning (ICL) capability of LLMs, particularly their ability to learn from many-shot tabular classification demonstrations, while explicitly preserving the underlying model's general knowledge and reasoning faculties. The central methodology serializes millions of artificial tabular classification problems into token-efficient prompts, leverages decision-level distillation from tree-based models such as random forests (RFs), and utilizes standard next-token language modeling objectives in pretraining. Empirically, this yields strong, monotonic scaling of accuracy with the number of in-context examples, outperforming existing LLM baselines and matching the performance of robust classical ML models across several domains.
1. Structural Causal Model Formalism and Task Generation
In synthetic pre-training, each data-generating process for a tabular task is formalized as a structural causal model $\mathcal{M} = (U, V, F, P_U)$, where $U$ denotes the exogenous noise variables, $V$ the endogenous variables (features), $F=\{f_v : x_{\mathrm{Pa}(v)} \mapsto x_v\}_{v \in V}$ a collection of node-specific structural functions, and $P_U$ the product distribution over $U$, typically taken to be i.i.d. standard normal.
SCMs are instantiated via:
- Sampling a directed acyclic graph (DAG) with a layer structure akin to a fully-connected multilayer perceptron, with randomly sampled depth and width.
- Imposing edges between all nodes in adjacent layers, with a random topological ordering.
- Defining each structural function $f_v$ as a randomly chosen nonlinear activation (e.g., ReLU, SiLU, or another common nonlinearity) with 70% probability, or as a gradient-boosted tree regressor (GBR) fit to Gaussian noise with 30% probability.
- Exogenous noise terms drawn i.i.d. from $\mathcal{N}(0,1)$ (alternatives such as uniform or Student's t are supported but empirically unimportant).
Sampling from such an SCM involves a topological pass through the DAG: $x_v = f_v(x_{\mathrm{Pa}(v)}) + \varepsilon_v$, with $\varepsilon_v \sim \mathcal{N}(0,1)$. This framework yields richly structured and statistically diverse synthetic tabular datasets.
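For intuition, the following is a minimal NumPy sketch of ancestral sampling through a small layered SCM. It assumes each structural function is a weighted sum of the parent layer passed through a randomly chosen activation, plus unit Gaussian noise; the helper names (`make_layered_scm`, `ancestral_sample`) and the default depth/width are illustrative, and the tree-based structural functions used 30% of the time are omitted for brevity.

```python
import numpy as np

def make_layered_scm(rng, depth=3, width=4):
    """Build a toy layered SCM: each node in layer l depends on all nodes in layer l-1.

    Returns per-node weight vectors and activation functions. A full pipeline would
    also mix in tree-based structural functions; here we use simple activations only.
    """
    activations = [np.tanh, lambda z: np.maximum(z, 0.0)]  # e.g. tanh, ReLU
    layers = []
    for _ in range(depth - 1):
        layer = []
        for _ in range(width):
            w = rng.normal(size=width)                      # edge weights from previous layer
            act = activations[rng.integers(len(activations))]
            layer.append((w, act))
        layers.append(layer)
    return layers

def ancestral_sample(layers, rng, n_samples, width=4):
    """Topological (layer-by-layer) pass: x_v = f_v(x_Pa(v)) + eps_v, eps_v ~ N(0, 1)."""
    values = [rng.normal(size=(n_samples, width))]          # root layer = exogenous noise
    for layer in layers:
        prev = values[-1]
        cols = [act(prev @ w) + rng.normal(size=n_samples) for w, act in layer]
        values.append(np.stack(cols, axis=1))
    return np.concatenate(values, axis=1)                   # all variables as features

rng = np.random.default_rng(0)
scm = make_layered_scm(rng)
data = ancestral_sample(scm, rng, n_samples=1000)
print(data.shape)                                           # (1000, 12) for depth=3, width=4
```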
2. Construction of Synthetic Tabular Prediction Tasks
Once an SCM is sampled, tabular classification tasks are constructed as follows:
- Generate a large set of i.i.d. samples by ancestral sampling on the SCM.
- Designate one coordinate of each sample as a latent regression "score."
- Select a classification arity (number of classes), with quantile boundaries defined on the score's empirical distribution.
- Discretize the score into labels by assigning class indices according to the quantile bins, then randomly shuffle the class IDs.
- Subsample "demonstration" (feature, label) pairs, query inputs, and the feature dimensionality (a construction sketch follows this list). Repeating this procedure yields millions of tasks with varied feature types, dimensionality, class distributions, and underlying generative dependencies.
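The snippet below sketches this task construction, assuming the score column is chosen uniformly at random and that demonstration and query sets are disjoint subsamples; the function name `make_classification_task` and the default sizes are illustrative.

```python
import numpy as np

def make_classification_task(X, rng, n_classes=3, n_demos=64, n_queries=16):
    """Turn raw SCM samples X into a classification task (illustrative sketch).

    One column acts as the latent 'score'; labels come from its quantile bins,
    with class IDs randomly permuted so the label order carries no signal.
    """
    score_col = rng.integers(X.shape[1])
    score = X[:, score_col]
    features = np.delete(X, score_col, axis=1)

    # Quantile boundaries on the empirical score distribution.
    edges = np.quantile(score, np.linspace(0, 1, n_classes + 1)[1:-1])
    labels = np.digitize(score, edges)               # class indices 0 .. n_classes-1
    labels = rng.permutation(n_classes)[labels]      # shuffle class IDs

    idx = rng.permutation(len(score))
    demo, query = idx[:n_demos], idx[n_demos:n_demos + n_queries]
    return (features[demo], labels[demo]), (features[query], labels[query])

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 12))                      # stand-in for SCM samples
(demo_X, demo_y), (query_X, query_y) = make_classification_task(X, rng)
print(demo_X.shape, demo_y[:5])
```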
3. Token-Efficient Serialization and Prompt Construction
To maximize context utilization, MachineLearningLM employs a serialization strategy based on comma-delimited tabular rows without natural language filler. Each synthetic prompt consists of:
- An instruction header.
- A demonstration block: one comma-delimited row per demonstration, consisting of the quantized feature integers followed by the label.
- A query block: one comma-delimited row per query, consisting of the quantized feature integers only. All features are z-normalized and quantized to a fixed integer range.
The output target is a single JSON array containing the predicted labels for all queries.
This design delivers a 3x–6x increase in task-per-context window efficiency and enables up to 50x throughput improvement via batch inference.
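A minimal sketch of such a serialization pipeline is shown below. The instruction wording, the clipping to roughly ±3 standard deviations, the 0–99 integer range, and the `### queries` separator are all illustrative assumptions rather than the exact format used in MachineLearningLM.

```python
import json
import numpy as np

def quantize(X, low=0, high=99):
    """Z-normalize each feature, clip to +/-3 sigma, and map to integers in [low, high].
    The clipping and integer ranges are illustrative assumptions."""
    z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    z = np.clip(z, -3.0, 3.0)
    return np.round((z + 3.0) / 6.0 * (high - low) + low).astype(int)

def serialize_prompt(demo_X, demo_y, query_X):
    """Comma-delimited rows with no natural-language filler between examples.
    (A real pipeline would share normalization statistics between demos and queries.)"""
    header = "Predict the class of each query row. Answer with a single JSON array of labels."
    demo_rows = [",".join(map(str, row)) + f",{y}"
                 for row, y in zip(quantize(demo_X), demo_y)]
    query_rows = [",".join(map(str, row)) for row in quantize(query_X)]
    return "\n".join([header, *demo_rows, "### queries", *query_rows])

def parse_output(text, n_queries):
    """The model's answer is expected to be one JSON array with one label per query."""
    preds = json.loads(text)
    assert len(preds) == n_queries, "model must emit exactly one label per query"
    return preds

rng = np.random.default_rng(2)
print(serialize_prompt(rng.normal(size=(4, 3)), [0, 1, 1, 0], rng.normal(size=(2, 3))))
print(parse_output("[1, 0]", n_queries=2))
```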
4. Teacher-Guided Distillation and Pretraining Objective
To stabilize early-stage learning on pure synthetic distributions, the framework employs a random-forest "warm-start" distillation process:
- Each synthetic classification task is first evaluated by a small random forest (RF) trained on the demonstration set.
- Tasks where the RF fails to exceed a conservative label-distribution baseline are discarded; the check combines a binomial significance test, a Cohen's effect-size criterion, and balanced-accuracy thresholds.
- During the warm-up (1 million tasks), only query instances that pass an RF-based filtering criterion are retained, and the LLM is trained to reproduce the RF labels using the standard left-to-right next-token negative log-likelihood, $\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(y_t \mid y_{<t})$, computed over the target tokens. After the warm-up, distillation is discontinued, and pretraining continues directly on the synthetic ground-truth labels.
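A sketch of such an RF quality gate is given below, assuming scikit-learn and SciPy. The majority-class baseline, the 100-tree forest, and the 0.05 significance level are illustrative choices, and the Cohen's effect-size and balanced-accuracy checks are omitted for brevity.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.ensemble import RandomForestClassifier

def rf_quality_gate(demo_X, demo_y, query_X, query_y, alpha=0.05):
    """Train a small RF on the demonstrations and keep the task only if the RF
    significantly beats a conservative baseline; thresholds are illustrative."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(demo_X, demo_y)
    preds = rf.predict(query_X)

    # Baseline success rate: always predicting the most frequent demonstration label.
    baseline_rate = np.bincount(demo_y).max() / len(demo_y)
    n_correct = int((preds == query_y).sum())

    # One-sided binomial test: is RF accuracy significantly above the baseline rate?
    p_value = binomtest(n_correct, len(query_y), baseline_rate, alternative="greater").pvalue
    return p_value < alpha, preds

rng = np.random.default_rng(3)
demo_X, query_X = rng.normal(size=(64, 8)), rng.normal(size=(32, 8))
demo_y, query_y = rng.integers(3, size=64), rng.integers(3, size=32)
keep_task, rf_labels = rf_quality_gate(demo_X, demo_y, query_X, query_y)
print(keep_task)  # on pure noise the gate should usually reject the task
```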
5. Empirical Scaling Laws and Generalization
MachineLearningLM reveals a pronounced, monotonic increase in ICL accuracy as the number of demonstrations grows from $8$ to $1024$, distinguishing itself from vanilla LLM baselines that plateau or degrade at high shot counts. On the 32-dataset TALENT subset, average test accuracy is as follows:
| # shots | 8 | 16 | 32 | 64 | 128 | 512 |
|---|---|---|---|---|---|---|
| MachineLearningLM | 58.4% | 63.1% | 66.7% | 70.0% | 72.0% | 75.3% |
A simple log-linear law describes this trend: accuracy grows approximately linearly in $\log_2$ of the shot count, with a gain of roughly three percentage points per doubling over this range.
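As a quick check, the least-squares fit below (using only the table values above) recovers a slope of roughly 2.8 percentage points per doubling; the fit itself is illustrative and not part of the original analysis.

```python
import numpy as np

shots = np.array([8, 16, 32, 64, 128, 512])
acc = np.array([58.4, 63.1, 66.7, 70.0, 72.0, 75.3])   # values from the table above

# Least-squares fit of acc ≈ alpha + beta * log2(shots); beta is the per-doubling gain.
beta, alpha = np.polyfit(np.log2(shots), acc, deg=1)
print(f"intercept ≈ {alpha:.1f}%, gain per doubling ≈ {beta:.1f} points")
# intercept ≈ 51.8%, gain per doubling ≈ 2.8 points
```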
Across 200 real tabular tasks from finance, biology, physics, and healthcare, MachineLearningLM outperforms other LLM-based systems (e.g., GPT-5-mini, o3-mini, Qwen-Instruct) by approximately 15 percentage points in 512-shot ICL, and matches Random Forests within 2 points at high shot counts. MMLU benchmark results confirm preservation of general chat and reasoning ability (0-shot 73.2%, 50-shot 75.4%).
6. Architectural and Practical Considerations
MachineLearningLM leverages a Qwen-2.5-7B-Instruct backbone with LoRA rank 8 adapters. The integer-based, tabular prompt format succinctly fits more data in each context window, facilitating high-throughput batch prediction and broader coverage of many-shot scenarios. No architectural modifications or auxiliary output heads are required beyond standard next-token autoregressive training. No task-specific fine-tuning is performed; all general-purpose knowledge and reasoning skills are preserved.
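For concreteness, a minimal sketch of attaching rank-8 LoRA adapters to the stated backbone with Hugging Face `transformers` and `peft` follows; the target modules, scaling factor, and dropout are assumptions, since the source specifies only the backbone and the adapter rank.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Backbone and rank come from the text above; other hyperparameters are assumptions.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_config = LoraConfig(
    r=8,                                   # rank-8 adapters, as stated above
    lora_alpha=16,                         # assumption: common default scaling
    lora_dropout=0.05,                     # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```

Training then proceeds with the standard causal language modeling loss over the serialized prompts, with no auxiliary heads or architectural changes.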
The two-stage training—teacher-guided, then ground-truth-driven—ensures both robustness in numerical modeling and sample efficiency. Token-efficient serialization supports scale-out to millions of synthetic tasks per pretraining epoch, greatly increasing the diversity and coverage of functional relationships encountered by the LLM. A plausible implication is that similar frameworks could be extended to other domains where high-quality synthetic simulators exist.
7. Significance and Outlook
Synthetic pre-training on SCMs enables general-purpose LLMs to reach or exceed specialist ML models' performance on tabular prediction under in-context learning, with scaling laws indicating continued gains up to and beyond 1,024-shot settings. This approach provides a fully self-supervised recipe for instilling amortized learning-to-learn behaviors, leveraging highly structured, richly varied synthetic problems generated en masse. Given the demonstrated retention of core LLM reasoning and knowledge abilities, further exploration of SCM-based synthetic pretraining may catalyze advances in other domains where ICL and sample-efficiency are critical, and represents a principled avenue for bridging classical ML decision strategies with deep foundation models.