
Synthetic Pre-training on SCMs

Updated 17 November 2025
  • Synthetic pre-training on SCMs is a framework that generates diverse tabular tasks using randomly constructed structural causal models.
  • It employs token-efficient serialization and teacher-guided distillation from tree-based models to enhance large language models' in-context learning.
  • Empirical scaling laws indicate that increasing the number of in-context examples steadily improves accuracy, rivaling classical machine learning models.

Synthetic pre-training on Structural Causal Models (SCMs) refers to a framework for continued pretraining of LLMs on a vast, diverse corpus of tabular prediction tasks synthesized directly from randomly generated SCMs. This approach, as operationalized in MachineLearningLM, aims to enhance the in-context learning (ICL) capability of LLMs, particularly their ability to learn from many-shot tabular classification demonstrations, while explicitly preserving the underlying model's general knowledge and reasoning faculties. The central methodology serializes millions of artificial tabular classification problems into token-efficient prompts, leverages decision-level distillation from tree-based models such as random forests (RFs), and uses the standard next-token language-modeling objective during pretraining. Empirically, this yields strong, monotonic scaling of accuracy with the number of in-context examples, outperforming existing LLM baselines and matching robust classical ML models across several domains.

1. Structural Causal Model Formalism and Task Generation

In synthetic pre-training, each data-generating process for a tabular task is formalized as a structural causal model $\mathcal{S} = (U, V, F, P_U)$, where $U = \{\varepsilon_v : v \in V\}$ denotes the exogenous noise variables, $V = \{x_v : v = 1, \dots, |V|\}$ the endogenous variables (features), $F = \{f_v : x_{\mathrm{Pa}(v)} \mapsto x_v\}_{v \in V}$ a collection of node-specific structural functions, and $P_U$ the product distribution over $U$, typically $\prod_v \mathcal{N}(0,1)$.

SCMs are instantiated via:

  • Sampling a directed acyclic graph (DAG) with a layer structure akin to a fully connected multilayer perceptron (depth $L$, width $w$).
  • Imposing edges between all nodes in adjacent layers, with a random topological ordering.
  • Defining $f_v$ as a randomly chosen nonlinear activation (e.g., $\tanh$, ReLU, SiLU, $\sin$, $\exp$) with 70% probability, or as a gradient-boosted tree regressor (GBR) fit to Gaussian noise with 30% probability.
  • Drawing exogenous noise terms $\varepsilon_v$ i.i.d. from $\mathcal{N}(0,1)$ (alternatives such as uniform or Student's t are supported but empirically unimportant).

Sampling from such an SCM involves a topological pass through the DAG: $x_v = f_v(x_{\mathrm{Pa}(v)}) + \varepsilon_v$, with $\varepsilon_v \sim \mathcal{N}(0,1)$. This framework yields richly structured and statistically diverse synthetic tabular datasets.
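
To make the generative recipe concrete, the following is a minimal NumPy sketch (not the authors' released code) of sampling one layered SCM and drawing rows by ancestral sampling. It assumes the MLP-like DAG and Gaussian noise described above, uses only the nonlinear-activation branch for $f_v$ (the 30% GBR branch is omitted), and clips pre-activations for numerical stability; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate activations: tanh, ReLU, SiLU, sin, exp (the 30% GBR branch is omitted in this sketch).
ACTIVATIONS = [
    np.tanh,
    lambda x: np.maximum(x, 0.0),        # ReLU
    lambda x: x / (1.0 + np.exp(-x)),    # SiLU
    np.sin,
    np.exp,
]

def sample_scm(depth=3, width=5):
    """Return per-node (parents, weights, activation) for an MLP-like layered DAG."""
    nodes = []
    for layer in range(depth):
        for _ in range(width):
            if layer == 0:
                nodes.append((np.array([], dtype=int), None, None))      # roots: pure exogenous noise
            else:
                parents = np.arange((layer - 1) * width, layer * width)  # fully connected to previous layer
                weights = rng.normal(size=parents.size)
                act = ACTIVATIONS[rng.integers(len(ACTIVATIONS))]
                nodes.append((parents, weights, act))
    return nodes

def ancestral_sample(nodes, n):
    """Evaluate x_v = f_v(x_Pa(v)) + eps_v in topological (layer) order for n i.i.d. rows."""
    X = np.zeros((n, len(nodes)))
    for v, (parents, w, act) in enumerate(nodes):
        eps = rng.normal(size=n)
        if parents.size == 0:
            X[:, v] = eps
        else:
            pre = np.clip(X[:, parents] @ w, -10.0, 10.0)   # clip pre-activations so exp stays finite
            X[:, v] = act(pre) + eps
    return X

X = ancestral_sample(sample_scm(), n=2000)   # each row is one sample of all endogenous variables
```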

2. Construction of Synthetic Tabular Prediction Tasks

Once an SCM is sampled, tabular classification tasks are constructed as follows:

  1. Generate a large set of i.i.d. samples $\{x^{(i)}\}_{i=1}^n$ by ancestral sampling on the SCM.
  2. Designate one coordinate $y^\star = x_{v^\star}$ as a latent regression "score."
  3. Select a classification arity $K \in \{2, \dots, 10\}$ with quantile boundaries $\tau_1 < \dots < \tau_{K-1}$ defined on the empirical distribution of $y^\star$.
  4. Discretize $y^\star$ into labels $y$ by assigning class indices according to the quantile bins, randomly shuffling the class IDs:

$y = \arg\max_{k \in \{1, \dots, K\}} \; \mathbb{1}\{\tau_{k-1} \leq y^\star < \tau_k\}$, with $\tau_0 = -\infty$ and $\tau_K = +\infty$.

  5. Subsample $M$ "demonstration" $(x, y)$ pairs ($M \leq 1024$) and $N = 50$ query inputs $x$, with feature dimensionality $d \in [5, 50]$. This yields millions of tasks with varied type, dimensionality, class distribution, and underlying generative dependencies.
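
The steps above can be sketched as follows. This is an illustrative reconstruction rather than the paper's pipeline: it reuses the matrix X from the SCM sketch and omits details such as sampling the feature dimensionality $d \in [5, 50]$.

```python
import numpy as np

def make_task(X, rng, K=None, M=256, N=50):
    """Turn SCM samples X into one K-way classification task with M demos and N queries."""
    n, d_total = X.shape
    K = K or int(rng.integers(2, 11))          # classification arity K in {2, ..., 10}
    target = int(rng.integers(d_total))        # latent regression score y* = x_{v*}
    y_star = X[:, target]
    features = np.delete(X, target, axis=1)

    # Quantile boundaries tau_1 < ... < tau_{K-1} on the empirical distribution of y*.
    taus = np.quantile(y_star, np.linspace(0, 1, K + 1)[1:-1])
    y = np.digitize(y_star, taus)              # class index in {0, ..., K-1}
    y = rng.permutation(K)[y]                  # randomly shuffle class IDs

    idx = rng.permutation(n)
    demo, query = idx[:M], idx[M:M + N]
    return (features[demo], y[demo]), (features[query], y[query])

rng = np.random.default_rng(1)
(demo_X, demo_y), (query_X, query_y) = make_task(X, rng)   # X from the SCM sketch above
```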

3. Token-Efficient Serialization and Prompt Construction

To maximize context utilization, MachineLearningLM employs a serialization strategy based on comma-delimited tabular rows without natural language filler. Each synthetic prompt consists of:

  • An instruction header $H$.
  • A demonstration set $S$ containing $M$ rows: each is $\texttt{demoID}_j$: followed by $d$ feature integers and a label.
  • A query block $Q$ with $N$ rows: each is $\texttt{queryID}_j$: followed by $d$ feature integers. All features are z-normalized and quantized into $[0, 999]$ via $i = \mathrm{clip}(\mathrm{round}(120\,z + 500), 0, 999)$.

The output target is a single JSON array with predicted labels for all queries: [{"id": ..., "label": ...}, ...]

This design delivers a 3x–6x increase in the number of tasks that fit per context window and enables up to a 50x throughput improvement via batch inference.
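
A hedged sketch of this serialization is shown below: features are z-normalized with statistics from the demonstration rows, quantized via $i = \mathrm{clip}(\mathrm{round}(120\,z + 500), 0, 999)$, and emitted as comma-delimited demonstration and query rows. The header wording, row templates, and separators are illustrative choices, not the paper's exact prompt.

```python
import numpy as np

def quantize(X, mean, std):
    """z-normalize with given statistics, then map to integers via i = clip(round(120*z + 500), 0, 999)."""
    z = (X - mean) / (std + 1e-8)
    return np.clip(np.round(120.0 * z + 500.0), 0, 999).astype(int)

def serialize(demo_X, demo_y, query_X, header="Predict the label for each query row."):
    mean, std = demo_X.mean(axis=0), demo_X.std(axis=0)      # normalization fit on demonstrations
    lines = [header]
    for j, (row, label) in enumerate(zip(quantize(demo_X, mean, std), demo_y)):
        lines.append(f"demoID{j}: " + ",".join(map(str, row)) + f" -> {label}")
    for j, row in enumerate(quantize(query_X, mean, std)):
        lines.append(f"queryID{j}: " + ",".join(map(str, row)))
    lines.append('Answer with a single JSON array: [{"id": ..., "label": ...}, ...]')
    return "\n".join(lines)

prompt = serialize(demo_X, demo_y, query_X)   # inputs from the task-construction sketch above
```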

4. Teacher-Guided Distillation and Pretraining Objective

To stabilize early-stage learning on pure synthetic distributions, the framework employs a random-forest "warm-start" distillation process:

  • Each synthetic classification task is first evaluated by a small random forest (RF) trained on the demonstration set.
  • Tasks where the RF fails to exceed a conservative label-distribution baseline $p_0 = \max\left(\sum_k p_k^2,\; \max_k p_k\right)$ are discarded, verified with a binomial test ($\alpha = 0.2$), Cohen's $\kappa > 0$, and balanced-accuracy checks.
  • During the warm-up (approximately 1 million tasks), only those query instances where $\hat{y}_\text{RF} = y_\text{true}$ are retained. The LLM is trained to reproduce RF labels using the standard left-to-right next-token negative log-likelihood: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid \mathbf{x}, y_{<t})$. After the warm-up, distillation is discontinued, and pretraining continues directly on the synthetic ground-truth labels.
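
The filtering and warm-up selection can be sketched as below, assuming scikit-learn's RandomForestClassifier. For brevity, the binomial test, Cohen's kappa, and balanced-accuracy checks are collapsed into a single accuracy-versus-baseline comparison, so this is a simplification rather than the exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def filter_task(demo_X, demo_y, query_X, query_y, n_trees=100):
    """Keep a task only if a small RF beats the baseline; during warm-up, keep RF-correct queries."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(demo_X, demo_y)
    preds = rf.predict(query_X)

    p = np.bincount(demo_y) / len(demo_y)        # empirical class frequencies
    p0 = max(np.sum(p ** 2), p.max())            # conservative label-distribution baseline
    if (preds == query_y).mean() <= p0:
        return None                              # discard the task entirely

    keep = preds == query_y                      # warm-up: train only on RF-correct queries, with RF labels
    return query_X[keep], preds[keep]

kept = filter_task(demo_X, demo_y, query_X, query_y)   # inputs from the sketches above
```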

5. Empirical Scaling Laws and Generalization

MachineLearningLM reveals a pronounced, monotonic increase in ICL accuracy as the number of demonstrations $M$ grows from 8 to 1024, distinguishing itself from vanilla LLM baselines that plateau or degrade at high $M$. On the 32-dataset TALENT subset, average test accuracy is as follows:

# shots $M$         8       16      32      64      128     512
MachineLearningLM   58.4%   63.1%   66.7%   70.0%   72.0%   75.3%

A simple log-linear law describes this trend: $\mathrm{Acc}(M) \approx a + b \log_2 M$, with $b \approx 2.3$ percentage points of gain per doubling of $M$.
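
The log-linear form can be checked directly against the table above; note that a least-squares fit on only these six points may give a slope somewhat different from the reported roughly 2.3 points per doubling, which is estimated over the full evaluation.

```python
import numpy as np

shots = np.array([8, 16, 32, 64, 128, 512])
acc = np.array([58.4, 63.1, 66.7, 70.0, 72.0, 75.3])
b, a = np.polyfit(np.log2(shots), acc, deg=1)   # least-squares slope b and intercept a
print(f"Acc(M) ~ {a:.1f} + {b:.1f} * log2(M)")
```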

Across 200 real tabular tasks from finance, biology, physics, and healthcare, MachineLearningLM outperforms other LLM-based systems (e.g., GPT-5-mini, o3-mini, Qwen-Instruct) by approximately 15 percentage points in 512-shot ICL, and matches random forests within 2 points at high $M$. MMLU benchmark results confirm preservation of general chat and reasoning ability (approximately 73.2% zero-shot and 75.4% at 50 shots).

6. Architectural and Practical Considerations

MachineLearningLM leverages a Qwen-2.5-7B-Instruct backbone with LoRA rank-8 adapters. The integer-based tabular prompt format packs more data into each context window, facilitating high-throughput batch prediction and broader coverage of many-shot scenarios. No architectural modifications or auxiliary output heads are required beyond standard next-token autoregressive training. No task-specific fine-tuning is performed; all general-purpose knowledge and reasoning skills are preserved.
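
A minimal sketch of such an adapter setup, assuming the Hugging Face transformers and peft libraries, is given below; the rank matches the text, while the target modules, scaling factor, and dropout are illustrative assumptions rather than reported hyperparameters.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora_cfg = LoraConfig(
    r=8,                                                       # rank-8 adapters, as stated above
    lora_alpha=16,                                             # assumed scaling factor
    lora_dropout=0.05,                                         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the adapter weights are trainable
```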

The two-stage training—teacher-guided, then ground-truth-driven—ensures both robustness in numerical modeling and sample efficiency. Token-efficient serialization supports scale-out to millions of synthetic tasks per pretraining epoch, greatly increasing the diversity and coverage of functional relationships encountered by the LLM. A plausible implication is that similar frameworks could be extended to other domains where high-quality synthetic simulators exist.

7. Significance and Outlook

Synthetic pre-training on SCMs enables general-purpose LLMs to reach or exceed specialist ML models' performance on tabular prediction under in-context learning, with scaling laws indicating continued gains up to and beyond 1,024-shot settings. This approach provides a fully self-supervised recipe for instilling amortized learning-to-learn behaviors, leveraging highly structured, richly varied synthetic problems generated en masse. Given the demonstrated retention of core LLM reasoning and knowledge abilities, further exploration of SCM-based synthetic pretraining may catalyze advances in other domains where ICL and sample-efficiency are critical, and represents a principled avenue for bridging classical ML decision strategies with deep foundation models.
