Generator Module for Exact Synthetic Data Sampling
Updated 12 November 2025
Generator Module (Gen) is a core component in TreeGen that builds a probability tree to model the exact joint distribution of categorical data.
It implements a Monte Carlo tree traversal algorithm to generate synthetic records that preserve observed conditional and marginal frequencies.
The module offers fast, parallelizable generation with full interpretability and exact frequency matching, although scalability can become a concern in high-dimensional or high-cardinality settings.
The Generator Module (“Gen”) is the central component of TreeGen, a Python package designed for exact, interpretable Monte Carlo sampling over the empirical joint distribution of a categorical data frame. Gen operates by traversing a compact probability-tree structure derived directly from the observed frequencies of unique rows in the data set, facilitating precise generation of synthetic records, data augmentation, compression, and feature extraction that faithfully respect all empirical conditional and marginal distributions.
<h2 class='paper-heading' id='probability-tree-data-structure'>1. Probability Tree Data Structure</h2>
Let a data frame $D$ have $m$ ordered columns $C_1,\dots,C_m$ and $N$ rows. The probability tree $T$ is a rooted tree of depth $m$: each root-to-leaf path $(v_1,\dots,v_m)$ corresponds uniquely to a possible record $(C_1=v_1,\dots,C_m=v_m)$, with frequency exactly matching its empirical count in $D$.
Each node at level $k$ encodes:
<ul>
<li><code>columnName</code> = $C_{k+1}$</li>
<li>A list of <code>dataNode</code> objects, each with:
<ul>
<li><code>value</code> = $v$ (a unique value of $C_{k+1}$ under the prefix $(v_1,\dots,v_k)$)</li>
<li><code>probability</code> = $p(v \mid \text{path}_k)$, where $\text{path}_k=(v_1,\dots,v_k)$</li>
<li><code>nextNode</code> = child <code>Node</code> for level $k+1$ (built from the rows of $D$ satisfying $C_1=v_1,\dots,C_{k+1}=v$)</li>
</ul>
</li>
</ul>
<p>Branch probabilities are computed as</p>
<p>$$p\bigl(C_{k+1}=v \mid C_1=v_1,\dots,C_k=v_k\bigr) = \frac{n(\text{path}_k, v)}{n(\text{path}_k)}$$</p>
<p>where $n(\text{path}_k, v)$ is the number of rows in $D$ matching the prefix and additionally satisfying $C_{k+1}=v$.</p>
<p>Because the tree stores conditional probabilities at each node, the full joint probability of a path $(v_1,\dots,v_m)$ is</p>
<p>$$P(v_1,\dots,v_m) = \prod_{k=0}^{m-1} p\bigl(C_{k+1}=v_{k+1} \mid C_1=v_1,\dots,C_k=v_k\bigr)$$</p>
<p>which guarantees that all empirical joint, marginal, and conditional frequencies are exactly recoverable from the tree's structure.</p>
<h2 class='paper-heading' id='monte-carlo-tree-traversal-algorithm'>2. Monte Carlo Tree Traversal Algorithm</h2>
<p>The Gen module implements sampling by random walks through the probability tree:</p>
<ul>
<li>Begin at the root node.</li>
<li>At each level $k$, randomly select a child according to the conditional probabilities stored in the current node.</li>
<li>Continue recursively until a leaf is reached, thus generating a complete synthetic record.</li>
</ul>
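The tree that this random walk traverses can be built with a short recursive pass over the rows, as described in Section 1. The sketch below uses a plain-dict node layout and a `build_tree` helper as illustrative stand-ins; they are not TreeGen's actual classes.

```python
from collections import Counter

# Illustrative sketch of probability-tree construction (Section 1).
# The dict layout {"columnName", "data": [{"value", "probability", "nextNode"}]}
# is a hypothetical stand-in for TreeGen's Node/dataNode classes.

def build_tree(rows, columns, level=0):
    """Count values of column `level` among `rows`, convert counts to
    conditional probabilities n(path_k, v) / n(path_k), and recurse."""
    if level == len(columns):
        return None  # leaf: all m columns assigned
    counts = Counter(row[level] for row in rows)
    total = len(rows)
    branches = []
    for value, n in counts.items():
        subset = [row for row in rows if row[level] == value]
        branches.append({
            "value": value,
            "probability": n / total,
            "nextNode": build_tree(subset, columns, level + 1),
        })
    return {"columnName": columns[level], "data": branches}

rows = [("a", "x"), ("a", "y"), ("a", "y"), ("b", "x")]
tree = build_tree(rows, ["C1", "C2"])
probs = {b["value"]: b["probability"] for b in tree["data"]}
print(probs)  # root-level marginals of C1: {'a': 0.75, 'b': 0.25}
```

Each subtree is built only from the rows matching its prefix, so every stored probability is an exact empirical conditional.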
<p>Pseudocode for generating a single record:</p>
<pre><code>function getRecord(node, k=0):
    if k == m:
        return []                      # all m columns assigned
    u ← rand.uniform(0, 1)
    cum ← 0
    for each (v_i, p_i, nextNode_i) in node.data:
        cum ← cum + p_i
        if u ≤ cum:
            return [v_i] + getRecord(nextNode_i, k+1)
    return [v_last] + getRecord(nextNode_last, k+1)   # guard against round-off
</code></pre>
<p>Each call selects one value per column, and because choices are made sequentially from the empirical conditionals, generated samples preserve the joint-dependence structure in expectation.</p>
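For concreteness, here is a runnable Python rendering of the traversal. The dict-based node layout is a hypothetical stand-in for TreeGen's Node/dataNode classes, not the package's actual API.

```python
import random

# Runnable rendering of the per-record sampler. A node here is a dict
# {"data": [{"value", "probability", "nextNode"}, ...]} (illustrative only).

def get_record(node, rng):
    """Random walk from `node` to a leaf, picking one branch per level
    according to the stored conditional probabilities."""
    if node is None:          # past the last column (k == m)
        return []
    u = rng.uniform(0, 1)
    cum = 0.0
    for branch in node["data"]:
        cum += branch["probability"]
        if u <= cum:
            return [branch["value"]] + get_record(branch["nextNode"], rng)
    # Fallback for floating-point round-off: take the last branch.
    last = node["data"][-1]
    return [last["value"]] + get_record(last["nextNode"], rng)

# Tiny hand-built tree over columns (C1, C2):
# P(C1=a)=0.75 with P(C2=y|a)=2/3; P(C1=b)=0.25 with P(C2=x|b)=1.
node_a = {"data": [{"value": "x", "probability": 1/3, "nextNode": None},
                   {"value": "y", "probability": 2/3, "nextNode": None}]}
node_b = {"data": [{"value": "x", "probability": 1.0, "nextNode": None}]}
root = {"data": [{"value": "a", "probability": 0.75, "nextNode": node_a},
                 {"value": "b", "probability": 0.25, "nextNode": node_b}]}

rng = random.Random(42)   # explicit seed -> bitwise-reproducible runs
record = get_record(root, rng)
print(record)
```

Reseeding with the same value reproduces the same sequence of records exactly, which is how explicit seed control yields bitwise-reproducible runs.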
<p>Gen manages a private pseudo-random number generator (<code>random.Random</code> or NumPy RNG), enabling bitwise-reproducible runs via explicit seed control.</p>
<p>Generated statistics (marginals, conditionals) converge to those of the original data via the law of large numbers.</p>
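The convergence claim is easy to check empirically. The sketch below uses a single level of branch probabilities (an illustrative stand-in, not TreeGen code) and compares generated frequencies against the stored conditionals:

```python
import random
from collections import Counter

# Law-of-large-numbers check: generated marginals approach the stored
# branch probabilities as the sample count grows. Values and weights
# below are made up for illustration.

values = ["a", "b", "c"]
probabilities = [0.5, 0.3, 0.2]   # empirical conditionals at one node

rng = random.Random(0)            # seeded private RNG => reproducible run
n = 100_000
samples = rng.choices(values, weights=probabilities, k=n)

freq = {v: c / n for v, c in Counter(samples).items()}
for v, p in zip(values, probabilities):
    print(f"{v}: target={p:.3f} observed={freq[v]:.3f}")
```

With 100,000 samples the observed frequencies typically land within a few tenths of a percent of the targets.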
<h2 class='paper-heading' id='implementation-and-complexity'>3. Implementation and Complexity</h2>
<p>The TreeGen implementation utilizes the following class architecture:</p>
<div class='overflow-x-auto max-w-full my-4'><table class='table border-collapse w-full' style='table-layout: fixed'><thead><tr>
<th>Class</th>
<th>Attributes/Methods</th>
<th>Functionality</th>
</tr>
</thead><tbody><tr>
<td>Node</td>
<td>columnName (str), data (list of dataNode)</td>
<td>Tree node at each column</td>
</tr>
<tr>
<td>dataNode</td>
<td>value (category), probability (float), nextNode (Node or None)</td>
<td>Branch from a node</td>
</tr>
<tr>
<td>ProbabilityTree</td>
<td><code>__init__</code>, <code>_build_subtree</code>, <code>getTree</code>, <code>getColumns</code></td>
<td>Tree construction from a DataFrame</td>
</tr>
<tr>
<td>Generator</td>
<td><code>__init__</code>, <code>setSeed</code>, <code>getRecord</code>, <code>getRecords</code></td>
<td>Monte Carlo generation of synthetic records</td>
</tr>
</tbody></table></div>
<p>Complexity:</p>
<ul>
<li>Tree construction: $O(m \cdot N \log N)$ (per-level pass with sorting/hashing)</li>
<li>Space: $O(\min(N, m V^m))$ nodes ($V$ = maximum cardinality per column), often far less in practice (sparse tables)</li>
<li>Generation per record: $O(m \cdot V)$; typically $O(m)$ if $V$ is small</li>
<li>Output: generating $N$ samples costs $O(N m V)$ time and $O(N m)$ output space</li>
</ul>
<h2 class='paper-heading' id='usage-scenarios-and-workflows'>4. Usage Scenarios and Workflows</h2>
<p>TreeGen supports a concise data-augmentation and <a href="https://www.emergentmind.com/topics/synthetic-data-generation" rel="nofollow">synthetic data generation</a> pipeline.</p>
<p>Principal applications include:</p>
<ul>
<li>Data multiplicity increase for downstream ML or statistical modeling</li>
<li>Bayesian compression: efficient data storage via tree-based representation</li>
<li>Visualization of categorical interaction structure</li>
<li>Hierarchical, nonparametric modeling of relationships among columns</li>
<li>Feature extraction (e.g., most probable record, via <code>tree.getMaxRecord()</code>)</li>
</ul>
<h2 class='paper-heading' id='advantages-trade-offs-and-limitations'>5. Advantages, Trade-offs, and Limitations</h2>
<p>Comparative advantages:</p>
<ul>
<li>Multi-way splits: unlike binary decision trees, every node supports arbitrary degree (number of children), facilitating faithful modeling of categorical variables with high arity.</li>
<li>Exact empirical matching: no smoothing, binning, or parametric assumptions; all conditional probabilities reflect observed frequencies.</li>
<li>Fast, trivially parallelizable generation: each synthetic record is independent (given the PRNG seed).</li>
<li>Full interpretability: all probabilities are exposed in the tree structure, enabling direct inspection of dependencies.</li>
</ul>
<p>Known limitations:</p>
<ul>
<li>Scalability: combinatorial blowup as both $m$ and per-column cardinalities increase; memory and construction costs may be prohibitive for very high-dimensional or high-cardinality data.</li>
<li>Memory usage: each unique observed prefix (partial record) yields a node; worst-case $O(V^m)$ nodes.</li>
<li>Construction overhead: for very large data sets ($N \gg 10^6$) or many columns ($m \gg 100$), initial tree-building can become a bottleneck.</li>
<li>No generalization beyond support: unseen value combinations in training have zero probability in generated data. Smoothing, hybrid modeling, or back-off strategies are required if extrapolation beyond the observed support is desired.</li>
</ul>
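The Section 4 workflow can be sketched end to end with simplified stand-ins for the classes in the Section 3 table. The method names (`setSeed`, `getRecord`, `getRecords`, `_build_subtree`) follow that table, but the bodies here are illustrative assumptions, not TreeGen's actual implementation.

```python
import random
from collections import Counter

# Simplified stand-ins for the ProbabilityTree / Generator classes listed
# in the Section 3 table (illustrative bodies, not TreeGen's actual code).

class ProbabilityTree:
    def __init__(self, rows, columns):
        self.columns = list(columns)
        self.root = self._build_subtree(rows, 0)

    def _build_subtree(self, rows, level):
        """Return a list of (value, conditional probability, child) branches."""
        if level == len(self.columns):
            return None  # leaf
        counts = Counter(r[level] for r in rows)
        data = []
        for value, n in counts.items():
            subset = [r for r in rows if r[level] == value]
            data.append((value, n / len(rows),
                         self._build_subtree(subset, level + 1)))
        return data

    def getColumns(self):
        return self.columns

class Generator:
    def __init__(self, tree):
        self.tree = tree
        self.rng = random.Random()

    def setSeed(self, seed):
        self.rng = random.Random(seed)   # reproducible runs

    def getRecord(self):
        record, node = [], self.tree.root
        while node is not None:
            u, cum = self.rng.uniform(0, 1), 0.0
            value, nxt = node[-1][0], node[-1][2]   # round-off fallback
            for v, p, child in node:
                cum += p
                if u <= cum:
                    value, nxt = v, child
                    break
            record.append(value)
            node = nxt
        return tuple(record)

    def getRecords(self, n):
        return [self.getRecord() for _ in range(n)]

# Workflow: fit the tree on observed rows, seed, generate synthetic records.
rows = [("a", "x"), ("a", "y"), ("a", "y"), ("b", "x")]
gen = Generator(ProbabilityTree(rows, ["C1", "C2"]))
gen.setSeed(7)
synthetic = gen.getRecords(5)
print(synthetic)
```

Note that every generated record lies in the observed support; the combination `("b", "y")`, unseen in `rows`, can never be produced, illustrating the no-generalization limitation above.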
<h2 class='paper-heading' id='significance-and-context'>6. Significance and Context</h2>
<p>The Gen module in TreeGen addresses the problem of preserving empirical multivariate categorical dependencies in synthetic data generation, without the opacity or ad-hoc constraints of typical parametric or tree-split approaches. Its strictly nonparametric design and full transparency make it particularly suitable for statistical data compression, exact simulation studies, and tasks requiring preservation of conditional frequency structure. Unlike decision trees, which generally enforce binary splits and do not encode empirical frequencies, Gen’s probability tree achieves maximum fidelity at the cost of possible combinatorial expansion of the state space. This positions TreeGen’s Gen as a robust tool for small- to moderate-dimensional categorical datasets where interpretability and exact reproduction are essential (Niemczynowicz et al., 2020).</p>