Generator Module for Exact Synthetic Data Sampling
Updated 12 November 2025
Generator Module (Gen) is a core component in TreeGen that builds a probability tree to model the exact joint distribution of categorical data.
It implements a Monte Carlo tree traversal algorithm to generate synthetic records that preserve observed conditional and marginal frequencies.
The module offers fast, parallelizable generation with full interpretability and exact frequency matching, although scalability can become a concern in high-dimensional or high-cardinality settings.
The Generator Module (“Gen”) is the central component of TreeGen, a Python package designed for exact, interpretable Monte Carlo sampling over the empirical joint distribution of a categorical data frame. Gen operates by traversing a compact probability-tree structure derived directly from the observed frequencies of unique rows in the data set, facilitating precise generation of synthetic records, data augmentation, compression, and feature extraction that faithfully respect all empirical conditional and marginal distributions.
<h2 class='paper-heading' id='probability-tree-data-structure'>1. Probability Tree Data Structure</h2>
Let a data frame $D$ have $m$ ordered columns $C_1,\dots,C_m$ and $N$ rows. The probability tree $T$ is a rooted tree of depth $m$: each root-to-leaf path $(v_1,\dots,v_m)$ corresponds uniquely to a possible record $(C_1=v_1,\dots,C_m=v_m)$, with frequency exactly matching its empirical count in $D$.
Each node at level $k$ encodes:
<ul>
<li><code>columnName</code> = $C_{k+1}$</li>
<li>A list of <code>dataNode</code> objects, each with:
<ul>
<li><code>value</code> = $v$ (a unique value of $C_{k+1}$ under the prefix $(v_1,\dots,v_k)$)</li>
<li><code>probability</code> = $p(v \mid \text{path}_k)$, where $\text{path}_k=(v_1,\dots,v_k)$</li>
<li><code>nextNode</code> = child <code>Node</code> for level $k+1$ (built from the rows of $D$ satisfying $C_1=v_1,\dots,C_{k+1}=v$)</li>
</ul>
</li>
</ul>
<p>Branch probabilities are computed as</p>
<p>$$p\bigl(C_{k+1}=v \mid C_1=v_1,\dots,C_k=v_k\bigr) = \frac{n(\text{path}_k, v)}{n(\text{path}_k)}$$</p>
<p>where $n(\text{path}_k, v)$ is the number of rows in $D$ matching the prefix and additionally satisfying $C_{k+1}=v$.</p>
<p>Because the tree stores conditional probabilities at each node, the full joint probability of a path $(v_1,\dots,v_m)$ is</p>
<p>$$P(v_1,\dots,v_m) = \prod_{k=0}^{m-1} p\bigl(C_{k+1}=v_{k+1} \mid C_1=v_1,\dots,C_k=v_k\bigr)$$</p>
<p>which guarantees that all empirical joint, marginal, and conditional frequencies are exactly recoverable from the tree's structure.</p>
<h2 class='paper-heading' id='monte-carlo-tree-traversal-algorithm'>2. Monte Carlo Tree Traversal Algorithm</h2>
<p>The Gen module implements sampling by random walks through the probability tree:</p>
<ul>
<li>Begin at the root node.</li>
<li>At each level $k$, randomly select a child according to the conditional probabilities stored in the current node.</li>
<li>Continue recursively until a leaf is reached, thus generating a complete synthetic record.</li>
</ul>
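The tree that this random walk traverses can be built with a short recursive pass over the rows, as described in Section 1. The sketch below uses a plain-dict node layout and a `build_tree` helper as illustrative stand-ins; they are not TreeGen's actual classes.

```python
from collections import Counter

# Illustrative sketch of probability-tree construction (Section 1).
# The dict layout {"columnName", "data": [{"value", "probability", "nextNode"}]}
# is a hypothetical stand-in for TreeGen's Node/dataNode classes.

def build_tree(rows, columns, level=0):
    """Count values of column `level` among `rows`, convert counts to
    conditional probabilities n(path_k, v) / n(path_k), and recurse."""
    if level == len(columns):
        return None  # leaf: all m columns assigned
    counts = Counter(row[level] for row in rows)
    total = len(rows)
    branches = []
    for value, n in counts.items():
        subset = [row for row in rows if row[level] == value]
        branches.append({
            "value": value,
            "probability": n / total,
            "nextNode": build_tree(subset, columns, level + 1),
        })
    return {"columnName": columns[level], "data": branches}

rows = [("a", "x"), ("a", "y"), ("a", "y"), ("b", "x")]
tree = build_tree(rows, ["C1", "C2"])
probs = {b["value"]: b["probability"] for b in tree["data"]}
print(probs)  # root-level marginals of C1: {'a': 0.75, 'b': 0.25}
```

Each subtree is built only from the rows matching its prefix, so every stored probability is an exact empirical conditional.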
<p>Pseudocode for generating a single record:</p>
<pre><code>function getRecord(node, k=0):
    if k == m:
        return []                      # all m columns assigned
    u ← rand.uniform(0, 1)
    cum ← 0
    for each (v_i, p_i, nextNode_i) in node.data:
        cum ← cum + p_i
        if u ≤ cum:
            return [v_i] + getRecord(nextNode_i, k+1)
    return [v_last] + getRecord(nextNode_last, k+1)   # guard against round-off
</code></pre>
<p>Each call selects one value per column, and because choices are made sequentially from the empirical conditionals, generated samples preserve the joint-dependence structure in expectation.</p>
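For concreteness, here is a runnable Python rendering of the traversal. The dict-based node layout is a hypothetical stand-in for TreeGen's Node/dataNode classes, not the package's actual API.

```python
import random

# Runnable rendering of the per-record sampler. A node here is a dict
# {"data": [{"value", "probability", "nextNode"}, ...]} (illustrative only).

def get_record(node, rng):
    """Random walk from `node` to a leaf, picking one branch per level
    according to the stored conditional probabilities."""
    if node is None:          # past the last column (k == m)
        return []
    u = rng.uniform(0, 1)
    cum = 0.0
    for branch in node["data"]:
        cum += branch["probability"]
        if u <= cum:
            return [branch["value"]] + get_record(branch["nextNode"], rng)
    # Fallback for floating-point round-off: take the last branch.
    last = node["data"][-1]
    return [last["value"]] + get_record(last["nextNode"], rng)

# Tiny hand-built tree over columns (C1, C2):
# P(C1=a)=0.75 with P(C2=y|a)=2/3; P(C1=b)=0.25 with P(C2=x|b)=1.
node_a = {"data": [{"value": "x", "probability": 1/3, "nextNode": None},
                   {"value": "y", "probability": 2/3, "nextNode": None}]}
node_b = {"data": [{"value": "x", "probability": 1.0, "nextNode": None}]}
root = {"data": [{"value": "a", "probability": 0.75, "nextNode": node_a},
                 {"value": "b", "probability": 0.25, "nextNode": node_b}]}

rng = random.Random(42)   # explicit seed -> bitwise-reproducible runs
record = get_record(root, rng)
print(record)
```

Reseeding with the same value reproduces the same sequence of records exactly, which is how explicit seed control yields bitwise-reproducible runs.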
<p>Gen manages a private pseudo-random number generator (<code>random.Random</code> or NumPy RNG), enabling bitwise-reproducible runs via explicit seed control.</p>
<p>Generated statistics (marginals, conditionals) converge to those of the original data via the law of large numbers.</p>
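The convergence claim is easy to check empirically. The sketch below uses a single level of branch probabilities (an illustrative stand-in, not TreeGen code) and compares generated frequencies against the stored conditionals:

```python
import random
from collections import Counter

# Law-of-large-numbers check: generated marginals approach the stored
# branch probabilities as the sample count grows. Values and weights
# below are made up for illustration.

values = ["a", "b", "c"]
probabilities = [0.5, 0.3, 0.2]   # empirical conditionals at one node

rng = random.Random(0)            # seeded private RNG => reproducible run
n = 100_000
samples = rng.choices(values, weights=probabilities, k=n)

freq = {v: c / n for v, c in Counter(samples).items()}
for v, p in zip(values, probabilities):
    print(f"{v}: target={p:.3f} observed={freq[v]:.3f}")
```

With 100,000 samples the observed frequencies typically land within a few tenths of a percent of the targets.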
<h2 class='paper-heading' id='implementation-and-complexity'>3. Implementation and Complexity</h2>
<p>The TreeGen implementation utilizes the following class architecture:</p>
<div class='overflow-x-auto max-w-full my-4'><table class='table border-collapse w-full' style='table-layout: fixed'><thead><tr>
<th>Class</th>
<th>Attributes/Methods</th>
<th>Functionality</th>
</tr>
</thead><tbody><tr>
<td>Node</td>
<td>columnName (str), data (list of dataNode)</td>
<td>Tree node at each column</td>
</tr>
<tr>
<td>dataNode</td>
<td>value (category), probability (float), nextNode (Node or None)</td>
<td>Branch from a node</td>
</tr>
<tr>
<td>ProbabilityTree</td>
<td><code>__init__</code>, <code>_build_subtree</code>, <code>getTree</code>, <code>getColumns</code></td>
<td>Tree construction from a DataFrame</td>
</tr>
<tr>
<td>Generator</td>
<td><code>__init__</code>, <code>setSeed</code>, <code>getRecord</code>, <code>getRecords</code></td>
<td>Monte Carlo generation of synthetic records</td>
</tr>
</tbody></table></div>
<p>Complexity:</p>
<ul>
<li>Tree construction: $O(m \cdot N \log N)$ (per-level pass with sorting/hashing)</li>
<li>Space: $O(\min(N, m V^m))$ nodes ($V$ = maximum cardinality per column), often far less in practice (sparse tables)</li>
<li>Generation per record: $O(m \cdot V)$; typically $O(m)$ if $V$ is small</li>
<li>Output: generating $N$ samples costs $O(N m V)$ time and $O(N m)$ output space</li>
</ul>
<h2 class='paper-heading' id='usage-scenarios-and-workflows'>4. Usage Scenarios and Workflows</h2>
<p>TreeGen supports a concise data-augmentation and <a href="https://www.emergentmind.com/topics/synthetic-data-generation" rel="nofollow">synthetic data generation</a> pipeline.</p>
<p>Principal applications include:</p>
<ul>
<li>Data multiplicity increase for downstream ML or statistical modeling</li>
<li>Bayesian compression: efficient data storage via tree-based representation</li>
<li>Visualization of categorical interaction structure</li>
<li>Hierarchical, nonparametric modeling of relationships among columns</li>
<li>Feature extraction (e.g., most probable record, via <code>tree.getMaxRecord()</code>)</li>
</ul>
<h2 class='paper-heading' id='advantages-trade-offs-and-limitations'>5. Advantages, Trade-offs, and Limitations</h2>
<p>Comparative advantages:</p>
<ul>
<li>Multi-way splits: unlike binary decision trees, every node supports arbitrary degree (number of children), facilitating faithful modeling of categorical variables with high arity.</li>
<li>Exact empirical matching: no smoothing, binning, or parametric assumptions; all conditional probabilities reflect observed frequencies.</li>
<li>Fast, trivially parallelizable generation: each synthetic record is independent (given the PRNG seed).</li>
<li>Full interpretability: all probabilities are exposed in the tree structure, enabling direct inspection of dependencies.</li>
</ul>
<p>Known limitations:</p>
<ul>
<li>Scalability: combinatorial blowup as both $m$ and per-column cardinalities increase; memory and construction costs may be prohibitive for very high-dimensional or high-cardinality data.</li>
<li>Memory usage: each unique observed prefix (partial record) yields a node; worst-case $O(V^m)$ nodes.</li>
<li>Construction overhead: for very large data sets ($N \gg 10^6$) or many columns ($m \gg 100$), initial tree-building can become a bottleneck.</li>
<li>No generalization beyond support: unseen value combinations in training have zero probability in generated data. Smoothing, hybrid modeling, or back-off strategies are required if extrapolation beyond the observed support is desired.</li>
</ul>
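The Section 4 workflow can be sketched end to end with simplified stand-ins for the classes in the Section 3 table. The method names (`setSeed`, `getRecord`, `getRecords`, `_build_subtree`) follow that table, but the bodies here are illustrative assumptions, not TreeGen's actual implementation.

```python
import random
from collections import Counter

# Simplified stand-ins for the ProbabilityTree / Generator classes listed
# in the Section 3 table (illustrative bodies, not TreeGen's actual code).

class ProbabilityTree:
    def __init__(self, rows, columns):
        self.columns = list(columns)
        self.root = self._build_subtree(rows, 0)

    def _build_subtree(self, rows, level):
        """Return a list of (value, conditional probability, child) branches."""
        if level == len(self.columns):
            return None  # leaf
        counts = Counter(r[level] for r in rows)
        data = []
        for value, n in counts.items():
            subset = [r for r in rows if r[level] == value]
            data.append((value, n / len(rows),
                         self._build_subtree(subset, level + 1)))
        return data

    def getColumns(self):
        return self.columns

class Generator:
    def __init__(self, tree):
        self.tree = tree
        self.rng = random.Random()

    def setSeed(self, seed):
        self.rng = random.Random(seed)   # reproducible runs

    def getRecord(self):
        record, node = [], self.tree.root
        while node is not None:
            u, cum = self.rng.uniform(0, 1), 0.0
            value, nxt = node[-1][0], node[-1][2]   # round-off fallback
            for v, p, child in node:
                cum += p
                if u <= cum:
                    value, nxt = v, child
                    break
            record.append(value)
            node = nxt
        return tuple(record)

    def getRecords(self, n):
        return [self.getRecord() for _ in range(n)]

# Workflow: fit the tree on observed rows, seed, generate synthetic records.
rows = [("a", "x"), ("a", "y"), ("a", "y"), ("b", "x")]
gen = Generator(ProbabilityTree(rows, ["C1", "C2"]))
gen.setSeed(7)
synthetic = gen.getRecords(5)
print(synthetic)
```

Note that every generated record lies in the observed support; the combination `("b", "y")`, unseen in `rows`, can never be produced, illustrating the no-generalization limitation above.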
<h2 class='paper-heading' id='significance-and-context'>6. Significance and Context</h2>
<p>The Gen module in TreeGen addresses the problem of preserving empirical multivariate categorical dependencies in synthetic data generation, without the opacity or ad-hoc constraints of typical parametric or tree-split approaches. Its strictly nonparametric design and full transparency make it particularly suitable for statistical data compression, exact simulation studies, and tasks requiring preservation of conditional frequency structure. Unlike decision trees, which generally enforce binary splits and do not encode empirical frequencies, Gen’s probability tree achieves maximum fidelity at the cost of possible combinatorial expansion of the state space. This positions TreeGen’s Gen as a robust tool for small- to moderate-dimensional categorical datasets where interpretability and exact reproduction are essential (Niemczynowicz et al., 2020).</p>