
Generator Module for Exact Synthetic Data Sampling

Updated 12 November 2025
  • Generator Module (Gen) is a core component in TreeGen that builds a probability tree to model the exact joint distribution of categorical data.
  • It implements a Monte Carlo tree traversal algorithm to generate synthetic records that preserve observed conditional and marginal frequencies.
  • The module offers fast, parallelizable generation with full interpretability and exact frequency matching, while scalability may be challenged in high-dimensional settings.

The Generator Module (“Gen”) is the central component of TreeGen, a Python package designed for exact, interpretable Monte Carlo sampling over the empirical joint distribution of a categorical data frame. Gen operates by traversing a compact probability-tree structure derived directly from the observed frequencies of unique rows in the data set, facilitating precise generation of synthetic records, data augmentation, compression, and feature extraction that faithfully respect all empirical conditional and marginal distributions.

1. Probability Tree Data Structure

Let a data frame $D$ have $m$ ordered columns $C_1, \dots, C_m$ and $N$ rows. The probability tree $T$ is a rooted tree of depth $m$, with each root-to-leaf path $(v_1, \dots, v_m)$ uniquely corresponding to a possible record $(C_1=v_1, \dots, C_m=v_m)$ with frequency exactly matching its empirical count in $D$.

Each node at level $k$ encodes:

  • columnName = $C_{k+1}$
  • A list of dataNode objects, each with:
    • `value` = $v$ (a unique value of $C_{k+1}$ under the prefix $[v_1, \dots, v_k]$)
    • `probability` = $p(v \mid \text{path}_k)$, where $\text{path}_k = (v_1, \dots, v_k)$
    • `nextNode` = child Node for level $k+1$ (built from rows of $D$ satisfying $C_1=v_1, \dots, C_{k+1}=v$)

Branch probabilities are computed as

$$p\bigl(C_{k+1}=v \mid C_{1}=v_{1},\dots,C_{k}=v_{k}\bigr) = \frac{n(\text{path}_k,v)}{n(\text{path}_k)}$$

where $n(\text{path}_k,v)$ is the number of rows in $D$ matching the prefix and $C_{k+1}=v$, and $n(\text{path}_k)$ is the number of rows matching the prefix alone.

Because the tree stores conditional probabilities at each node, the full joint for a path $(v_1,\dots,v_m)$ is

$$P(v_1,\dots,v_m) = \prod_{k=0}^{m-1} p(C_{k+1}=v_{k+1}\mid C_{1}=v_1, \dots, C_k=v_k)$$

which guarantees that all empirical joint, marginal, and conditional frequencies are exactly recoverable from the tree's structure.

2. Monte Carlo Tree Traversal Algorithm

The Gen module implements sampling by random walks through the probability tree:

  • Begin at the root node.
  • At each level $k$, randomly select a child according to the conditional probabilities stored in the current node.
  • Continue recursively until a leaf is reached, thus generating a complete synthetic record.

Pseudocode for generating a single record:
      
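      function getRecord(node, k=0):
          # Base case: all m columns have been assigned
          if k == m:
              return []
          u ← rand.uniform(0,1)
          cum ← 0
          # Inverse-CDF selection over the branches of the current node
          for each (v_i, p_i, nextNode_i) in node.data:
              cum ← cum + p_i
              if u ≤ cum:
                  suffix ← getRecord(nextNode_i, k+1)
                  return [v_i] + suffix
          # Fallback if round-off leaves cum slightly below 1: take the last branch of node.data
          return [v_last] + getRecord(nextNode_last, k+1)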
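
The pseudocode above can be made concrete in a few lines of plain Python. The following is a minimal sketch under stated assumptions, not TreeGen's implementation: nodes are plain lists of (value, probability, child) tuples rather than the package's Node and dataNode classes, and `build_tree` and `get_record` are hypothetical names introduced here for illustration.

```python
import random
from collections import Counter, defaultdict


def build_tree(rows, k=0):
    """Build a nested probability tree from a list of equal-length tuples.

    Each node is a list of (value, probability, child) triples; child is
    None after the last column. Illustrative only, not TreeGen's classes.
    """
    if not rows or k == len(rows[0]):
        return None
    groups = defaultdict(list)
    for row in rows:
        groups[row[k]].append(row)      # rows sharing the value of column k
    total = len(rows)                   # n(path_k)
    return [(v, len(g) / total, build_tree(g, k + 1)) for v, g in groups.items()]


def get_record(node, rng):
    """Walk from root to leaf, picking one branch per level by inverse-CDF sampling."""
    record = []
    while node is not None:
        u = rng.random()
        cum = 0.0
        chosen = node[-1]               # fallback guards against round-off in the sum
        for branch in node:
            cum += branch[1]
            if u <= cum:
                chosen = branch
                break
        record.append(chosen[0])
        node = chosen[2]
    return tuple(record)


# Hypothetical toy data with two categorical columns.
rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
tree = build_tree(rows)
rng = random.Random(0)                  # private, seeded generator
samples = [get_record(tree, rng) for _ in range(2000)]
counts = Counter(samples)
```

Because each branch probability is a ratio of counts, the product along a path equals that row's empirical frequency: here ("a", "x") has probability 0.75 × 2/3 = 0.5, and the unobserved pair ("b", "x") can never be drawn.
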

Each call selects one value per column, and because choices are made sequentially from the empirical conditionals, generated samples preserve the joint-dependence structure in expectation.

Gen manages a private pseudo-random number generator (`random.Random` or NumPy RNG), enabling bitwise-reproducible runs via explicit seed control.

Generated statistics (marginals, conditionals) converge to those of the original data via the law of large numbers.

3. Implementation and Complexity

The TreeGen implementation utilizes the following class architecture:

| Class | Attributes/Methods | Functionality |
|---|---|---|
| Node | columnName (str), data (list of dataNode) | Tree node at each column |
| dataNode | value (category), probability (float), nextNode (Node or None) | Branch from a node |
| ProbabilityTree | `__init__`, `_build_subtree`, `getTree`, `getColumns` | Tree construction from a DataFrame |
| Generator | `__init__`, `setSeed`, `getRecord`, `getRecords` | Monte Carlo generation of synthetic records |

Complexity:

  • Tree construction: $O(m\cdot N \log N)$ (per-level pass with sorting/hashing)
  • Space: $O(\min(N, mV^m))$ ($V$ = maximum cardinality per column), often far less in practice (sparse tables)
  • Generation per record: $O(m\cdot V)$; typically $O(m)$ if $V$ is small
  • Output: Generating $N$ samples costs $O(NmV)$ time, $O(Nm)$ output space

4. Usage Scenarios and Workflows

TreeGen supports a concise data-augmentation and synthetic data generation pipeline:
      
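      import pandas as pd
      from TreeGen.Generator import ProbabilityTree, Generator

      # Load the categorical data (reading .xlsx needs an Excel engine such as openpyxl)
      df = pd.read_excel('survey_responses.xlsx')

      # Build the probability tree from the DataFrame
      tree = ProbabilityTree(df, verbose=True)

      # Write a drawing of the tree to an SVG file
      tree.drawTree(graph_filename='tree.svg', show=False)

      # Seeded generator for reproducible sampling
      gen = Generator(tree, seed=42)

      # Draw 1000 synthetic records and preview the first rows
      synthetic = gen.getRecords(1000)
      print(synthetic.head())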
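
One way to sanity-check a generated sample is to compare its per-column frequencies with those of the original data. The sketch below is a hedged illustration that does not call TreeGen: `marginal_frequencies` and `max_marginal_gap` are hypothetical helpers, records are plain dicts (a DataFrame can be converted with `df.to_dict("records")`), and resampling whole rows with replacement stands in for the generator, since both draw from the empirical joint distribution.

```python
import random
from collections import Counter


def marginal_frequencies(records, column):
    """Relative frequency of each value of `column` in a list of dict records."""
    counts = Counter(r[column] for r in records)
    n = len(records)
    return {value: c / n for value, c in counts.items()}


def max_marginal_gap(original, synthetic):
    """Largest absolute difference between any marginal frequency in the two sets."""
    gap = 0.0
    for column in original[0]:
        f_orig = marginal_frequencies(original, column)
        f_syn = marginal_frequencies(synthetic, column)
        for value in set(f_orig) | set(f_syn):
            gap = max(gap, abs(f_orig.get(value, 0.0) - f_syn.get(value, 0.0)))
    return gap


# Hypothetical toy data; row resampling is an assumed stand-in for gen.getRecords(n).
original = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
    {"color": "blue", "size": "L"},
]
rng = random.Random(42)
small = rng.choices(original, k=100)
large = rng.choices(original, k=20000)
gap_small = max_marginal_gap(original, small)
gap_large = max_marginal_gap(original, large)
```

With more generated records the gap is expected to shrink toward zero, which is the law-of-large-numbers behaviour described in Section 2.
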

Principal applications include:

  • Data multiplicity increase for downstream ML or statistical modeling
  • Bayesian compression: efficient data storage via tree-based representation
  • Visualization of categorical interaction structure
  • Hierarchical, nonparametric modeling of relationships among columns
  • Feature extraction (e.g., most probable record, via `tree.getMaxRecord()`)

5. Advantages, Trade-offs, and Limitations

Comparative advantages:

  • Multi-way splits: Unlike binary decision trees, every node supports arbitrary degree (number of children), facilitating faithful modeling of categorical variables with high arity.
  • Exact empirical matching: No smoothing, binning, or parametric assumptions; all conditional probabilities reflect observed frequencies.
  • Fast, trivially parallelizable generation: Each synthetic record is independent (given PRNG seed).
  • Full interpretability: All probabilities exposed in tree structure, enabling direct inspection of dependencies.

Known limitations:

  • Scalability: Combinatorial blowup as both $m$ and per-column cardinalities increase; memory and construction costs may be prohibitive for very high-dimensional or high-cardinality data.
  • Memory usage: Each unique observed prefix (partial record) yields a node; worst-case $O(V^m)$ nodes.
  • Construction overhead: For very large datasets ($N \gg 10^6$) or many columns ($m \gg 100$), initial tree-building can become a bottleneck.
  • No generalization beyond support: Value combinations absent from the training data have zero probability in generated data. Smoothing, hybrid modeling, or back-off strategies are required if extrapolation beyond the observed support is desired.
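
The memory-usage limitation can be gauged before building a tree by counting distinct row prefixes. The sketch below is illustrative only: `count_tree_nodes` is a hypothetical helper, not part of TreeGen, and it counts one entry per distinct non-empty prefix, which may differ by a small factor from how TreeGen allocates its Node and dataNode objects.

```python
from itertools import product


def count_tree_nodes(rows):
    """Count distinct non-empty prefixes of the rows (one per branch in the tree)."""
    prefixes = set()
    for row in rows:
        for k in range(1, len(row) + 1):
            prefixes.add(tuple(row[:k]))
    return len(prefixes)


# Dense case: every combination of V=3 values over m=4 columns is observed.
dense = list(product(range(3), repeat=4))
# Sparse case: only a handful of combinations occur (hypothetical data).
sparse = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 2, 2), (2, 1, 0, 0)]

dense_nodes = count_tree_nodes(dense)    # 3 + 9 + 27 + 81 = 120
sparse_nodes = count_tree_nodes(sparse)  # 2 + 3 + 3 + 4 = 12
```

The dense table grows geometrically with $m$, while the sparse table stays below the $N \cdot m$ bound, since each row can contribute at most one new prefix per column.
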

6. Significance and Context

The Gen module in TreeGen addresses the problem of preserving empirical multivariate categorical dependencies in synthetic data generation, without the opacity or ad-hoc constraints of typical parametric or tree-split approaches. Its strict nonparametricity and full transparency make it particularly suitable for statistical data compression, exact simulation studies, and tasks requiring preservation of conditional frequency structure. Unlike decision trees, which generally enforce binary splits and do not encode empirical frequencies, Gen’s probability tree achieves maximum fidelity at the cost of possible combinatorial expansion in state-space. This positions TreeGen’s Gen as a robust tool for small to moderate-dimensional categorical datasets where interpretability and exact reproduction are essential (Niemczynowicz et al., 2020).
