Guided FP-Growth: Targeted Pattern Mining

Updated 20 May 2026
  • Guided FP-Growth is a targeted mining method that computes exact occurrence counts for specified itemsets by restricting FP-tree traversal to relevant branches.
  • It employs a TIS-tree structure to guide the mining process, thereby minimizing computational overhead and memory usage compared to conventional FP-growth.
  • Reported experiments show speedups of roughly 5× to 80× over FP-growth in rule mining for imbalanced datasets, as demonstrated by the Minority-Report Algorithm.

Guided FP-Growth (GFP-growth) is a method for multitude-targeted mining that computes the exact occurrence counts of a specified, potentially large set of target itemsets within transactional data. By restricting tree traversal to only those branches relevant to user-specified queries, GFP-growth achieves dramatic gains in time and memory efficiency over traditional FP-growth pattern mining. Its theoretical guarantees and practical implementations make it particularly effective for applications such as rule mining in highly imbalanced datasets, as demonstrated in the Minority-Report Algorithm (Shabtay et al., 2018).

1. Formal Definition of the Multitude-Targeted Mining Problem

Let $I = \{a_1, a_2, \dots, a_m\}$ denote the universe of $m$ items, and $DB = \{T_1, T_2, \dots, T_n\}$ a transaction database, where each $T_i \subseteq I$. An itemset $\alpha \subseteq I$ has support $\sigma(\alpha) = C(\alpha) / |DB|$, where $C(\alpha)$ is the number of transactions containing $\alpha$. The multitude-targeted mining problem is defined as:

Given:

  • An FP-tree constructed over $DB$, either for all items or only for items exceeding a minimum support,
  • A collection $\mathcal{Q} = \{\alpha^{(1)}, \dots, \alpha^{(k)}\}$ of $k$ target itemsets,

Compute $C(\alpha)$ for each $\alpha \in \mathcal{Q}$, traversing only the parts of the FP-tree that are required for these queries.

This restriction distinguishes GFP-growth from standard frequent pattern mining, which enumerates all frequent patterns, often resulting in superfluous computation.
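To make the quantities in the problem statement concrete, the following minimal sketch computes $C(\alpha)$ and $\sigma(\alpha)$ by brute force over a toy database. It is only a reference baseline for checking results, not GFP-growth itself; the function names and the toy data are illustrative assumptions.

```python
def occurrence_count(db, itemset):
    """C(alpha): number of transactions that contain every item of itemset."""
    target = frozenset(itemset)
    return sum(1 for transaction in db if target <= set(transaction))


def support(db, itemset):
    """sigma(alpha) = C(alpha) / |DB|."""
    return occurrence_count(db, itemset) / len(db)


# Toy transaction database (hypothetical data).
db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]

# The target collection Q: only these itemsets need counts.
targets = [{"a", "c"}, {"b"}, {"a", "b", "c"}]
counts = {frozenset(q): occurrence_count(db, q) for q in targets}
```

This baseline scans the whole database once per target, which is exactly the cost that a guided traversal of a shared FP-tree is meant to avoid.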

2. GFP-Growth Algorithm Structure

GFP-growth coordinates the traversal of two trees:

  • The FP-tree, whose header table maps each item to the linked list of tree nodes carrying that item,
  • The Target Item-Set tree (TIS-tree), whose nodes correspond to all prefixes of itemsets in $\mathcal{Q}$.

Each TIS-tree node encapsulates:

  • a count: the occurrence count of the itemset spelled out by the path from the root to the node (initialized to $0$),
  • a target flag: a boolean indicating whether that itemset belongs to $\mathcal{Q}$.
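The node layout described above could be sketched as follows. The names `TISNode`, `build_tis_tree`, and the `support` dictionary are hypothetical, and the sketch assumes that every item of every target has a known support, so that targets can be inserted in support-ascending order.

```python
from dataclasses import dataclass, field


@dataclass
class TISNode:
    item: object = None          # item labelling the edge into this node
    count: int = 0               # occurrence count, filled in during mining
    is_target: bool = False      # True if the path to this node is in Q
    children: dict = field(default_factory=dict)


def build_tis_tree(targets, support):
    """Insert each target itemset with its least frequent item first."""
    root = TISNode()
    for target in targets:
        node = root
        for item in sorted(target, key=support.__getitem__):
            node = node.children.setdefault(item, TISNode(item=item))
        node.is_target = True
    return root


# Hypothetical item supports and target collection.
support = {"a": 3, "b": 2, "c": 4}
root = build_tis_tree([{"a", "c"}, {"b"}, {"a", "b", "c"}, {"b", "c"}], support)
```

Targets that share a low-support prefix share TIS-tree nodes, so `{b}`, `{b, c}`, and `{a, b, c}` all hang off the single child `b` of the root.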

Traversal proceeds in support-ascending order, so that each path in the TIS-tree mirrors the bottom-up conditional-pattern-base expansion in FP-growth. The core of the algorithm is as follows (paraphrasing comments from the original pseudocode):

  • Iterate over the children of the current TIS-node in support order.
  • If the FP-tree header entry for a child's item is nonempty, accumulate the counts over the FP-tree nodes in that entry's linked list.
  • For target nodes, record the accumulated total as the node's occurrence count.
  • If children exist, recursively construct a filtered conditional FP-tree using only items needed in the relevant subtree (FilteredItemSet), then recurse.

This structure ensures that only the minimal necessary subtrees are constructed and that counts are computed exclusively for the set $\mathcal{Q}$.
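The coordinated traversal can be illustrated with a compact sketch. This is a simplified reading of the steps listed above, not the original pseudocode or the C implementation: all class and function names are hypothetical, the FP-tree header is a plain dictionary of node lists, conditional trees keep the global item order, and siblings are visited in insertion order (which does not affect the counts here, since each child is processed independently).

```python
from collections import defaultdict


class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}


class FPTree:
    def __init__(self):
        self.root = FPNode(None, None)
        self.header = defaultdict(list)  # item -> FP-nodes carrying that item

    def insert(self, items, count=1):
        node = self.root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                self.header[item].append(child)
            child.count += count
            node = child


class TISNode:
    def __init__(self):
        self.count, self.is_target, self.children = 0, False, {}


def build_tis_tree(targets, rank):
    """rank[item] = position in support-descending order (0 = most frequent)."""
    root = TISNode()
    for target in targets:
        node = root
        # Least frequent item first, mirroring bottom-up FP-growth.
        for item in sorted(target, key=rank.__getitem__, reverse=True):
            node = node.children.setdefault(item, TISNode())
        node.is_target = True
    return root


def filtered_item_set(tis_node):
    """Items that occur anywhere below tis_node in the TIS-tree."""
    items = set()
    for item, child in tis_node.children.items():
        items.add(item)
        items |= filtered_item_set(child)
    return items


def gfp_growth(tis_node, fp_tree):
    for item, child in tis_node.children.items():
        nodes = fp_tree.header.get(item, [])
        if not nodes:
            continue  # item absent: every count below this child stays 0
        total = sum(n.count for n in nodes)
        if child.is_target:
            child.count = total
        if child.children:
            keep = filtered_item_set(child)
            cond = FPTree()  # filtered conditional FP-tree for this prefix
            for n in nodes:
                path, p = [], n.parent
                while p.item is not None:
                    if p.item in keep:
                        path.append(p.item)
                    p = p.parent
                cond.insert(reversed(path), n.count)
            gfp_growth(child, cond)


# Toy data (hypothetical): supports are c=4, a=3, b=3.
db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}, {"c"}]
rank = {"c": 0, "a": 1, "b": 2}
fp = FPTree()
for t in db:
    fp.insert(sorted(t, key=rank.__getitem__))  # most frequent item first
tis = build_tis_tree([{"a", "c"}, {"b"}, {"a", "b", "c"}, {"b", "c"}], rank)
gfp_growth(tis, fp)
```

After the call, the counts sit on the target nodes of `tis`; for example, `tis.children["b"].children["a"].children["c"].count` holds the count of `{a, b, c}`.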

3. Guided Conditional-Pattern-Base Construction

A central optimization is the guided construction of conditional pattern bases:

  • For the prefix at a TIS-node, a compact structure (bitmap or hash set) enumerates the “FilteredItemSet”: the items that appear in any descendant target of that node.
  • During conditional FP-tree construction, any item not in FilteredItemSet is omitted.

This pruning substantially limits the size of conditional trees, focusing compute effort on only those transactions and items relevant for the downstream targets. This strategy contrasts with conventional FP-growth, which would otherwise include all co-occurring items at each recursion.
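The effect of this filtering on a conditional pattern base can be sketched in isolation. The example below assumes a TIS-tree represented as nested dictionaries and a pattern base given as `(path, count)` pairs; both representations, the function names, and the data are illustrative assumptions rather than the structures used in the original implementation.

```python
from collections import Counter


def filtered_item_sets(tis, prefix=()):
    """Map every TIS-tree prefix to the set of items occurring below it.

    `tis` is a nested dict: item -> subtree (an empty dict marks a leaf).
    """
    result = {}
    below = set()
    for item, subtree in tis.items():
        child_prefix = prefix + (item,)
        result.update(filtered_item_sets(subtree, child_prefix))
        below |= {item} | result[child_prefix]
    result[prefix] = below
    return result


def guided_pattern_base(prefix_paths, keep):
    """Drop items outside `keep` from each path and merge identical paths."""
    merged = Counter()
    for path, count in prefix_paths:
        merged[tuple(i for i in path if i in keep)] += count
    return merged


# TIS-tree for the targets {a,c}, {b,c}, {a,b,c} (least frequent item first).
tis = {"b": {"a": {"c": {}}, "c": {}}, "a": {"c": {}}}
filtered = filtered_item_sets(tis)

# Hypothetical prefix paths of item "b"; "d" and "e" are not needed by any target.
paths_of_b = [(("c", "d", "a"), 1), (("c", "e"), 1), (("a", "d"), 1), (("c", "d"), 2)]
pruned = guided_pattern_base(paths_of_b, filtered[("b",)])
```

In this toy case four prefix paths over four items collapse to three paths over two items, because paths that differ only in unneeded items merge.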

4. Correctness Guarantees

The main correctness theorem asserts:

Upon GFP-growth termination, for every target node with itemset $\alpha$, the count stored at the node equals $C(\alpha)$ in $DB$.

Proof (sketch): By induction, the FP-subtree reached for any prefix faithfully represents the transactions containing that prefix. Recursing only when the header entry for the next item is nonempty ensures that each target's occurrence count is accumulated exactly, and guided conditional FP-trees remain correct by the induction hypothesis, since filtering removes only items that no descendant target needs. If the header entry is empty, no further recursion occurs and the counts below that node remain zero.

5. Complexity Analysis

Let:

  • $n = |DB|$ (transactions),
  • $m = |I|$ (items),
  • … (average transaction length),
  • $k = |\mathcal{Q}|$ (number of targets),
  • … (average length of target itemsets),
  • … (number of FP-tree nodes actually visited by GFP-growth).

Time and memory requirements are:

| Operation | Time Complexity | Memory Complexity |
|---|---|---|
| FP-tree construction | … | … |
| TIS-tree over $k$ targets | … | |
| GFP-growth recursion | … (pruned) | … (TIS-tree plus bitmaps) |

Overall,

  • Time(GFP-growth) = …
  • Time(FP-growth) = …

with the cost of GFP-growth far below that of FP-growth whenever the number of targets $k$ is much smaller than the total number of frequent patterns. Memory overhead is limited to the TIS-tree and the lightweight filtering structures at each TIS-node.

6. Optimizations and Implementation Considerations

  • Header table lookups are constant-time ($O(1)$), avoiding unnecessary subtree evaluations.
  • Conditional tree construction is omitted for TIS-leaf nodes.
  • Each TIS-node maintains its FilteredItemSet, minimizing conditional tree size.
  • Exact counting is performed only for target nodes, reducing unnecessary accumulation for intermediate prefixes.
  • Consistent item ordering (support-descending for FP-tree, support-ascending for TIS-tree) supports memory locality and avoids additional sorting.

These choices collectively enhance practical efficiency and scalability.

7. Application: The Minority-Report Algorithm

The Minority-Report Algorithm leverages GFP-growth for targeted rule mining in class-imbalanced datasets. It finds all association rules whose consequent is the minority class and that exceed a minimum support and a minimum confidence. Steps include:

  1. Compute occurrence counts for items in the minority-class (rare) subset of the database; retain those that meet the minimum-support threshold.
  2. Construct two FP-trees over the reduced item universe, one for the majority-class transactions and one for the minority-class transactions.
  3. Mine the minority FP-tree for all itemsets $\alpha$ whose minority-class support is above the threshold; insert them into the TIS-tree.
  4. Use GFP-growth on the majority FP-tree to assign majority-class counts to these itemsets.
  5. For each $\alpha$, compute the rule confidence from its minority and majority counts and output the rules that meet the minimum confidence.

Correctness follows from FP-growth finding all minority-class frequent itemsets and GFP-growth yielding corresponding majority counts. Resulting support and confidence values are exact, and all qualifying rules are output.
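The final rule-selection step can be sketched as below. The sketch assumes the two count tables are already available (minority counts from mining the minority FP-tree, majority counts from GFP-growth), measures rule support relative to the full database, and takes confidence as the minority count divided by the total count of the itemset; the function name, parameters, and numbers are hypothetical.

```python
def select_rules(minority_counts, majority_counts, n_total, min_support, min_confidence):
    """Return (itemset, support, confidence) for rules itemset => minority class."""
    rules = []
    for itemset, c_min in minority_counts.items():
        if c_min == 0:
            continue  # never observed with the minority class
        c_maj = majority_counts.get(itemset, 0)
        support = c_min / n_total              # assumption: relative to all transactions
        confidence = c_min / (c_min + c_maj)   # share of occurrences in the minority class
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, support, confidence))
    return rules


# Hypothetical counts for three candidate itemsets in a database of 1000 transactions.
minority_counts = {frozenset({"x"}): 8, frozenset({"x", "y"}): 6, frozenset({"y"}): 7}
majority_counts = {frozenset({"x"}): 72, frozenset({"x", "y"}): 2, frozenset({"y"}): 21}

rules = select_rules(minority_counts, majority_counts,
                     n_total=1000, min_support=0.005, min_confidence=0.5)
```

Only `{x, y}` survives here: it occurs 6 times with the minority class and 2 times with the majority class, giving a confidence of 0.75, while `{x}` and `{y}` are dominated by majority-class occurrences.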

8. Empirical Evaluation

The implementation extends Christian Borgelt's C FP-growth codebase. Experiments covered both synthetic Bernoulli-model datasets and real data (UCI Adult/Census Income), with varying rarity of the minority class. Table 1 summarizes representative speedup results of GFP-growth vs. FP-growth.

| Scenario | Minority-class rarity | Transactions | Speedup (FP/GFP) |
|---|---|---|---|
| Simulation (…, …) | 0.01 | 50,000 | 20× – 80× |
| Simulation (…, …) | 0.10 | 50,000 | 5× – 20× |
| Census Income (real data) | 0.05 | 22,500 | 30× – 50× |

As the target class becomes rarer, the gains of GFP-growth increase markedly, confirming that focused, guided mining of pre-specified itemsets achieves orders-of-magnitude performance improvement over exhaustive frequent pattern enumeration. When the target set is substantially smaller than the set of all frequent patterns, GFP-growth exhibits near-linear scalability in the number of targets, memory, and execution time (Shabtay et al., 2018).
