Guided FP-Growth: Targeted Pattern Mining
- Guided FP-Growth is a targeted mining method that computes exact occurrence counts for specified itemsets by restricting FP-tree traversal to relevant branches.
- It employs a TIS-tree structure to guide the mining process, thereby minimizing computational overhead and memory usage compared to conventional FP-growth.
- Empirical evaluations show dramatic improvements, with orders-of-magnitude speedups in rule mining for imbalanced datasets, as demonstrated by the Minority-Report Algorithm.
Guided FP-Growth (GFP-growth) is a method for multitude-targeted mining that computes the exact occurrence counts of a specified, potentially large set of target itemsets within transactional data. By restricting tree traversal to only those branches relevant to user-specified queries, GFP-growth achieves dramatic gains in time and memory efficiency over traditional FP-growth pattern mining. Its theoretical guarantees and practical implementations make it particularly effective for applications such as rule mining in highly imbalanced datasets, as demonstrated in the Minority-Report Algorithm (Shabtay et al., 2018).
1. Formal Definition of the Multitude-Targeted Mining Problem
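Let $I$ denote the universe of items, and $D = \{t_1, \dots, t_n\}$ a transaction database, where each $t_j \subseteq I$. An itemset $\alpha \subseteq I$ has support $\mathrm{supp}(\alpha) = C(\alpha) / |D|$, where $C(\alpha)$ is the number of transactions containing $\alpha$. The multitude-targeted mining problem is defined as: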
Given:
- An FP-tree constructed over $D$, either for all items or only for items exceeding a minimum support,
- A collection $T = \{\alpha_1, \dots, \alpha_k\}$ of target itemsets,
Compute the occurrence count $C(\alpha)$ for each $\alpha \in T$, traversing only the parts of the FP-tree that are required for these queries.
This restriction distinguishes GFP-growth from standard frequent pattern mining, which enumerates all frequent patterns, often resulting in superfluous computation.
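As a point of reference, the problem statement can be written down as a brute-force baseline: one subset test per transaction per target. This is not GFP-growth, only a minimal sketch of the quantity it computes; the function name `count_targets_naive` and the toy data are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def count_targets_naive(
    transactions: List[Set[str]],
    targets: Iterable[FrozenSet[str]],
) -> Dict[FrozenSet[str], int]:
    """Exact occurrence count of every target itemset, by exhaustive scanning."""
    return {
        target: sum(1 for tx in transactions if target <= tx)
        for target in targets
    }


# Toy database (hypothetical data, for illustration only).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}]
targets = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"a", "b", "c"})]
naive_counts = count_targets_naive(transactions, targets)
```

The baseline costs one pass over the database per target, which is the overhead that a shared FP-tree traversal is meant to avoid.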
2. GFP-Growth Algorithm Structure
GFP-growth coordinates the traversal of two trees:
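- The FP-tree ($\mathit{FP}$), where the header table $H$ maps each item to the linked list of tree nodes that carry it,
- The Target Item-Set tree (TIS-tree), where nodes correspond to all prefixes of the itemsets in the target collection $T$.
Each TIS-tree node $v$ encapsulates:
- $v.\mathit{count}$: the occurrence count associated with $v$ (initialized to $0$),
- $v.\mathit{target}$: a boolean indicating whether the itemset spelled out by the path to $v$ belongs to $T$.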
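The node layout described above could be sketched as follows. The class name `TISNode`, the helper `build_tis_tree`, and the `support` dictionary are assumptions made for illustration. Each target is inserted with its least frequent item first, matching the support-ascending order used for the TIS-tree.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class TISNode:
    item: str = ""          # item on the edge into this node (empty at the root)
    count: int = 0          # occurrence count, initialised to 0
    target: bool = False    # True if the path from the root spells a target itemset
    children: Dict[str, "TISNode"] = field(default_factory=dict)


def build_tis_tree(targets: Iterable[Set[str]], support: Dict[str, int]) -> TISNode:
    """Insert every target as a root-to-node path in support-ascending order."""
    root = TISNode()
    for itemset in targets:
        node = root
        for item in sorted(itemset, key=lambda i: (support[i], i)):
            node = node.children.setdefault(item, TISNode(item))
        node.target = True
    return root


# Hypothetical item supports and targets.
support = {"a": 5, "b": 3, "c": 2}
tis_root = build_tis_tree([{"a", "b"}, {"a", "b", "c"}, {"b", "c"}], support)
```

Because `{b, c}` and `{a, b, c}` share the prefix `c, b`, they share a path in the tree; the intermediate node for `c` alone exists but is not flagged as a target.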
Traversal proceeds in support-ascending order, so that each path in the TIS-tree mirrors the bottom-up expansion of conditional pattern bases in FP-growth. The core of the algorithm, paraphrasing the comments in the original pseudocode, is as follows:
- Iterate over the children $c$ of the current TIS-node in support order.
- If the header-table entry for the item of $c$ is nonempty, accumulate counts over its node-link list.
- For target nodes, record the accumulated total as the node's occurrence count.
- If children exist, recursively construct a filtered conditional FP-tree using only items needed in the relevant subtree (FilteredItemSet), then recurse.
This structure ensures that only the minimal necessary subtrees are constructed and that counts are computed exclusively for the itemsets in the target set $T$.
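The recursion outlined above can be illustrated with a compact, self-contained sketch. It is not the authors' implementation: all names (`FPTree`, `gfp_growth`, `count_targets`, and so on) are hypothetical, TIS-nodes are plain dictionaries for brevity, and siblings are visited in insertion order, which does not affect the resulting counts.

```python
from collections import defaultdict


class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}


class FPTree:
    def __init__(self, rank):
        self.rank = rank                 # item -> position in support-descending order
        self.root = FPNode(None, None)
        self.header = defaultdict(list)  # item -> every tree node carrying that item

    def insert(self, items, count=1):
        node = self.root
        for item in sorted(items, key=self.rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                self.header[item].append(child)
            child.count += count
            node = child

    def conditional(self, item, keep):
        """Conditional FP-tree of `item`, restricted to the items in `keep`."""
        cond = FPTree(self.rank)
        for node in self.header.get(item, ()):
            path, up = [], node.parent
            while up.item is not None:   # climb to the root, collecting kept items
                if up.item in keep:
                    path.append(up.item)
                up = up.parent
            cond.insert(path, node.count)
        return cond


def new_tis_node():
    return {"count": 0, "target": False, "children": {}}


def items_below(tis_node):
    """Every item in the TIS subtree under `tis_node` (its FilteredItemSet)."""
    found = set()
    for item, child in tis_node["children"].items():
        found.add(item)
        found |= items_below(child)
    return found


def gfp_growth(tis_node, fp_tree):
    for item, child in tis_node["children"].items():
        nodes = fp_tree.header.get(item, ())
        if not nodes:
            continue                     # prefix absent: counts below stay 0
        if child["target"]:
            child["count"] = sum(n.count for n in nodes)
        if child["children"]:            # TIS leaves need no conditional tree
            gfp_growth(child, fp_tree.conditional(item, items_below(child)))


def collect(tis_node, prefix=frozenset()):
    """Read the counts of all target nodes back out of the TIS-tree."""
    out = {}
    for item, child in tis_node["children"].items():
        itemset = prefix | {item}
        if child["target"]:
            out[itemset] = child["count"]
        out.update(collect(child, itemset))
    return out


def count_targets(transactions, targets):
    support = defaultdict(int)
    for tx in transactions:
        for item in tx:
            support[item] += 1
    order = sorted(support, key=lambda i: (-support[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    fp_tree = FPTree(rank)
    for tx in transactions:
        fp_tree.insert(tx)
    tis_root = new_tis_node()
    for itemset in targets:
        node = tis_root
        # Least frequent item first; items absent from the data sort to the front.
        for item in sorted(itemset, key=lambda i: (rank.get(i, len(rank)), i), reverse=True):
            node = node["children"].setdefault(item, new_tis_node())
        node["target"] = True
    gfp_growth(tis_root, fp_tree)
    return collect(tis_root)


transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}, {"a"}]
targets = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "d"}]
gfp_counts = count_targets(transactions, targets)
```

In this toy run the target `{a, d}` contains an item that never occurs, so the header lookup for `d` is empty, no conditional tree is built beneath it, and its count stays at zero.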
3. Guided Conditional-Pattern-Base Construction
A central optimization is the guided construction of conditional pattern bases:
- For a prefix $\beta$ at a TIS-node, a compact structure (bitmap or hash set) enumerates the “FilteredItemSet”: the items that appear in any descendant target of $\beta$.
- During conditional FP-tree construction, any item $i \notin \text{FilteredItemSet}$ is omitted.
This pruning substantially limits the size of conditional trees, focusing compute effort on only those transactions and items relevant for the downstream targets. This strategy contrasts with conventional FP-growth, which would otherwise include all co-occurring items at each recursion.
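The effect of this filtering can be shown on a conditional pattern base represented as a list of `(path, count)` pairs. This is a simplified sketch under stated assumptions: targets are given as tuples already sorted in support-ascending order, and the helper names `filtered_item_set` and `prune_pattern_base` are hypothetical.

```python
from collections import Counter


def filtered_item_set(prefix, ordered_targets):
    """Items that follow `prefix` in any target whose ordered form starts with it."""
    n = len(prefix)
    needed = set()
    for target in ordered_targets:
        if target[:n] == prefix:
            needed.update(target[n:])
    return needed


def prune_pattern_base(pattern_base, keep):
    """Drop items outside `keep` from every path and merge paths that become equal.

    Paths that become empty are discarded; the count of the prefix itself is
    assumed to have been read from the header list before this step.
    """
    merged = Counter()
    for path, count in pattern_base:
        kept = tuple(item for item in path if item in keep)
        if kept:
            merged[kept] += count
    return dict(merged)


# Hypothetical targets (support-ascending) and a pattern base for prefix ("c",).
ordered_targets = [("c", "b", "a"), ("c", "b"), ("d", "a")]
keep = filtered_item_set(("c",), ordered_targets)
pattern_base = [(("a", "b", "e"), 2), (("b", "e"), 1), (("a", "f"), 1), (("e",), 1)]
pruned = prune_pattern_base(pattern_base, keep)
```

Items `e` and `f` co-occur with `c` but appear in no target below it, so they never enter the conditional tree, and the path consisting only of `e` disappears entirely.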
4. Correctness Guarantees
The main correctness theorem asserts:
Upon GFP-growth termination, for every target node $v$ with itemset $\alpha$, $v.\mathit{count} = C(\alpha)$, the number of transactions in $D$ that contain $\alpha$.
Proof (sketch): By induction, the FP-subtree rooted at any prefix $\beta$ faithfully represents the transactions containing $\beta$. Recursing only when the header-table entry for the next item is nonempty ensures that $C(\alpha)$ is counted exactly, and guided conditional FP-trees remain correct by the induction hypothesis. If that entry is empty, no further recursion occurs and $v.\mathit{count}$ remains zero.
5. Complexity Analysis
Let:
- $n = |D|$ (number of transactions),
- $m = |I|$ (number of items),
- $\ell$ (average transaction length),
- $k = |T|$ (number of targets),
- $L$ (average length of target itemsets),
- $N_{\mathrm{GFP}}$ (number of FP-tree nodes actually visited by GFP-growth).
Time and memory requirements are:
| Operation | Time Complexity | Memory Complexity |
|---|---|---|
| FP-tree construction | $O(n \cdot \ell)$ | $O(n \cdot \ell)$ |
| TIS-tree over $k$ targets | — | $O(k \cdot L)$ |
| GFP-growth recursion | $O(N_{\mathrm{GFP}})$ (pruned) | $O(k \cdot L)$ (TIS-tree plus bitmaps) |
Overall,
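- Time(GFP-growth) = $O(n \cdot \ell + N_{\mathrm{GFP}})$
- Time(FP-growth) = $O(n \cdot \ell + N_{\mathrm{FP}})$
where $N_{\mathrm{FP}}$ is the number of FP-tree nodes visited by unrestricted FP-growth, with $N_{\mathrm{GFP}} \ll N_{\mathrm{FP}}$ whenever the number of targets $k$ is much smaller than the total number of frequent patterns. Memory overhead is limited to the TIS-tree and lightweight filtering structures at each TIS-node.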
6. Optimizations and Implementation Considerations
- Header table lookups are $O(1)$, so an item that is absent from the current tree is detected immediately and its subtree is never evaluated.
- Conditional tree construction is omitted for TIS-leaf nodes.
- Each TIS-node maintains its FilteredItemSet, minimizing conditional tree size.
- Exact counting is performed only for target nodes, reducing unnecessary accumulation for intermediate prefixes.
- Consistent item ordering (support-descending for FP-tree, support-ascending for TIS-tree) supports memory locality and avoids additional sorting.
These choices collectively enhance practical efficiency and scalability.
7. Application: The Minority-Report Algorithm
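The Minority-Report Algorithm leverages GFP-growth for targeted rule mining in class-imbalanced datasets. It finds all association rules of the form $\alpha \Rightarrow c_1$ (where $c_1$ is the minority class) whose support and confidence exceed the minimum thresholds $\mathit{minsup}$ and $\mathit{minconf}$. Steps include:
- Compute occurrence counts $C_1(i)$ for the items $i$ in the rare-class subset $D_1$; retain those whose count meets the minimum-support threshold.
- Construct FP-trees $\mathit{FP}_0$ (majority) and $\mathit{FP}_1$ (minority) over the reduced item universe.
- Mine $\mathit{FP}_1$ for all itemsets $\alpha$ with minority count $C_1(\alpha)$ above the threshold; insert them into the TIS-tree.
- Use GFP-growth on $\mathit{FP}_0$ to assign the majority counts $C_0(\alpha)$.
- For each $\alpha$, compute the confidence $C_1(\alpha) / (C_0(\alpha) + C_1(\alpha))$ and output the rules that meet $\mathit{minconf}$.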
Correctness follows from FP-growth finding all minority-class frequent itemsets and GFP-growth yielding corresponding majority counts. Resulting support and confidence values are exact, and all qualifying rules are output.
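The data flow of these steps can be sketched with brute-force counting standing in for both FP-growth (on the minority class) and GFP-growth (on the majority class). This illustrates only the logic, not the performance characteristics. The function name, the `max_len` cap, and the use of an absolute minority-class count `min_count` as the support threshold are all assumptions made for this example.

```python
from itertools import combinations


def minority_report_sketch(rows, minority_label, min_count, min_conf, max_len=3):
    """Return (itemset, minority_count, confidence) for rules itemset => minority."""
    minority = [items for items, label in rows if label == minority_label]
    majority = [items for items, label in rows if label != minority_label]

    def count(itemset, db):
        return sum(1 for tx in db if itemset <= tx)

    # Keep only items that are frequent enough inside the minority class.
    items = sorted({i for tx in minority for i in tx})
    frequent_items = [i for i in items if count(frozenset([i]), minority) >= min_count]

    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(frequent_items, size):
            itemset = frozenset(combo)
            c1 = count(itemset, minority)   # stand-in for mining the minority FP-tree
            if c1 < min_count:
                continue
            c0 = count(itemset, majority)   # stand-in for GFP-growth on the majority FP-tree
            confidence = c1 / (c0 + c1)
            if confidence >= min_conf:
                rules.append((itemset, c1, confidence))
    return rules


# Hypothetical labelled transactions: label 1 is the minority class.
rows = [
    ({"x", "y"}, 1), ({"x", "y"}, 1),
    ({"x"}, 0), ({"y", "z"}, 0), ({"x", "y", "z"}, 0), ({"z"}, 0),
]
rules = minority_report_sketch(rows, minority_label=1, min_count=2, min_conf=0.6)
```

Only itemsets that are already frequent in the minority class are ever counted in the majority class, which is why the majority-side work can be phrased as a targeted query rather than a full mining run.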
8. Empirical Evaluation
The implementation extends Christian Borgelt's C FP-growth codebase. Experiments covered both synthetic Bernoulli-model and real (UCI Adult/Census Income) datasets, with varying rarity of the minority class (the minority-class fraction). Table 1 summarizes representative speedup results of GFP-growth vs. FP-growth.
| Scenario | Minority-class fraction | Transactions | Speedup (FP/GFP) |
|---|---|---|---|
| Simulation (5, 6) | 0.01 | 50,000 | 20× – 80× |
| Simulation (7, 8) | 0.10 | 50,000 | 5× – 20× |
| Census Income (real data) | 0.05 | 22,500 | 30× – 50× |
As the target class becomes rarer, the gains of GFP-growth increase markedly: in the simulations, the reported speedup rises from 5×–20× at a minority fraction of 0.10 to 20×–80× at 0.01. This supports the claim that focused, guided mining of pre-specified itemsets can yield order-of-magnitude improvements over exhaustive frequent pattern enumeration. When the target set is substantially smaller than the set of all frequent patterns, the memory and execution time of GFP-growth scale near-linearly with the number of targets (Shabtay et al., 2018).