MurTree: Optimal Trees & Topological Analysis
- MurTree is a collection of dynamic programming techniques for optimal decision tree learning and robust merge tree edit distance computation.
- It integrates branch-and-bound search with caching and tight lower bounds to efficiently minimize misclassification costs under depth and node constraints.
- In computational topology, MurTree's edit distance framework enhances applications such as shape matching, symmetry detection, and flow summarization.
MurTree refers to two distinct contributions in computational topology and machine learning: (1) a high-performance dynamic programming algorithm for optimal classification tree induction, and (2) an algorithmic framework for computing tree edit distances between merge trees, a key data structure in topological analysis of scalar fields. Both lines of work share a unifying theme of dynamic programming on trees, but are situated in separate research domains. This entry addresses both algorithmic frameworks, with an emphasis on their definitions, core methodologies, mathematical principles, scalability, and empirical performance.
1. MurTree for Optimal Classification Tree Learning
1.1 Problem Definition
The MurTree classification approach addresses the globally optimal learning of decision trees under explicit size and depth constraints. Given a training set $\mathcal{D}$ of labeled instances with binary labels over a fixed set of binary features $\mathcal{F}$, the goal is to induce a full binary decision tree $\mathcal{T}$. Each internal node tests a predicate $f \in \mathcal{F}$, and each leaf predicts $0$ or $1$. The optimization criterion is the minimization of misclassification cost, defined for any leaf containing subset $D \subseteq \mathcal{D}$ as:
$$\mathrm{cost}(D) = \min\{|D^{+}|, |D^{-}|\},$$
where $D^{+}$ and $D^{-}$ are the positive and negative instances in $D$, with the cost for the tree given by the sum over all leaves. Optionally, a linear penalty per internal node penalizes tree size. Hard constraints are placed on the maximum tree depth $d$ and the number of internal nodes $n$ (Demirović et al., 2020).
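As a small illustration of this objective, the following sketch evaluates the cost of a fixed partition of instances into leaves. The names `leaf_cost`, `tree_cost`, and the `penalty` parameter are hypothetical and not taken from the MurTree implementation; each leaf is assumed to predict its majority label.

```python
from typing import List, Tuple

# An instance is (binary feature vector, binary label); this layout is an assumption.
Instance = Tuple[Tuple[int, ...], int]

def leaf_cost(instances: List[Instance]) -> int:
    """Misclassifications of a leaf that predicts its majority label."""
    positives = sum(label for _, label in instances)
    negatives = len(instances) - positives
    return min(positives, negatives)

def tree_cost(leaves: List[List[Instance]],
              num_internal: int = 0,
              penalty: float = 0.0) -> float:
    """Sum of leaf costs plus an optional linear penalty per internal node."""
    return sum(leaf_cost(leaf) for leaf in leaves) + penalty * num_internal

# Two leaves produced by a single split (one internal node).
left = [((0,), 1), ((0,), 1), ((0,), 0)]   # majority label 1, one error
right = [((1,), 0), ((1,), 0)]             # pure leaf, no errors
print(tree_cost([left, right]))                               # 1
print(tree_cost([left, right], num_internal=1, penalty=0.5))  # 1.5
```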
1.2 Dynamic Programming Recurrence
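MurTree's core is a dynamic programming (DP) decomposition. Let $T(D, d, n)$ denote the minimum misclassification cost achievable on data subset $D$, using a binary tree of depth at most $d$ and at most $n$ internal nodes. The recurrence is:
$$T(D, d, n) = \begin{cases} T(D, d, 2^{d} - 1) & \text{if } n > 2^{d} - 1, \\ T(D, n, n) & \text{if } d > n, \\ \min\{|D^{+}|, |D^{-}|\} & \text{if } n = 0 \text{ or } d = 0, \\ \min\limits_{f \in \mathcal{F},\; 0 \le i \le n-1} \big[ T(D(\bar{f}), d-1, n-1-i) + T(D(f), d-1, i) \big] & \text{otherwise,} \end{cases}$$
where $|D^{+}|$ and $|D^{-}|$ count the positive and negative instances in $D$, with $D(\bar{f})$ and $D(f)$ denoting instance subsets for feature outcome $0$ and $1$, respectively.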
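A minimal memoized sketch of such a recurrence is given below, without any of the pruning described later. The toy XOR dataset `DATA`, the constant `NUM_FEATURES`, and the function names are assumptions made for illustration; this is not the MurTree code itself.

```python
from functools import lru_cache
from typing import FrozenSet

# Hypothetical toy dataset: (feature bits, label), label = XOR of the two bits.
DATA = (
    ((0, 0), 0),
    ((0, 1), 1),
    ((1, 0), 1),
    ((1, 1), 0),
)
NUM_FEATURES = 2

def leaf_cost(rows: FrozenSet[int]) -> int:
    """Errors made by a single leaf predicting the majority label."""
    positives = sum(DATA[i][1] for i in rows)
    return min(positives, len(rows) - positives)

@lru_cache(maxsize=None)
def optimal_cost(rows: FrozenSet[int], depth: int, nodes: int) -> int:
    """Minimum misclassifications with depth <= `depth` and <= `nodes` internal nodes."""
    nodes = min(nodes, 2 ** depth - 1)   # a depth-d tree has at most 2^d - 1 internal nodes
    depth = min(depth, nodes)            # n internal nodes cannot reach deeper than n
    if depth == 0 or nodes == 0:
        return leaf_cost(rows)
    best = leaf_cost(rows)               # safe start: a split never does worse than a leaf
    for f in range(NUM_FEATURES):
        left = frozenset(i for i in rows if DATA[i][0][f] == 0)
        right = rows - left
        for i in range(nodes):           # i nodes go right, nodes - 1 - i go left
            cost = (optimal_cost(left, depth - 1, nodes - 1 - i)
                    + optimal_cost(right, depth - 1, i))
            best = min(best, cost)
    return best

all_rows = frozenset(range(len(DATA)))
print(optimal_cost(all_rows, 1, 1))  # 2: one split cannot separate XOR
print(optimal_cost(all_rows, 2, 2))  # 1: root plus one child split
print(optimal_cost(all_rows, 2, 3))  # 0: full depth-2 tree
```

Keying the cache on the set of row indices rather than on the path of features means that two branches reaching the same subset share one cached result.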
1.3 Branch-and-Bound and Search Pruning
MurTree combines the DP with branch-and-bound search strategies. Key optimization techniques include:
- Caching: Every subproblem is memoized using a hash table.
- Upper Bound (UB) Pruning: Early discovery of a solution with cost $c$ sets $UB = c$, pruning subtrees whose lower bound is at least $UB$.
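- Lower Bounds:
  - Stored-bound: If a call is proven infeasible within its upper bound $UB$, that bound is stored as a lower bound for the subproblem (DL8.5-style).
  - Similarity-based lower bound: For a cached subproblem on $D_{\text{old}}$ and a new subset $D_{\text{new}}$, $T(D_{\text{new}}, d, n) \ge T(D_{\text{old}}, d, n) - |D_{\text{old}} \setminus D_{\text{new}}|$, since removing one instance lowers the optimal cost by at most one.
  - Local refinement: A subproblem's bound is tightened from the bounds of its children, $LB(D, d, n) \ge \min_{f \in \mathcal{F}} \big[ LB(D(\bar{f}), d-1, n-1) + LB(D(f), d-1, n-1) \big]$.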
- Degeneracy pruning: Splits that do not partition the data are skipped.
- Dynamic node-order: The branch with larger single-leaf cost is searched first, to increase the likelihood of early UB exceedance.
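Two of the pruning ingredients above can be sketched in a few lines. The helper names are hypothetical, and the pruning test assumes the search only needs solutions strictly better than the incumbent; the actual solver may use a different convention.

```python
from typing import FrozenSet

def similarity_lower_bound(old_rows: FrozenSet[int],
                           old_cost: int,
                           new_rows: FrozenSet[int]) -> int:
    """Bound the optimal cost on new_rows from a cached optimum on old_rows.

    Each instance present in old_rows but absent from new_rows can reduce the
    optimal cost by at most one; instances that were added cannot reduce it.
    """
    removed = len(old_rows - new_rows)
    return max(0, old_cost - removed)

def should_prune(lower_bound: int, upper_bound: int) -> bool:
    """Skip a subproblem that cannot beat the best solution found so far."""
    return lower_bound >= upper_bound

# A cached subproblem with optimal cost 3 and a similar new subproblem.
old = frozenset({1, 2, 3, 4})
new = frozenset({2, 3, 4, 5})
lb = similarity_lower_bound(old, 3, new)
print(lb)                    # 2
print(should_prune(lb, 2))   # True: an incumbent of cost 2 cannot be improved here
```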
1.4 Constraint Handling
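Both the depth and node count are integrated into the DP state and recurrences, ensuring that child nodes always obey their respective budget splits: each child receives depth budget $d - 1$, and the node budgets of the two children satisfy $n_L + n_R = n - 1$ (Demirović et al., 2020).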
1.5 Complexity and Empirical Scalability
The general DP without acceleration takes time exponential in the depth budget, since every level of the recurrence branches over all features. An optimized depth-2 specialization (using precomputed feature and pairwise-feature frequency tables) reduces the number of major subcalls and handles the majority of subproblems in constant time. Aggressive caching and tight lower bounds empirically prune large parts of the search space. On standard UCI/C4.5 benchmarks (depth-4 trees), MurTree is reported to solve all tasks, outperforming DL8.5 by up to two orders of magnitude (Demirović et al., 2020).
2. MurTree Tree Edit Distance for Merge Trees
2.1 Merge Trees: Background and Definition
Merge trees encode the evolution of connected components in the sublevel (join tree) or superlevel (split tree) sets of a scalar field. Each node corresponds to a critical point, and each edge to a merging event as the level set parameter increases. Nodes are labeled with scalar values (birth/death times in persistence). Merge trees are rooted, with an explicit binary structure reflecting topological evolution (Sridharamurthy et al., 2022).
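To make the sublevel-set construction concrete, the sketch below sweeps a one-dimensional sampled field in increasing value order with a union-find structure and records where components are born and where they merge. The restriction to a 1D sequence, the elder-rule pairing, and the function name are assumptions chosen for brevity; they are not taken from the cited work.

```python
from typing import List, Tuple

def sublevel_merge_events(values: List[float]) -> List[Tuple[float, float]]:
    """Return (birth, death) pairs of sublevel-set components of a 1D field."""
    n = len(values)
    parent = [-1] * n      # -1 marks a vertex that has not been swept yet
    birth = {}             # component root -> value at which the component was born

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the younger component ends at this merge.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:      # skip zero-persistence joins
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    pairs.append((birth[find(0)], float("inf")))  # the global minimum never dies
    return pairs

# Minima at 1, 0 and 0.5; merges happen at heights 2 and 3.
print(sublevel_merge_events([1, 3, 0, 2, 0.5]))
# [(0.5, 2), (1, 3), (0, inf)]
```

Each finite pair corresponds to a leaf (minimum) and the internal node (saddle) at which its branch joins an older one.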
2.2 Edit Distance: Cost Model
The MurTree edit distance is the cost of a minimum-cost sequence of node-wise edit operations (insertions, deletions, relabelings) aligning trees $T_1$ and $T_2$. Node costs use a metric $\gamma$, with $\lambda$ denoting the empty node:
- Deletion: $\gamma(u \to \lambda)$
- Insertion: $\gamma(\lambda \to v)$
- Relabel: $\gamma(u \to v)$
where each cost is evaluated on the birth and death values $(b_u, d_u)$ of node $u$. The overall tree edit distance is computed as the minimal total operation cost under valid node matchings respecting tree structure (Sridharamurthy et al., 2022).
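The text above does not preserve the exact cost formulas, so the following is only one plausible instance of such a cost model, assumed for illustration: deletion and insertion cost half the persistence of the node, and relabeling costs the $L_\infty$ distance between birth/death pairs, capped by the cost of deleting one node and inserting the other so that the triangle inequality holds. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    birth: float
    death: float

def delete_cost(u: Node) -> float:
    """Assumed cost of removing u: half of its persistence."""
    return abs(u.death - u.birth) / 2

def insert_cost(v: Node) -> float:
    """Assumed cost of creating v: symmetric to deletion."""
    return delete_cost(v)

def relabel_cost(u: Node, v: Node) -> float:
    """Assumed cost of turning u into v, never more than delete-then-insert."""
    direct = max(abs(u.birth - v.birth), abs(u.death - v.death))
    return min(direct, delete_cost(u) + insert_cost(v))

u, v = Node(0.0, 4.0), Node(1.0, 3.0)
print(delete_cost(u), insert_cost(v), relabel_cost(u, v))  # 2.0 1.0 1.0

# Two short-lived features far apart: cheaper to delete one and insert the other.
print(relabel_cost(Node(0.0, 1.0), Node(10.0, 11.0)))      # 1.0
```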
2.3 Dynamic Programming Algorithm
The MurTree DP algorithm extends Zhang’s unordered-tree edit distance. Trees $T_1$ and $T_2$ are traversed in postorder. DP tables store the optimal edit cost from the subtree rooted at node $i$ in $T_1$ to the subtree rooted at node $j$ in $T_2$. The core recurrences comprise:
- Base cases for deletion/insertion of whole subtrees.
- For nontrivial subtrees rooted at $i$ and $j$, three strategies:
  - Delete $i$ and optimally match its children's forests.
  - Insert $j$ and optimally match its children's forests.
  - Match roots $i$ and $j$, then solve a bipartite matching problem on children.
Forest matching uses the Hungarian method (or variants), which runs in time cubic in the maximum degree per match. Summed over all pairs of nodes, the total running time is polynomial in the sizes of the two trees (Sridharamurthy et al., 2022).
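The recurrence structure can be sketched as follows, in the style of Zhang's constrained unordered tree edit distance. Several simplifications are assumptions of this sketch and not the source's method: node labels are single scalars, deletion and insertion cost the absolute label value, relabeling costs the absolute difference, and the child matching is solved by exhaustive recursion instead of the Hungarian method (adequate only for small degrees such as binary merge trees).

```python
from functools import lru_cache
from typing import Tuple

class Tree:
    def __init__(self, label: float, children: Tuple["Tree", ...] = ()):
        self.label = label
        self.children = tuple(children)

def remove(u: Tree) -> float:            # assumed deletion/insertion cost
    return abs(u.label)

def relabel(u: Tree, v: Tree) -> float:  # assumed relabel cost
    return abs(u.label - v.label)

def del_forest(t: Tree) -> float:
    """Cost of deleting (or inserting) all subtrees below t."""
    return sum(del_tree(c) for c in t.children)

def del_tree(t: Tree) -> float:
    """Cost of deleting (or inserting) the whole subtree rooted at t."""
    return remove(t) + del_forest(t)

def matching(xs: Tuple[Tree, ...], ys: Tuple[Tree, ...]) -> float:
    """Exhaustive min-cost matching of subtrees; unmatched ones are removed."""
    if not xs:
        return sum(del_tree(y) for y in ys)
    x, rest = xs[0], xs[1:]
    best = del_tree(x) + matching(rest, ys)
    for k, y in enumerate(ys):
        best = min(best, tree_dist(x, y) + matching(rest, ys[:k] + ys[k + 1:]))
    return best

@lru_cache(maxsize=None)
def forest_dist(a: Tree, b: Tree) -> float:
    """Edit cost between the child forests of a and b."""
    options = [matching(a.children, b.children)]
    options += [del_forest(b) + forest_dist(a, c) - del_forest(c) for c in b.children]
    options += [del_forest(a) + forest_dist(c, b) - del_forest(c) for c in a.children]
    return min(options)

@lru_cache(maxsize=None)
def tree_dist(a: Tree, b: Tree) -> float:
    """Edit cost between the subtrees rooted at a and b."""
    options = [forest_dist(a, b) + relabel(a, b)]             # match the roots
    options += [del_tree(b) + tree_dist(a, c) - del_tree(c)   # insert b's root
                for c in b.children]
    options += [del_tree(a) + tree_dist(c, b) - del_tree(c)   # delete a's root
                for c in a.children]
    return min(options)

t1 = Tree(5, (Tree(2), Tree(3)))
t2 = Tree(5, (Tree(3), Tree(2)))   # same tree with children swapped
t3 = Tree(5, (Tree(2),))           # one leaf missing
print(tree_dist(t1, t2))  # 0: child order is ignored
print(tree_dist(t1, t3))  # 3: delete the leaf labeled 3
```

Replacing `matching` with an assignment solver would restore polynomial time per node pair; the exhaustive version is kept here only to avoid external dependencies.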
2.4 Implementation and Optimization
To enhance stability and performance, several optimizations are applied:
- Small persistence intervals are merged for robustness.
- The implementation caches DP table entries and recycles solutions for isomorphic subtrees.
- Parallelization across independent DP blocks enables significant multicore speedups.
2.5 Applications and Experimental Highlights
The MurTree edit distance demonstrates utility in a range of topological data analysis (TDA) tasks:
- Periodicity detection: Outperforms bottleneck and Wasserstein distances in temporal periodicity discovery in vortex street simulations.
- Stability with respect to smoothing/subsampling: Maintains monotonicity except in degenerate barcode scenarios.
- Symmetry detection: Detects group equivalence in synthetic and cryo-EM datasets, with block-diagonal distance matrices.
- Shape matching: Clusters pose-varying meshes by class in TOSCA datasets, insensitive to pose changes.
- Flow summarization: Segments temporal regimes in 3D flow simulations.
Pairwise distance computation over all pairs of trees in a dataset completes within practical timeframes (e.g., $25$ minutes on 8-core hardware, with further acceleration via optimized code) (Sridharamurthy et al., 2022).
3. Realization of Merge Trees and Discrete Morse Functions
Merge trees can be realized via discrete Morse functions on trees and, notably, on paths. Each abstract merge tree corresponds to a discrete Morse function (critical-only, possibly index-ordered or sublevel-connected) on a path, and vice versa, modulo natural equivalence relations (symmetry, shuffle, or component–merge equivalence). These constructions enable explicit and bijective correspondence between merge trees and discrete Morse function classes, underpinning combinatorial and algorithmic analysis (Brüggemann, 2021).
4. Comparative Analysis with Existing Methods
MurTree for optimal classification trees substantially outperforms prior state-of-the-art solvers (notably DL8.5) in both runtime and scalability. For optimal trees on over $80$ UCI/C4.5 datasets, MurTree achieves lower solution times, solves datasets on which other solvers time out, and scales linearly with dataset size (Demirović et al., 2020).
In topological analysis, the MurTree edit metric offers richer discrimination of merge tree structure than bottleneck or Wasserstein distances. It is robust to function perturbations and supports nuanced applications such as symmetry detection and fine-grained shape clustering (Sridharamurthy et al., 2022).
5. Summary of Key Properties and Implications
MurTree constitutes a class of dynamic-programming based algorithms for tree-structured problems:
| Domain | Objective | Key Feature |
|---|---|---|
| Classification | Exact decision tree optimization | Primal DP with branch-and-bound, tight lower bounds |
| TDA/Topology | Merge-tree edit distance | Metric cost aligning trees and persistence intervals |
Both algorithms leverage structural decomposability of trees, advanced memoization, problem-specific lower bounds, and efficient matching strategies. MurTree's solution paradigms enable handling of large-scale, high-dimensional datasets and precise topological summaries. These developments comprise critical advances towards tractable, exact combinatorial learning models and robust metrics for geometric and topological data analysis (Demirović et al., 2020, Sridharamurthy et al., 2022, Brüggemann, 2021).