
MurTree: Optimal Trees & Topological Analysis

Updated 27 February 2026
  • MurTree is a collection of dynamic programming techniques for optimal decision tree learning and robust merge tree edit distance computation.
  • It integrates branch-and-bound search with caching and tight lower bounds to efficiently minimize misclassification costs under depth and node constraints.
  • In computational topology, MurTree's edit distance framework enhances applications such as shape matching, symmetry detection, and flow summarization.

MurTree refers to two distinct contributions in computational topology and machine learning: (1) a high-performance dynamic programming algorithm for optimal classification tree induction, and (2) an algorithmic framework for computing tree edit distances between merge trees, a key data structure in topological analysis of scalar fields. Both lines of work share a unifying theme of dynamic programming on trees, but are situated in separate research domains. This entry addresses both algorithmic frameworks, with an emphasis on their definitions, core methodologies, mathematical principles, scalability, and empirical performance.

1. MurTree for Optimal Classification Tree Learning

1.1 Problem Definition

The MurTree classification approach addresses the globally optimal learning of decision trees under explicit size and depth constraints. Given a training set $\mathcal{D}$ of $N$ labeled instances with binary labels $\{+1, -1\}$ over a fixed set of binary features $\mathcal{F}$, the goal is to induce a full binary decision tree $T$. Each internal node tests a predicate $f \in \mathcal{F}$, and each leaf predicts $+1$ or $-1$. The optimization criterion is the minimization of misclassification cost, defined for any leaf containing subset $\mathcal{D}' \subseteq \mathcal{D}$ as:

$$
\min \left\{ \left| \{ x \in \mathcal{D}' : \mathrm{label}(x) = +1 \} \right|, \left| \{ x \in \mathcal{D}' : \mathrm{label}(x) = -1 \} \right| \right\},
$$

with the cost for the tree given by the sum over all leaves. Optionally, a linear penalty $\alpha \geq 0$ per internal node penalizes tree size. Hard constraints are placed on the maximum tree depth $d$ and the number of internal nodes $n$ (Demirović et al., 2020).
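The cost definition above can be illustrated with a minimal sketch. The function names `leaf_cost` and `tree_cost` and the representation of a leaf as a list of labels are assumptions made for illustration, not part of the original work.

```python
from collections import Counter


def leaf_cost(labels):
    """Misclassification cost of one leaf: the size of its minority class."""
    counts = Counter(labels)
    return min(counts.get(+1, 0), counts.get(-1, 0))


def tree_cost(leaves, num_internal_nodes=0, alpha=0.0):
    """Sum of leaf costs plus an optional linear penalty per internal node."""
    return sum(leaf_cost(leaf) for leaf in leaves) + alpha * num_internal_nodes


# Hypothetical tree with two internal nodes and three leaves.
leaves = [[+1, +1, -1], [-1, -1, -1, +1], [+1]]
print(leaf_cost(leaves[0]))
print(tree_cost(leaves, num_internal_nodes=2, alpha=0.5))
```

Each leaf predicts its majority label, so the instances it misclassifies are exactly those of the minority class.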

1.2 Dynamic Programming Recurrence

MurTree's core is a dynamic programming (DP) decomposition. Let $T(\mathcal{D}, d, n)$ denote the minimum misclassification cost achievable on data subset $\mathcal{D}$, using a binary tree of depth at most $d$ and at most $n$ internal nodes. The recurrence is:

$$
T(\mathcal{D}, d, n) =
\begin{cases}
T(\mathcal{D}, d, 2^d-1) & \text{if } n > 2^d-1, \\
T(\mathcal{D}, n, n) & \text{if } d > n, \\
\min\{|\#^+(\mathcal{D})|, |\#^-(\mathcal{D})|\} & \text{if } d=0 \vee n=0, \\
\displaystyle \min_{f \in \mathcal{F},\, 0 \leq \ell \leq n-1} \Big[ T(\mathcal{D}_{f=0}, d-1, \ell) + T(\mathcal{D}_{f=1}, d-1, n-1-\ell) \Big] & \text{otherwise,}
\end{cases}
$$

with $\mathcal{D}_{f=0}$ and $\mathcal{D}_{f=1}$ denoting instance subsets for feature outcome $0$ and $1$, respectively.
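The recurrence can be transcribed almost literally into a memoized recursion. The following is a minimal sketch for tiny inputs only: it has none of the pruning or specialized subroutines described in this entry, and the function name `optimal_cost` and the data layout (feature tuples plus a label list) are assumptions for illustration.

```python
from functools import lru_cache


def optimal_cost(X, y, depth, nodes):
    """Minimum misclassifications with depth <= depth and <= nodes internal nodes.

    X: list of tuples of 0/1 features (at least one feature assumed).
    y: list of +1/-1 labels.
    """
    num_features = len(X[0])

    @lru_cache(maxsize=None)
    def T(idx, d, n):
        # First two cases of the recurrence: clamp redundant budgets.
        n = min(n, 2 ** d - 1)
        d = min(d, n)
        pos = sum(1 for i in idx if y[i] == +1)
        if d == 0 or n == 0:
            return min(pos, len(idx) - pos)  # leaf: minority-class size
        best = float("inf")
        for f in range(num_features):
            left = tuple(i for i in idx if X[i][f] == 0)
            right = tuple(i for i in idx if X[i][f] == 1)
            for l in range(n):  # l internal nodes go left, n-1-l go right
                best = min(best, T(left, d - 1, l) + T(right, d - 1, n - 1 - l))
        return best

    return T(tuple(range(len(X))), depth, nodes)


# XOR-like data: no single split helps, but a depth-2 tree separates it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, +1, +1, -1]
for d, n in [(0, 0), (1, 1), (2, 2), (2, 3)]:
    print(d, n, optimal_cost(X, y, d, n))
```

The subproblem key here is the tuple of instance indices, which is simple but memory-hungry; it stands in for whatever dataset representation a real solver would use.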

1.3 Branch-and-Bound and Search Pruning

MurTree combines the DP with branch-and-bound search strategies. Key optimization techniques include:

  • Caching: Every subproblem $T(\mathcal{D}, d, n)$ is memoized using a hash table.
  • Upper Bound (UB) Pruning: Early discovery of a solution with cost $C^*$ sets $UB \leftarrow C^* - 1$, pruning subtrees with lower bound $> UB$.
  • Lower Bounds:

    1. Stored-bound: If a call is proven infeasible within $UB$, $lb(\mathcal{D}, d, n) \gets UB + 1$ (DL8.5-style).
    2. Similarity-based lower bound:

    $$
    \mathrm{simLB}(\mathcal{D}_\mathrm{new}, \mathcal{D}_\mathrm{old}, d, n) = T(\mathcal{D}_\mathrm{old}, d, n) - |\mathcal{D}_\mathrm{old} \setminus \mathcal{D}_\mathrm{new}|.
    $$

    3. Local refinement:

    $$
    \mathrm{locLB}(\mathcal{D}, d, n) = \min_{f, \ell} \bigl( lb(\mathcal{D}_{f=0}, d-1, \ell) + lb(\mathcal{D}_{f=1}, d-1, n-1-\ell) \bigr).
    $$

  • Degeneracy pruning: Splits that do not partition the data are skipped.
  • Dynamic node-order: The branch with larger single-leaf cost is searched first, to increase the likelihood of early UB exceedance.
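The similarity-based bound and the pruning test can be sketched as follows. The helper names and the use of instance-id collections are assumptions for illustration; clamping the bound at zero is an addition here, justified only by the fact that a misclassification count cannot be negative.

```python
def similarity_lower_bound(cost_old, old_ids, new_ids):
    """Lower bound for new_ids derived from a solved subproblem on old_ids.

    Each instance present in old_ids but absent from new_ids can lower the
    optimal cost by at most one misclassification.
    """
    removed = len(set(old_ids) - set(new_ids))
    return max(0, cost_old - removed)  # clamp at 0: a cost is never negative


def can_prune(lower_bound, upper_bound):
    """A subproblem is skipped when even its lower bound exceeds the budget."""
    return lower_bound > upper_bound


# Hypothetical example: the old subproblem had optimal cost 5; the new
# dataset drops instances 0 and 1 and adds instances 10 and 11.
lb = similarity_lower_bound(5, range(10), [2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(lb, can_prune(lb, upper_bound=2))
```

Instances added to the new dataset do not loosen the bound, since extra data can only keep or raise the optimal cost for the same depth and node budget.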

1.4 Constraint Handling

Both the depth $d$ and node count $n$ are integrated into the DP state and recurrences, ensuring that child nodes always obey their respective budget splits: $\ell \leq 2^{d-1}-1$ and $n-1-\ell \leq 2^{d-1}-1$ (Demirović et al., 2020).
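A short sketch of how the admissible budget splits could be enumerated under these two inequalities. The generator name `feasible_splits` is hypothetical.

```python
def feasible_splits(d, n):
    """Yield (left, right) internal-node budgets for the children of a node.

    Assumes d >= 1 and n >= 1: one internal node is spent on the split itself.
    """
    cap = 2 ** (d - 1) - 1  # most internal nodes a subtree of depth d-1 can hold
    for left in range(n):
        right = n - 1 - left
        if left <= cap and right <= cap:
            yield left, right


print(list(feasible_splits(3, 5)))
print(list(feasible_splits(2, 3)))
```

Splits that would hand a child more nodes than its depth allows are dropped up front, so the recursion never visits a budget it would immediately clamp.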

1.5 Complexity and Empirical Scalability

The general DP without acceleration is $O(N \cdot |\mathcal{F}| \cdot n \cdot d)$. An optimized depth-2 specialization (using precomputed feature and pairwise-feature frequency tables) reduces major subcalls by $10\times$ to $100\times$ and handles the majority of subproblems in constant time. Aggressive caching and tight lower bounds empirically prune $99\%$ of the search space. On standard UCI/C4.5 benchmarks (depth-4 trees), MurTree solves all tasks within $60$ s (often under $1$ s), outperforming DL8.5 by up to two orders of magnitude (Demirović et al., 2020).
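The idea behind a frequency-table approach to depth-2 trees can be sketched as below. This is an illustrative reconstruction of the general counting technique under assumed names (`best_depth2_cost`, `fq`, `counts`), not the authors' implementation: one pass over the data fills pairwise co-occurrence counts, after which every candidate tree with three internal nodes is scored by arithmetic on those counts alone.

```python
def best_depth2_cost(X, y):
    """Minimum misclassifications of a depth-2 tree (root plus two children)."""
    m = len(X[0])
    # fq[s][i][j]: number of class-s instances with features i and j both set.
    fq = {s: [[0] * m for _ in range(m)] for s in (+1, -1)}
    total = {+1: 0, -1: 0}
    for row, label in zip(X, y):
        total[label] += 1
        on = [f for f in range(m) if row[f]]
        for i in on:
            for j in on:
                fq[label][i][j] += 1

    def counts(i, vi, j, vj):
        """Per-class counts of instances with feature i == vi and feature j == vj."""
        out = []
        for s in (+1, -1):
            both = fq[s][i][j]
            if vi and vj:
                c = both
            elif vi:
                c = fq[s][i][i] - both
            elif vj:
                c = fq[s][j][j] - both
            else:
                c = total[s] - fq[s][i][i] - fq[s][j][j] + both
            out.append(c)
        return out

    best = float("inf")
    for i in range(m):  # root feature
        cost = 0
        for vi in (0, 1):  # each side of the root picks its best child feature j
            cost += min(
                min(counts(i, vi, j, 0)) + min(counts(i, vi, j, 1))
                for j in range(m)
            )
        best = min(best, cost)
    return best


X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, +1, +1, -1]
print(best_depth2_cost(X, y))
```

Choosing the child feature equal to the root feature yields an empty branch, so a child that is better left as a leaf is covered without a special case.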

2. MurTree Tree Edit Distance for Merge Trees

2.1 Merge Trees: Background and Definition

Merge trees encode the evolution of connected components in the sublevel (join tree) or superlevel (split tree) sets of a scalar field $f: X \to \mathbb{R}$. Each node corresponds to a critical point, and edges to merging events as the level set parameter increases. Nodes are labeled with scalar values (birth/death times in persistence). Merge trees are rooted, with an explicit binary structure reflecting topological evolution (Sridharamurthy et al., 2022).

2.2 Edit Distance: Cost Model

The MurTree edit distance is a minimum-cost sequence of node-wise edit operations (insertions, deletions, relabelings) aligning trees $T_1$ and $T_2$. Node costs use a metric $\gamma$:

  • Deletion: $\gamma(p \to \varnothing) = \frac{1}{2}(d_p - b_p)$
  • Insertion: $\gamma(\varnothing \to q) = \frac{1}{2}(d_q - b_q)$
  • Relabel: $\gamma(p \to q) = \min\{ \max(|b_p - b_q|, |d_p - d_q|), \frac{1}{2}(d_p - b_p) + \frac{1}{2}(d_q - b_q) \}$

where $b_p, d_p$ are birth and death for node $p$. The overall tree edit distance is computed as the minimal total operation cost under valid node matchings respecting tree structure (Sridharamurthy et al., 2022).
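The three node costs translate directly into code. In this minimal sketch a node is represented as a `(birth, death)` tuple, which is an assumption made for illustration, as are the function names.

```python
def delete_cost(p):
    """gamma(p -> empty): half the persistence of node p = (birth, death)."""
    b, d = p
    return 0.5 * (d - b)


def insert_cost(q):
    """gamma(empty -> q): half the persistence of node q = (birth, death)."""
    b, d = q
    return 0.5 * (d - b)


def relabel_cost(p, q):
    """gamma(p -> q): the cheaper of a direct match and delete-then-insert."""
    (bp, dp), (bq, dq) = p, q
    direct = max(abs(bp - bq), abs(dp - dq))
    return min(direct, delete_cost(p) + insert_cost(q))


print(relabel_cost((0.0, 4.0), (1.0, 2.0)))    # direct match is cheaper
print(relabel_cost((0.0, 1.0), (10.0, 11.0)))  # delete + insert is cheaper
```

The inner `min` caps the relabel cost, so matching two distant low-persistence nodes never costs more than removing one and inserting the other.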

2.3 Dynamic Programming Algorithm

The MurTree DP algorithm extends Zhang’s unordered-tree edit distance. Trees $T_1$ and $T_2$ are traversed in postorder. DP tables $D[i, j]$ store optimal edit cost from the subtree rooted at node $i$ in $T_1$ to subtree $j$ in $T_2$. The core recurrences comprise:

  • Base cases for deletion/insertion of whole subtrees.
  • For nontrivial subtrees $(i, j)$, three strategies:

    1. Delete $i$ and optimally match its children's forests.
    2. Insert $j$ and optimally match its children's forests.
    3. Match roots, then solve a bipartite matching problem on children.

Forest matching uses the Hungarian method (or variants) with time $O((\Delta_1 + \Delta_2)^3)$ per match, where $\Delta_1$ and $\Delta_2$ are the maximum node degrees of $T_1$ and $T_2$. Total time is $O(n_1 n_2 (\Delta_1 + \Delta_2)^3)$ for trees with $n_1, n_2$ nodes (Sridharamurthy et al., 2022).
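The child-matching step can be sketched as an assignment problem on a padded square cost matrix of size $\Delta_1 + \Delta_2$. For brevity this sketch swaps the Hungarian method for brute-force enumeration of permutations, which is only viable for a handful of children; the function name and argument layout are assumptions for illustration.

```python
from itertools import permutations


def match_children(costs, delete_costs, insert_costs):
    """Min-cost matching of the children of i to the children of j.

    costs[a][b]: cost of mapping child subtree a (in T1) to child subtree b (in T2).
    delete_costs[a]: cost of deleting child subtree a entirely.
    insert_costs[b]: cost of inserting child subtree b entirely.
    """
    r, c = len(delete_costs), len(insert_costs)
    size = r + c
    # Pad with dummy rows and columns that stand for "insert" and "delete".
    M = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            if a < r and b < c:
                M[a][b] = costs[a][b]
            elif a < r:
                M[a][b] = delete_costs[a]  # real child matched to a dummy
            elif b < c:
                M[a][b] = insert_costs[b]  # dummy matched to a real child
            # dummy-to-dummy entries stay at 0
    # Brute force over all assignments (stand-in for the Hungarian method).
    return min(
        sum(M[a][perm[a]] for a in range(size))
        for perm in permutations(range(size))
    )


print(match_children([[1, 5], [4, 2]], [3, 3], [3, 3]))
print(match_children([[10]], [2], [3]))
```

A polynomial-time assignment solver would replace the final `min` over permutations while leaving the padded matrix unchanged.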

2.4 Implementation and Optimization

To enhance stability and performance, several optimizations are applied:

  • Small persistence intervals ($< \epsilon$) are merged for robustness.

  • Implementation caches DP table entries and recycles solutions for isomorphic subtrees.
  • Parallelization across independent DP blocks enables significant multicore speedups.

2.5 Applications and Experimental Highlights

The MurTree edit distance demonstrates utility in a range of topological data analysis (TDA) tasks:

  • Periodicity detection: Outperforms bottleneck and Wasserstein distances in temporal periodicity discovery in vortex street simulations.
  • Stability with respect to smoothing/subsampling: Maintains monotonicity except in degenerate barcode scenarios.
  • Symmetry detection: Detects group equivalence in synthetic and cryo-EM datasets, with block-diagonal distance matrices.
  • Shape matching: Clusters pose-varying meshes by class in TOSCA datasets, insensitive to pose changes.
  • Flow summarization: Segments temporal regimes in 3D flow simulations.

Pairwise distance computation for $\sim 60$-node trees over $10^6$ pairs completes within practical timeframes (e.g., $25$ minutes on 8-core hardware; $4\times$ to $8\times$ acceleration via optimized code) (Sridharamurthy et al., 2022).
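As a rough reading of these figures, assuming the $25$ minutes is wall-clock time for all $10^6$ pairs, the average throughput works out to:

$$
\frac{25 \times 60\ \text{s}}{10^6\ \text{pairs}} = \frac{1500\ \text{s}}{10^6\ \text{pairs}} = 1.5 \times 10^{-3}\ \text{s per pair} = 1.5\ \text{ms per pair}.
$$

If the 8 cores scaled near-ideally, which the source does not state, this would correspond to about $8 \times 1.5 = 12$ ms of single-core work per pair.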

3. Realization of Merge Trees and Discrete Morse Functions

Merge trees can be realized via discrete Morse functions on trees and, notably, on paths. Each abstract merge tree corresponds to a discrete Morse function (critical-only, possibly index-ordered or sublevel-connected) on a path, and vice versa, modulo natural equivalence relations (symmetry, shuffle, or component–merge equivalence). These constructions enable explicit and bijective correspondence between merge trees and discrete Morse function classes, underpinning combinatorial and algorithmic analysis (Brüggemann, 2021).
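The direction from a function on a path to its merge events can be sketched with a union-find pass. This is a generic sublevel-set computation offered as an illustration, not the construction from the cited work. It assumes every cell is critical, so each edge value exceeds the values on both of its endpoints, and it uses the elder rule (the younger component dies at a merge); the function name is hypothetical.

```python
def path_merge_events(vertex_vals, edge_vals):
    """Sublevel-set merge events of a function on a path.

    vertex_vals[i]: value on vertex i.
    edge_vals[i]: value on the edge between vertices i and i + 1.
    Returns (birth, death) pairs, one per edge, in order of increasing death.
    """
    parent = list(range(len(vertex_vals)))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    birth = list(vertex_vals)  # birth value stored at each component's root
    events = []
    for i in sorted(range(len(edge_vals)), key=lambda k: edge_vals[k]):
        a, b = find(i), find(i + 1)
        if birth[a] > birth[b]:
            a, b = b, a  # make a the older component (smaller birth value)
        events.append((birth[b], edge_vals[i]))  # younger component dies here
        parent[b] = a
    return events


print(path_merge_events([0.0, 1.0, 0.5], [3.0, 2.0]))
```

Because a path has no cycles, every edge joins two distinct components, so each edge contributes exactly one merge event.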

4. Comparative Analysis with Existing Methods

MurTree for optimal classification trees substantially outperforms prior state-of-the-art solvers (notably DL8.5) in both runtime and scalability. For $d=4$ optimal trees on over $80$ UCI/C4.5 datasets, MurTree achieves solution times $10\times$ to $100\times$ lower, solves all datasets where others time out, and scales linearly with dataset size up to $N=40{,}000$ (Demirović et al., 2020).

In topological analysis, the MurTree edit metric offers richer discrimination of merge tree structure than bottleneck or $W_1$ distances. It is robust to function perturbations and supports nuanced applications such as symmetry detection and fine-grained shape clustering (Sridharamurthy et al., 2022).

5. Summary of Key Properties and Implications

MurTree constitutes a class of dynamic-programming based algorithms for tree-structured problems:

| Domain | Objective | Key Feature |
| --- | --- | --- |
| Classification | Exact decision tree optimization | Primal DP with branch-and-bound, tight lower bounds |
| TDA/Topology | Merge-tree edit distance | Metric cost aligning trees and persistence intervals |

Both algorithms leverage structural decomposability of trees, advanced memoization, problem-specific lower bounds, and efficient matching strategies. MurTree's solution paradigms enable handling of large-scale, high-dimensional datasets and precise topological summaries. These developments comprise critical advances towards tractable, exact combinatorial learning models and robust metrics for geometric and topological data analysis (Demirović et al., 2020, Sridharamurthy et al., 2022, Brüggemann, 2021).

