Log-Concave MLE on Tree Spaces
- Log-concave MLE is a nonparametric method that estimates density functions on tree spaces without manual tuning, leveraging the flexibility of the class of densities whose logarithm is concave.
- The approach exploits the unique geometry of phylogenetic tree spaces by translating the estimation problem into convex and concave hull computations in low-dimensional settings.
- Empirical comparisons reveal that the log-concave MLE outperforms kernel methods in accuracy and adaptability, particularly in clustering and support inference for complex phylogenetic data.
Maximum likelihood estimation of log-concave densities on tree space extends the nonparametric log-concave maximum likelihood framework from Euclidean space to spaces of phylogenetic trees, which are nonpositively curved metric (Hadamard) spaces. Log-concave densities—those whose logarithm is concave—are attractive since they form a flexible nonparametric class requiring no manual selection of tuning or smoothing parameters and admit a well-defined maximization problem. The approach allows for direct nonparametric estimation of the complex distributions observed in samples of phylogenetic trees, bypassing the need for explicit parametric modeling. This is particularly relevant in biological applications, where sample trees inferred from data (e.g., via phylogenetic reconstruction) can exhibit high variability and nonstandard features.
1. Mathematical Framework for Log-Concave MLE on Tree Space
The log-concave MLE is defined over the class of upper-semicontinuous log-concave densities $f = e^{h}$, with $h$ concave, with respect to a fixed base measure $\mu$ on the tree space $\mathcal{T}$. For a sample $X_1, \dots, X_N$, the log-likelihood is

$$\ell(f) \;=\; \sum_{i=1}^{N} \log f(X_i),$$

where $f$ ranges over the admissible class. The existence and uniqueness problem is studied in low-dimensional tree spaces:
- T₃ (1D case): The space is formed by three half-lines meeting at a common origin, representing all possible rooted phylogenetic trees with three leaves. Under a mild sample-size condition, the log-concave MLE exists and is unique with probability one.
- T₄ (2D case) and higher: The space is composed of multiple 2D Euclidean orthants glued along faces, and their connections can be described combinatorially (e.g., by the Petersen graph for T₄). The sufficient condition for existence and uniqueness relies on (a) the convex hull of the sample not including any "boundary" points from outside any orthant, (b) the intersection of the convex hull with each orthant having positive measure, and (c) specific connectivity properties of the convex hull between orthants.
The MLE is parameterized as $\hat f = \exp(\bar h_y)$, where $\bar h_y$ is the least upper-semicontinuous concave function satisfying $\bar h_y(X_i) \ge y_i$ for all $i$. The log-likelihood maximization problem reduces to optimizing

$$\sigma(y) \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i \;-\; \int_{\mathcal{T}} \exp\!\big(\bar h_y(x)\big)\, \mu(dx)$$

over $y \in \mathbb{R}^N$, together with a convexified modification of the objective (equation (18) of Takazawa et al., 2022) that guarantees the resulting optimization problem is convex.
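To make this reduction concrete, here is a minimal Python sketch of evaluating $\sigma(y)$ on a single leg identified with an interval of the real line; the segment-based representation of $\bar h_y$ and the function names are illustrative assumptions rather than the authors' implementation. It relies on the fact that the exponential of an affine function integrates in closed form.

```python
import math

def segment_integral(a, b, ha, hb):
    """Exact integral of exp(h) over [a, b] when h is affine with h(a)=ha, h(b)=hb."""
    slope = (hb - ha) / (b - a)
    if abs(slope) < 1e-12:                      # (nearly) flat piece
        return (b - a) * math.exp(ha)
    return (math.exp(hb) - math.exp(ha)) / slope

def sigma(y, segments):
    """Objective sigma(y) = (1/N) * sum(y_i) - integral of exp(h_bar_y).

    `segments` lists the affine pieces (a, b, h(a), h(b)) of the concave
    majorant h_bar_y induced by the heights y; they are assumed to have been
    precomputed, e.g. by the concave-hull step described in the next section."""
    integral = sum(segment_integral(a, b, ha, hb) for a, b, ha, hb in segments)
    return sum(y) / len(y) - integral

# Toy check: two points at positions 0 and 1 with heights y = (0, 0); the
# majorant is the flat tent h = 0 on [0, 1], so the integral is 1 and sigma = -1.
print(sigma([0.0, 0.0], [(0.0, 1.0, 0.0, 0.0)]))   # -> -1.0
```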
2. Algorithmic Implementation
One Dimension (T₃)
- The data is represented as points on three half-lines meeting at the origin; the log-density is specified by its values at the data points and is $-\infty$ elsewhere.
- The computation reduces to finding the concave hull of the points in T₃. This can be mapped to a convex hull calculation in a low-dimensional Euclidean space.
- Standard convex hull algorithms from Euclidean geometry then yield the concave hull function $\bar h_y$ exactly.
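The following is a minimal sketch of that reduction for a single geodesic line of T₃, obtained by identifying one leg with the negative half-axis and another with the positive half-axis; the monotone-chain routine is standard computational geometry and stands in for whatever hull routine is actually used.

```python
def least_concave_majorant(xs, ys):
    """Least concave function h on [min(xs), max(xs)] with h(x_i) >= y_i,
    returned as the vertex list of the upper convex hull of the points (x_i, y_i)."""
    pts = sorted(zip(xs, ys))
    hull = []
    for px, py in pts:
        # Drop the last vertex while it lies on or below the chord from the
        # previous vertex to the new point (it cannot be on the upper hull).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull

def evaluate_majorant(hull, x):
    """Evaluate the majorant at x by interpolating between hull vertices;
    it is -infinity outside the convex hull of the data positions."""
    if x < hull[0][0] or x > hull[-1][0]:
        return float("-inf")
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            t = 0.0 if x2 == x1 else (x - x1) / (x2 - x1)
            return (1 - t) * y1 + t * y2
    return hull[-1][1]

# One pair of legs of T3 viewed as the real line: leg A -> negative axis, leg B -> positive axis.
xs = [-2.0, -0.5, 0.0, 1.0, 2.5]
ys = [-3.0, -0.2, 0.1, -0.4, -2.5]
hull = least_concave_majorant(xs, ys)
print(hull, evaluate_majorant(hull, 0.5))
```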
Two Dimensions (T₄)
- The log-density must be concave on a space comprising multiple connected Euclidean orthants.
- The algorithm iteratively constructs "skeleton" sets and approximates the convex hull at each step by computing geodesic (cone) paths between points (see the sketch after this list), handling boundary intersections, and applying convex hull routines in appropriate lower-dimensional Euclidean subspaces.
- The procedure continues until convergence, yielding an approximation to the concave hull and thus the MLE.
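As a sketch of the point representation and the cone-path computation this step relies on, the snippet below uses an illustrative orthant labeling (the split names and function names are assumptions). The cone path through the origin always upper-bounds the geodesic distance and coincides with it when the two directions subtend an angle of at least π at the origin; general BHV geodesics require a dedicated algorithm such as Owen–Provan.

```python
import math

# A point of T4 is represented here as (orthant_id, (a, b)), where a, b >= 0 are
# the lengths of the two interior edges of the corresponding tree topology.

def cone_path_length(p, q):
    """Length of the path from p to q through the origin (the cone path).
    Always an upper bound on the geodesic distance between p and q."""
    (_, (a1, b1)), (_, (a2, b2)) = p, q
    return math.hypot(a1, b1) + math.hypot(a2, b2)

def within_orthant_distance(p, q):
    """Euclidean distance, valid only when both points lie in the same
    (flat, 2D) orthant."""
    (o1, (a1, b1)), (o2, (a2, b2)) = p, q
    assert o1 == o2, "points must share an orthant"
    return math.hypot(a1 - a2, b1 - b2)

# Example: two trees in different orthants (split labels are illustrative).
p = ("12|34", (0.8, 0.3))
q = ("13|24", (0.5, 0.9))
print(cone_path_length(p, q))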
General Structure
- Both algorithms involve selecting or updating the vector $y$ of log-density heights at the sampled points and evaluating the integral term, which is tractable using the geometric properties of tree space in low dimensions.
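For intuition on why the integral term is tractable, note that over any segment on which $\bar h_y$ is affine the exponential integrates in closed form (a standard calculus identity, stated here in illustrative notation):

$$\int_a^b e^{\bar h_y(x)}\,dx \;=\; \begin{cases} \dfrac{e^{\bar h_y(b)} - e^{\bar h_y(a)}}{s}, & s := \dfrac{\bar h_y(b) - \bar h_y(a)}{b - a} \neq 0, \\[1.5ex] (b - a)\, e^{\bar h_y(a)}, & s = 0. \end{cases}$$

The one-dimensional integral therefore decomposes into a sum over the linear pieces of the tent function, and the two-dimensional case admits analogous closed-form formulas over triangles of a subdivision, as in the Euclidean log-concave MLE literature.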
3. Statistical and Computational Properties
- Existence and Uniqueness: For T₃, the log-concave MLE exists and is unique almost surely under a mild sample-size condition; for higher-dimensional tree spaces, the conditions on the sample convex hull ensure existence and almost-everywhere uniqueness with respect to the base measure.
- Optimization: The parameterization and convexity properties enable efficient optimization using standard numerical routines once the geometric structure is established.
- Comparison to Kernel Methods: The log-concave MLE (LCMLE) requires no bandwidth or smoothing parameter tuning, automatically adapts to unknown support, and estimates densities directly from the data.
4. Empirical Performance and Comparisons
Extensive simulation experiments compare the LCMLE with kernel density estimation (KDE) in both one- and two-dimensional tree spaces:
- In T₃ (1D), for both normal-like and exponential-like densities, the LCMLE achieves lower integrated squared error (ISE; see the sketch after this list) than KDE, especially as the sample size increases.
- In T₄ (2D), two scenarios are studied:
- For densities with full support, LCMLE eventually outperforms KDE as sample size increases; for small samples, KDE may have an edge.
- For densities supported only in a subset of orthants, the LCMLE outperforms KDE even for smaller samples, attributed to its ability to correctly infer the support.
- The ability of the LCMLE to adapt to the true support is emphasized as a key advantage.
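As a reference point for how such comparisons can be scored, the sketch below computes an integrated squared error on T₃ by quadrature along each leg; the leg parameterization, grid choice, and function names are illustrative assumptions rather than the simulation setup of the paper.

```python
import numpy as np

def ise_on_tripod(f_hat, f_true, leg_length=10.0, n_grid=2001):
    """Integrated squared error on T3: sum over the three legs of the
    integral of (f_hat - f_true)^2 along that leg (trapezoidal rule).

    f_hat and f_true take (leg, t), with leg in {0, 1, 2} and t >= 0 the
    distance from the origin, and return a density value at that point."""
    t = np.linspace(0.0, leg_length, n_grid)
    total = 0.0
    for leg in range(3):
        diff_sq = np.array([(f_hat(leg, s) - f_true(leg, s)) ** 2 for s in t])
        total += float(np.sum((diff_sq[:-1] + diff_sq[1:]) * np.diff(t) / 2.0))
    return total

# Sanity check: identical densities give ISE = 0.
g = lambda leg, s: np.exp(-s) / 3.0   # unit total mass: 1/3 on each leg, exponential decay
print(ise_on_tripod(g, g))            # -> 0.0
```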
5. Applications: Clustering and Density Estimation with Bends
- Clustering: The method integrates the LCMLE into a mixture model on T₄ for clustering. Each cluster is modeled by a log-concave density, and an EM algorithm is used (a schematic sketch follows this list):
- The E-step computes posterior cluster probabilities.
- The M-step maximizes the modified log-likelihood and updates mixture proportions.
- The LCMLE-based EM algorithm is compared to k-means++ (using the Fréchet mean as centroid) for clustering phylogenetic trees. In a benchmark example, the LCMLE-based clustering achieves higher accuracy (89% versus 77% for k-means++).
- Boundary Densities: The framework is extended to handle densities with "bending" at the origin (as arise in Brownian motion or coalescent models) by relaxing the strict concavity constraint at the root, thereby enlarging the admissible class to include such densities. Existence and uniqueness of the MLE continue to hold under analogous conditions, and performance remains strong relative to KDE.
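A schematic of the EM loop described above is sketched below. It is a minimal illustration in which a weighted Gaussian fit on plain Euclidean vectors stands in for the paper's weighted log-concave MLE fit on T₄, and all function names are assumptions; the point is to convey the E-step/M-step structure, not the tree-space computation.

```python
import numpy as np

def em_mixture(X, K, n_iter=50, seed=0):
    """EM for a K-component mixture. Each component would be fitted by a
    weighted log-concave MLE in the tree-space method; a weighted Gaussian
    fit is used here as a runnable stand-in."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    R = rng.dirichlet(np.ones(K), size=N)            # responsibilities, N x K
    for _ in range(n_iter):
        # M-step: mixture proportions and weighted component fits.
        pi = R.mean(axis=0)
        dens = np.empty((N, K))
        for k in range(K):
            w = R[:, k] / R[:, k].sum()
            mu = (w[:, None] * X).sum(axis=0)
            cov = ((X - mu).T * w) @ (X - mu) + 1e-6 * np.eye(d)
            diff = X - mu
            quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
            dens[:, k] = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
        # E-step: posterior cluster probabilities.
        unnorm = dens * pi
        R = unnorm / unnorm.sum(axis=1, keepdims=True)
    return R.argmax(axis=1), pi

# Toy usage on two well-separated blobs in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(3.0, 0.3, (40, 2))])
labels, proportions = em_mixture(X, K=2)
print(proportions, labels[:5], labels[-5:])
```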
6. Broader Implications and Future Research
The log-concave MLE framework on tree space exhibits several notable features:
- It enables nonparametric density estimation for phylogenetic trees in a principled manner without manual regularization or support selection.
- The methodology is especially suited for settings where the underlying distribution is complex or nonstandard, as occurs with inferred trees in evolutionary biology.
- The framework offers improved interpretability, support-adaptivity, and accuracy—most pronounced in large-sample or support-mismatch regimes—for density and clustering applications.
- The extension to higher-dimensional tree spaces poses computational and theoretical challenges, and proving consistency akin to Euclidean nonparametric MLEs remains an open problem.
- The approach opens avenues for new nonparametric statistical tools and clustering algorithms in non-Euclidean settings, not limited to evolutionary biology but potentially applicable in any domain where Hadamard-type spaces arise.
7. Summary Table: Key Features of Log-Concave MLE on Tree Space
| Aspect | Log-Concave MLE | Kernel Density Estimator |
|---|---|---|
| Tuning needed | None (no bandwidth) | Bandwidth selection required |
| Support | Inferred from the data | Often fixed a priori |
| Adaptivity | Adapts automatically to the data | May oversmooth or behave poorly near support boundaries |
| Uniqueness | Yes (under mild conditions) | Depends on bandwidth choice |
| Computation | Exact (T₃); approximate (T₄) | Fast, but less adaptive |
The log-concave MLE provides a theoretically and empirically justified route for nonparametric density estimation and clustering in tree space, addressing the particular challenges posed by the inherent non-Euclidean geometry of phylogenetic data (Takazawa et al., 2022).