PhyloLM: Phylogenetic Modeling Frameworks
- PhyloLM is a set of frameworks that integrate phylogenetic methods with language, network, and genomic models to reveal latent evolutionary relationships.
- It employs techniques like Nei-style distance metrics, branching Brownian motion, and Markov substitution models to quantify model divergence and infer modular hierarchies.
- These methods improve predictive accuracy and interpretability across domains, from benchmarking LLM performance to assessing variant pathogenicity in genomic sequences.
PhyloLM denotes several modern frameworks that integrate phylogenetic principles with modeling tasks across LLMs, network analysis, and genomics. Each formulation leverages the concept of evolutionary relationship (whether among species, neural network models, or network nodes) to improve inference, interpretability, or benchmarking. The three principal developments of PhyloLM, described in (Yax et al., 6 Apr 2024), (Pavone et al., 17 Feb 2025), and (Albors et al., 4 Mar 2025), illustrate the versatility of phylogenetic methods across domains: (1) mapping the evolutionary landscape of LLMs, (2) inferring nested modular hierarchies in network data via latent feature evolution, and (3) training genomic LLMs by directly optimizing phylogenetic likelihoods.
1. PhyloLM for Benchmarking and Relationship Discovery among LLMs
The approach of treating LLMs as populations whose “genetic” makeup is reflected in their generative behavior was introduced in (Yax et al., 6 Apr 2024). This implementation adapts phylogenetic algorithms from population genetics to address two challenges: systematically comparing LLMs, and predicting their performance on benchmarks given only black-box access and minimal explicit training information.
Conceptual Framework:
- LLMs as Populations: Each LLM is regarded as a population of probabilistic token-generation behaviors.
- Token Distributions as Allele Frequencies: For a fixed set of short contexts (“genes”), models are probed for next-token distributions (“alleles”).
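- Distance Metric: The normalized Nei-style similarity, $S(P_1, P_2) = \frac{\sum_{g}\sum_{a} P_1(a \mid g)\, P_2(a \mid g)}{\sqrt{\left(\sum_{g}\sum_{a} P_1(a \mid g)^2\right)\left(\sum_{g}\sum_{a} P_2(a \mid g)^2\right)}}$, with the associated distance $D(P_1, P_2) = -\log S(P_1, P_2)$, quantifies behavioral divergence across models $P_1$ and $P_2$, where $g$ ranges over genes (contexts) and $a$ over alleles (tokens).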
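The similarity and distance can be illustrated with a minimal sketch, assuming the standard Nei formulation (a normalized inner product of allele frequencies, with the distance taken as its negative logarithm). The function names, the probing contexts, and the token frequencies are hypothetical and chosen only for illustration.

```python
import math

def nei_similarity(p, q):
    """Normalized Nei-style similarity between two models.

    p and q map each context ("gene") to a dict of token -> probability
    ("allele frequencies"). Both are assumed to cover the same contexts.
    """
    cross = norm_p = norm_q = 0.0
    for gene in p:
        tokens = set(p[gene]) | set(q[gene])
        for tok in tokens:
            a = p[gene].get(tok, 0.0)
            b = q[gene].get(tok, 0.0)
            cross += a * b
            norm_p += a * a
            norm_q += b * b
    return cross / math.sqrt(norm_p * norm_q)

def nei_distance(p, q):
    """Nei-style distance: negative log of the similarity."""
    s = nei_similarity(p, q)
    return math.inf if s == 0.0 else -math.log(s)

# Hypothetical empirical next-token frequencies for two probing contexts.
model_a = {"2 + 2 =": {"4": 0.9, "5": 0.1}, "def f(": {"x": 0.7, "self": 0.3}}
model_b = {"2 + 2 =": {"4": 0.8, "four": 0.2}, "def f(": {"x": 0.6, "n": 0.4}}

print("similarity:", nei_similarity(model_a, model_b))
print("distance:  ", nei_distance(model_a, model_b))
```

In practice the frequencies would be estimated by sampling each model repeatedly on each context, so only black-box access is needed.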
Workflow Steps:
- Probing contexts are selected carefully (using, e.g., OpenWebMath for reasoning, MBXP for code), and each model is sampled repeatedly on these “genes.”
- The output token distributions are compared and assembled into a distance matrix.
- Neighbor-Joining or similar algorithms reconstruct dendrograms of LLM “phylogeny,” visualizing lineage, fine-tuning clusters, and data-sharing patterns.
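The tree-reconstruction step can be sketched with a plain-Python neighbor-joining routine. This is an illustrative implementation of the textbook algorithm, not the authors' code; the model labels and the distance matrix are invented (the distances are additive on a four-leaf tree so the result is easy to check by hand).

```python
from itertools import combinations

def neighbor_joining(labels, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    labels: list of leaf names; dist: list of lists with dist[i][j] >= 0.
    Returns a list of (node, node, branch_length) edges.
    """
    # Work on a dict-of-dicts copy keyed by node name.
    d = {a: {b: dist[i][j] for j, b in enumerate(labels) if j != i}
         for i, a in enumerate(labels)}
    edges, next_id = [], 0
    while len(d) > 2:
        n = len(d)
        total = {a: sum(d[a].values()) for a in d}
        # Pick the pair minimizing the Q criterion.
        a, b = min(
            combinations(d, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        new = f"internal_{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, new, len_a), (b, new, len_b)]
        # Distances from the new internal node to every remaining node.
        d_new = {c: 0.5 * (d[a][c] + d[b][c] - d[a][b])
                 for c in d if c not in (a, b)}
        for c in (a, b):
            del d[c]
        for c in d:
            del d[c][a], d[c][b]
            d[c][new] = d_new[c]
        d[new] = d_new
    a, b = d  # two nodes remain; connect them directly
    edges.append((a, b, d[a][b]))
    return edges

# Hypothetical pairwise distances between four models.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 3, 5, 8],
    [3, 0, 6, 9],
    [5, 6, 0, 5],
    [8, 9, 5, 0],
]
edges = neighbor_joining(labels, dist)
for u, v, length in edges:
    print(f"{u} -- {v}: {length}")
```

Because real behavioral distances are rarely additive, the recovered branch lengths are best read as a summary of the distance matrix rather than as literal lineage times.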
Predictive Modeling:
PhyloLM embeds models into a low-dimensional space using multidimensional scaling on the distance matrix and then fits a small neural regressor to predict performance on benchmarks such as MMLU and ARC. The high correlation reported between predicted and actual scores highlights the informativeness of phylogenetic distances for performance estimation.
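A rough sketch of this pipeline follows, using classical (Torgerson) multidimensional scaling in NumPy. Two simplifications are assumptions of this example rather than the published setup: a least-squares linear model stands in for the small neural regressor, and the distances and benchmark scores are synthetic.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed items so that Euclidean distances approximate `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)             # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))   # ignore negative eigenvalues
    return eigvecs[:, top] * scale

def fit_linear_regressor(coords, scores):
    """Least-squares fit of scores on coordinates plus an intercept."""
    design = np.column_stack([coords, np.ones(len(coords))])
    weights, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return weights

def predict(weights, coords):
    design = np.column_stack([coords, np.ones(len(coords))])
    return design @ weights

# Synthetic stand-ins: hidden 2D positions, their pairwise distances,
# and benchmark scores that depend linearly on the positions.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
scores = points @ np.array([0.3, -0.2]) + 0.5

coords = classical_mds(dist, n_components=2)
weights = fit_linear_regressor(coords, scores)
print(np.round(predict(weights, coords) - scores, 6))
```

With real distance matrices the embedding is only approximate, and predictive quality should be measured on held-out models rather than on the fitted ones.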
Empirical Patterns:
- Families (e.g., Llama, Mistral) cluster by version or fine-tuning approach.
- Isolated evolutionary branches (e.g., GPT-4) indicate distinctive generative signatures.
- Correlated behavioral landscapes can be inferred without explicit model access or training data.
Limitations: The accuracy of genetic distance metrics may be sensitive to the choice of probing contexts and the granularity of token alignment. The phylogenetic assumption of tree-like evolution can break down under fine-tuning and convergent model development, suggesting the need for alternative clustering or network-based representations (Yax et al., 6 Apr 2024).
2. Phylogenetic Latent Space Models for Network Data
The latent space interpretation of networks is augmented in (Pavone et al., 17 Feb 2025) by assigning each node a feature vector that is not only optimized for likelihood but also modeled as evolving along a phylogenetic tree via branching Brownian motion. This approach, labeled as a phylogenetic latent space model (“PhyloLM”), enables the joint inference of both latent positions and the multiscale modular hierarchy among nodes.
Key Model Elements:
- Data Likelihood: Each edge is Bernoulli with a connection probability that decreases with feature-space distance.
- Phylogenetic Prior: Latent features at the leaves are jointly Gaussian under branching Brownian motion, where shared ancestry (encoded via the ultrametric rooted tree) induces a covariance between nodes that grows with the length of their shared path from the root.
- Tree Inference: The tree is inferred via Bayesian sampling with Metropolis-within-Gibbs MCMC, using birth-death process priors and supporting flexible topologies.
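To make the prior and likelihood concrete, the sketch below builds the Brownian-motion covariance from a small ultrametric tree, draws latent features, and converts pairwise distances to edge probabilities. Several pieces are assumptions of this example and not taken from the paper: the tree itself, a root fixed at the origin, a unit diffusion rate, and the logistic link `logit(p_ij) = intercept - ||z_i - z_j||`.

```python
import numpy as np

# Hypothetical ultrametric tree: node -> (parent, time since the root).
TREE = {
    "root": (None, 0.0),
    "u": ("root", 0.6),
    "A": ("u", 1.0),
    "B": ("u", 1.0),
    "C": ("root", 1.0),
}
LEAVES = ["A", "B", "C"]

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][0]
    return path

def shared_time(a, b):
    """Time from the root to the most recent common ancestor of a and b."""
    anc_b = set(ancestors(b))
    mrca = next(n for n in ancestors(a) if n in anc_b)
    return TREE[mrca][1]

def bbm_covariance(leaves, sigma2=1.0):
    """Covariance of leaf features under Brownian motion along the tree."""
    return sigma2 * np.array([[shared_time(a, b) for b in leaves] for a in leaves])

def edge_probabilities(z, intercept=1.0):
    """Assumed link: logit(p_ij) = intercept - ||z_i - z_j||."""
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return 1.0 / (1.0 + np.exp(-(intercept - dists)))

cov = bbm_covariance(LEAVES)
rng = np.random.default_rng(0)
# Each latent dimension is an independent Gaussian draw with covariance `cov`.
z = rng.multivariate_normal(np.zeros(len(LEAVES)), cov, size=2).T  # (leaves, dims)
probs = edge_probabilities(z)
print(cov)
print(np.round(probs, 3))
```

Leaves A and B share 60% of their path from the root, so their features are positively correlated and they tend to sit closer together than either does to C.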
Theoretical Guarantees:
- Identifiability: Under the conditions stated in (Pavone et al., 17 Feb 2025), the edge-probability structure is uniquely determined (up to rotation/translation of features), and the parameters are identifiable.
- Posterior Consistency: For multiple networks on a fixed node set, the posterior concentrates on the true data-generating parameters as the number of observed networks increases.
Multiscale Structure:
- Branching events at the root isolate coarsest modules; subsequent splits define finer, nested submodules. Edge probabilities directly reflect these hierarchies via latent distances.
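The nesting of modules can be illustrated by cutting a tree at different times: an early cut yields coarse modules, and a later cut refines them. The tree, node names, and cut times below are hypothetical, and the helper is a sketch of the general idea and not the inference procedure of the paper.

```python
from collections import defaultdict

# Hypothetical ultrametric tree: node -> (parent, time since the root).
TREE = {
    "root": (None, 0.0),
    "u": ("root", 0.2),
    "w": ("u", 0.7),
    "v": ("root", 0.5),
    "A": ("w", 1.0),
    "B": ("w", 1.0),
    "C": ("u", 1.0),
    "D": ("v", 1.0),
    "E": ("v", 1.0),
}
LEAVES = ["A", "B", "C", "D", "E"]

def modules_at(tree, leaves, cut_time):
    """Group leaves by the branch they descend from at `cut_time`."""
    groups = defaultdict(list)
    for leaf in leaves:
        node, branch = leaf, leaf
        # Walk toward the root; keep the earliest ancestor born after the cut.
        while node is not None and tree[node][1] > cut_time:
            branch = node
            node = tree[node][0]
        groups[branch].append(leaf)
    return sorted(groups.values())

coarse = modules_at(TREE, LEAVES, cut_time=0.1)
fine = modules_at(TREE, LEAVES, cut_time=0.6)
print(coarse)
print(fine)
```

Every fine module is contained in exactly one coarse module, which is the nested structure that a single tree encodes across scales.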
Empirical Demonstration:
| Data Type | Number of Nodes (V) | Networks (M) | Recovered Structure |
|---|---|---|---|
| ’Ndrangheta criminal group | 84 | 10 | “Locali” clusters, role subclades, microstructure |
| Brain connectome | 68 | 40 | Frontal/posterior, hemispheric, lobar, limbic branches |
This granularity in modularity is not recovered by blockmodels or standard (“flat prior”) latent spaces (Pavone et al., 17 Feb 2025).
Extensions: Potential generalizations include Poisson/count models for weighted graphs, trait–covariate hybrid trees, non-Euclidean latent spaces, dynamic/multilayer settings, and tree–blockmodel hybrids.
3. Phylogenetic Language Modeling in Genomics
The genomic PhyloLM framework unifies evolutionary modeling and deep learning for DNA sequence analysis, as realized in the PhyloGPN model of (Albors et al., 4 Mar 2025). Here, masked language modeling is reframed as predicting the parameters of an explicit phylogenetic substitution model.
Core Methodology:
- Input: A one-hot encoded DNA window (4 × 481 bp) from the human genome, with aligned columns from up to 447 placental mammals and the corresponding phylogenetic tree.
- Objective: Predict the F81 model parameters that govern nucleotide substitution rates and the stationary distribution.
- Phylogenetic Likelihood: The model maximizes a conditional log-likelihood, using Felsenstein’s pruning algorithm to marginalize efficiently over ancestral states.
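For readers unfamiliar with the pruning algorithm, the following sketch computes the log-likelihood of one alignment column under an F81 model on a toy three-species tree. The tree, branch lengths, stationary distribution, and unit rate are invented for illustration, and the example computes the joint column likelihood only; the conditioning and the neural parameterization used in PhyloGPN are not reproduced here.

```python
import math
from itertools import product

BASES = "ACGT"

def f81_transition(pi, t, rate=1.0):
    """F81 transition matrix P[i][j] for a branch of length t."""
    stay = math.exp(-rate * t)
    return [[stay * (i == j) + (1.0 - stay) * pi[j] for j in range(4)]
            for i in range(4)]

def partial_likelihood(node, pi, column):
    """Felsenstein pruning: L[x] = P(observed leaves below node | node state x).

    A node is either a leaf name (str) or a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if BASES[x] == column[node] else 0.0 for x in range(4)]
    result = [1.0] * 4
    for child, t in node:
        p = f81_transition(pi, t)
        child_l = partial_likelihood(child, pi, column)
        for x in range(4):
            result[x] *= sum(p[x][y] * child_l[y] for y in range(4))
    return result

def column_log_likelihood(tree, pi, column):
    """Log-likelihood of one column, with the root state drawn from pi."""
    root_l = partial_likelihood(tree, pi, column)
    return math.log(sum(pi[x] * root_l[x] for x in range(4)))

# Toy tree ((human:0.1, chimp:0.1):0.2, mouse:0.5); all values are illustrative.
leaves = ["human", "chimp", "mouse"]
tree = (((("human", 0.1), ("chimp", 0.1)), 0.2), ("mouse", 0.5))
pi = [0.3, 0.2, 0.2, 0.3]  # stationary frequencies for A, C, G, T

conserved = column_log_likelihood(tree, pi, {"human": "A", "chimp": "A", "mouse": "A"})
variable = column_log_likelihood(tree, pi, {"human": "A", "chimp": "C", "mouse": "A"})

# Sanity check: probabilities over all 64 possible columns sum to one.
total = sum(
    math.exp(column_log_likelihood(tree, pi, dict(zip(leaves, col))))
    for col in product(BASES, repeat=3)
)
print(conserved, variable, total)
```

The recursion visits each branch once per column, so the cost grows linearly with the number of species instead of exponentially with the number of ancestral states.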
Loss and Regularization:
- The main loss combines the negative conditional log-likelihood with a root-state regularizer to avoid degeneracy.
- To ensure numerical stability for large branch lengths, the probability of substitution is bounded from below.
Architecture:
- Adaptation of the ByteNet/CARP backbone for 1D DNA sequence: 40 dilated residual convolutional blocks (kernel size 7, exponentially increasing dilation), embedding dimension 960, and reverse-complement equivariance (RCE) enforced by tied weights.
- 83M parameters (41M free after RCE weight tying).
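Weight tying for reverse-complement equivariance can be sketched on a single convolution layer. This is one common construction and is offered only as an assumption about the general idea; the layer sizes, helper names, and channel-ordering convention (reversing the channel axis swaps complementary pairs) are illustrative and not taken from the PhyloGPN code.

```python
import numpy as np

def one_hot(seq):
    """One-hot encode DNA as a (4, length) array with channel order A, C, G, T."""
    idx = np.array(["ACGT".index(b) for b in seq])
    return np.eye(4)[idx].T

def reverse_complement(x):
    """Reverse complement of a (channels, length) array.

    Assumes reversing the channel axis swaps complementary pairs,
    which holds for the A, C, G, T ordering used above.
    """
    return x[::-1, ::-1]

def tie_weights(free):
    """Expand free weights (half_out, c_in, k) into a tied kernel (2 * half_out, c_in, k).

    The second half mirrors the first along all three axes, so only half of
    the kernel entries are free parameters.
    """
    return np.concatenate([free, free[::-1, ::-1, ::-1]], axis=0)

def conv1d_same(x, w):
    """Zero-padded 'same' 1D cross-correlation; x: (c_in, L), w: (c_out, c_in, k), k odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    length = x.shape[1]
    out = np.zeros((c_out, length))
    for j in range(k):
        out += w[:, :, j] @ xp[:, j:j + length]
    return out

rng = np.random.default_rng(0)
w = tie_weights(rng.normal(size=(3, 4, 7)))  # 6 output channels, kernel size 7
x = one_hot("ACGTTGCAAGGCT")
y = conv1d_same(x, w)
y_rc = conv1d_same(reverse_complement(x), w)
print(np.allclose(y_rc, reverse_complement(y)))
```

Because the output channels follow the same mirrored convention as the input, layers built this way can be stacked while preserving equivariance, and the halved count of free weights is consistent with the reduction from 83M to 41M parameters noted above.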
Training Regime:
- Data: 447-mammal Zoonomia whole-genome alignment (WGA) to the human reference; windows sampled to balance sex chromosomes and autosomes.
- Optimizer: AdamW, LR , batch size 12 windows per GPU, trained on 4×A100 GPUs for 18 epochs.
Inference:
- Once trained, PhyloGPN requires only the single-sequence window for prediction, enabling broad applicability without multi-species alignments at prediction time.
Performance Benchmarks:
| Task | Metric | PhyloGPN | Best Baseline |
|---|---|---|---|
| ClinVar (pathogenicity) | AUROC | 0.94–0.97 | 0.55–0.72 |
| OMIM regulatory (pathogenic vs. common) | AUPRC | 0.25–0.45 | 0.10–0.30 |
| DMS protein assays (average) | Spearman ρ | 0.25 | ~0.05 |
| BEND (gene finding) | MCC | 0.68 | 0.68 |
| Chromatin accessibility | AUROC | 0.86 | ≤ 0.86 |
| Disease VEP (embeddings) | AUROC | 0.96 | 0.77 (Nucleotide Transformer) |
Ablations indicate robust generalization, with half-genome training sufficient to approach full-data performance.
Limitations: Short receptive fields (481 bp) limit long-range context utility, though pooling over longer spans (PhyloGPN-X) partially mitigates this. The F81 model, though tractable, is less expressive than more complex nucleotide substitution models (e.g., GTR), suggesting that further expressivity could capture richer context-dependent biases (Albors et al., 4 Mar 2025).
4. Comparative Table of PhyloLM Methodologies
| Domain | Structure Modeled | Phylogenetic Mechanism | Output/Insight |
|---|---|---|---|
| LLM Benchmarking (Yax et al., 6 Apr 2024) | Model relationships | Nei-style genetic distance; dendrogram | Lineage, cluster, performance prediction |
| Network Latent Space (Pavone et al., 17 Feb 2025) | Node features, modularity | BBM on tree; latent feature evolution | Hierarchies, uncertainty quantification |
| Genomic Modeling (Albors et al., 4 Mar 2025) | Sequence evolution | Markov substitution on known phylogeny | Variant pathogenicity, functional annotation |
5. Extensions, Limitations, and Outlook
PhyloLM, as reflected in these frameworks, demonstrates the power of phylogenetic abstraction: treating heterogeneous objects (neural models, network nodes, or DNA loci) as units with shared ancestry or behavior. Limitations are domain-specific: metric sensitivity and tree assumptions for model comparison (Yax et al., 6 Apr 2024), and tractability and expressivity for network and genomic settings (Pavone et al., 17 Feb 2025; Albors et al., 4 Mar 2025). Extensions include incorporating broader or cross-domain data, more expressive or multitrait substitution processes, dynamic network and sequence evolution, and non-treelike or non-Euclidean relationships.
A plausible implication is that as phylogenetic methods broaden into machine learning and data science, they will provide principled frameworks for interpretability, zero-cost benchmarking, hierarchical modularity, and transfer learning. These advances continue to blur the boundaries between evolutionary theory and statistical learning.