
PhyloLM: Phylogenetic Modeling Frameworks

Updated 1 December 2025
  • PhyloLM is a set of frameworks that integrate phylogenetic methods with language, network, and genomic models to reveal latent evolutionary relationships.
  • It employs techniques like Nei-style distance metrics, branching Brownian motion, and Markov substitution models to quantify model divergence and infer modular hierarchies.
  • These methods improve predictive accuracy and interpretability across domains, from benchmarking LLM performance to assessing variant pathogenicity in genomic sequences.

PhyloLM denotes several modern frameworks that integrate phylogenetic principles with modeling tasks across LLMs, network analysis, and genomics. Each formulation leverages the concept of evolutionary relationship—whether among species, neural network models, or network nodes—to improve inference, interpretability, or benchmarking. The three principal developments of PhyloLM, described in (Yax et al., 6 Apr 2024), (Pavone et al., 17 Feb 2025), and (Albors et al., 4 Mar 2025), illustrate the versatility of phylogenetic methods across domains: (1) mapping the evolutionary landscape of LLMs, (2) inferring nested modular hierarchies in network data via latent feature evolution, and (3) training genomic LLMs by directly optimizing phylogenetic likelihoods.

1. PhyloLM for Benchmarking and Relationship Discovery among LLMs

The approach of treating LLMs as populations whose “genetic” makeup is reflected in their generative behavior was introduced in (Yax et al., 6 Apr 2024). This implementation adapts phylogenetic algorithms from population genetics to address two challenges: systematically comparing LLMs and predicting their performance in benchmarks given only black-box access, with minimal reliance on explicit training information.

Conceptual Framework:

  • LLMs as Populations: Each LLM is regarded as a population of probabilistic token-generation behaviors.
  • Token Distributions as Allele Frequencies: For a fixed set of short contexts (“genes”), models are probed for next-token distributions (“alleles”).
  • Distance Metric: The normalized Nei-style similarity,

$$S_{ij} = \frac{\sum_{g \in G} \sum_{a \in A_g} P_i(a \mid g)\, P_j(a \mid g)}{\sqrt{\left(\sum_{g \in G} \sum_{a \in A_g} P_i(a \mid g)^2\right)\left(\sum_{g \in G} \sum_{a \in A_g} P_j(a \mid g)^2\right)}}, \qquad D_{ij} = -\log S_{ij}$$

quantifies behavioral divergence across models $i$ and $j$.
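As a concrete illustration, $S_{ij}$ and $D_{ij}$ can be computed directly from sampled next-token distributions. Below is a minimal NumPy sketch, assuming each model's distributions over the probing contexts have been stacked into a genes-by-vocabulary array (the inputs `P_i` and `P_j` are hypothetical data structures; the paper estimates these probabilities by repeatedly sampling each model):

```python
import numpy as np

def nei_distance(P_i, P_j):
    """Nei-style distance between two LLMs from their next-token
    distributions. P_i, P_j: arrays of shape (num_genes, vocab_size);
    row g holds a model's next-token probabilities for probing context g.
    Returns D_ij = -log S_ij, with S_ij the normalized similarity above."""
    num = np.sum(P_i * P_j)                             # sum over genes g and alleles a
    den = np.sqrt(np.sum(P_i ** 2) * np.sum(P_j ** 2))  # normalization term
    return -np.log(num / den)
```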

Workflow Steps:

  • Careful selection of probing contexts (using, e.g., OpenWebMath for reasoning, MBXP for code); each model is sampled repeatedly in these “genes.”
  • The output token distributions are compared and assembled into a distance matrix.
  • Neighbor-Joining or similar algorithms reconstruct dendrograms of LLM “phylogeny,” visualizing lineage, fine-tuning clusters, and data-sharing patterns (see the sketch below).
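A minimal sketch of this reconstruction step, assuming pairwise distances $D_{ij}$ have already been assembled into a symmetric matrix with zero diagonal; it uses scikit-bio's neighbor-joining routine as a stand-in for whatever tree-building tooling the authors used:

```python
import numpy as np
from skbio import DistanceMatrix
from skbio.tree import nj

def reconstruct_llm_phylogeny(D, model_names):
    """D: symmetric (n, n) array of distances D_ij with zero diagonal;
    model_names: list of n model labels, e.g. ["llama-2-7b", "mistral-7b", ...]."""
    dm = DistanceMatrix(D, ids=model_names)
    tree = nj(dm)               # Neighbor-Joining reconstruction
    print(tree.ascii_art())     # quick text rendering of the dendrogram
    return tree
```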

Predictive Modeling:

PhyloLM embeds models into a low-dimensional space using multidimensional scaling on $D_{ij}$ and then fits a small neural regressor to predict performance on benchmarks such as MMLU and ARC. High correlation ($R^2 \gg 0.8$) between predicted and actual scores highlights the informativeness of phylogenetic distances for performance estimation.
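A sketch of this two-stage predictor using scikit-learn; the embedding dimension, regressor size, and train/query split are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.neural_network import MLPRegressor

def predict_benchmark_scores(D, known_idx, known_scores, query_idx):
    """D: (n, n) symmetric matrix of phylogenetic distances D_ij.
    known_scores: benchmark scores (e.g., MMLU accuracy) for models in known_idx.
    Returns predicted scores for the models in query_idx."""
    # Stage 1: embed all models jointly from the precomputed distance matrix.
    embed = MDS(n_components=8, dissimilarity="precomputed", random_state=0)
    X = embed.fit_transform(D)
    # Stage 2: small neural regressor from embedding coordinates to score.
    reg = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
    reg.fit(X[known_idx], known_scores)
    return reg.predict(X[query_idx])
```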

Empirical Patterns:

  • Families (e.g., Llama, Mistral) cluster by version or fine-tuning approach.
  • Isolated evolutionary branches (e.g., GPT-4) indicate distinctive generative signatures.
  • Correlated behavioral landscapes can be inferred without explicit model access or training data.

Limitations: Genetic distance estimates may be sensitive to the choice of probing contexts and the granularity of token alignment. The phylogenetic assumption of tree-like evolution can break down under fine-tuning and convergent model development, suggesting the need for alternative clustering or network-based representations (Yax et al., 6 Apr 2024).

2. Phylogenetic Latent Space Models for Network Data

The latent space interpretation of networks is augmented in (Pavone et al., 17 Feb 2025) by assigning each node a feature vector that is not only optimized for likelihood but also modeled as evolving along a phylogenetic tree via branching Brownian motion. This approach, labeled as a phylogenetic latent space model (“PhyloLM”), enables the joint inference of both latent positions and the multiscale modular hierarchy among nodes.

Key Model Elements:

  • Data Likelihood: Each edge $y_{vu}$ is Bernoulli with parameter $\theta_{vu} = \operatorname{expit}(a - \|z_v - z_u\|)$, so that connection probability decreases with feature-space distance.
  • Phylogenetic Prior: Latent features $z_v$ at the $V$ leaves are jointly distributed as $N_V(\mu 1_V, \sigma^2 \Sigma_\Upsilon)$, where shared ancestry (encoded via the ultrametric rooted tree $\Upsilon$) induces covariance $\operatorname{Cov}(z_{kv}, z_{ku}) = \sigma^2 t_{vu}$, with $t_{vu}$ the path length shared by leaves $v$ and $u$.
  • Tree Inference: The tree $\Upsilon$ is inferred via Bayesian sampling with Metropolis-within-Gibbs MCMC, using birth-death process priors and supporting flexible topologies. (A generative sketch of the first two elements follows this list.)
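The sketch below illustrates the two likelihood ingredients, assuming the shared-ancestry matrix `Sigma_tree` (entry $(v,u)$ equal to $t_{vu}$) has already been computed from a candidate tree; the full method additionally samples $\Upsilon$ itself by MCMC:

```python
import numpy as np
from scipy.special import expit  # the logistic (expit) link used above

rng = np.random.default_rng(0)

def sample_latent_features(Sigma_tree, K, mu=0.0, sigma2=1.0):
    """Draw K latent coordinates per node under the phylogenetic prior:
    each latent dimension is an independent branching-Brownian-motion trait,
    so Cov(z_kv, z_ku) = sigma2 * t_vu. Returns a (V, K) array."""
    V = Sigma_tree.shape[0]
    Z = rng.multivariate_normal(np.full(V, mu), sigma2 * Sigma_tree, size=K)
    return Z.T

def edge_log_likelihood(Y, Z, a):
    """Bernoulli log-likelihood of an undirected adjacency matrix Y given
    latent positions Z: theta_vu = expit(a - ||z_v - z_u||)."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    theta = expit(a - dists)
    iu = np.triu_indices_from(Y, k=1)      # count each dyad once
    y, th = Y[iu], theta[iu]
    return np.sum(y * np.log(th) + (1 - y) * np.log1p(-th))
```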

Theoretical Guarantees:

  • Identifiability: For $V > 2K + 1$ (with $K$ the latent dimension), the edge-probability structure is uniquely determined (up to rotation and translation of the features), and the parameters $(\sigma^2, \Upsilon)$ are identifiable.
  • Posterior Consistency: For multiple networks on a fixed node set, the posterior concentrates on the true $(\sigma^2, \Upsilon)$ as the number of observed networks $M$ increases.

Multiscale Structure:

  • Branching events at the root isolate coarsest modules; subsequent splits define finer, nested submodules. Edge probabilities directly reflect these hierarchies via latent distances.

Empirical Demonstration:

| Data Type | Nodes ($V$) | Networks ($M$) | Recovered Structure |
|---|---|---|---|
| ’Ndrangheta criminal group | 84 | 10 | “Locali” clusters, role subclades, microstructure |
| Brain connectome | 68 | 40 | Frontal/posterior, hemispheric, lobar, limbic branches |

This level of modular granularity is not recovered by blockmodels or standard (“flat prior”) latent space models (Pavone et al., 17 Feb 2025).

Extensions: Potential generalizations include Poisson/count models for weighted graphs, trait–covariate hybrid trees, non-Euclidean latent spaces, dynamic/multilayer settings, and tree–blockmodel hybrids.

3. Phylogenetic Language Modeling in Genomics

The genomic PhyloLM framework unifies evolutionary modeling and deep learning for DNA sequence analysis, as realized in the PhyloGPN model of (Albors et al., 4 Mar 2025). Here, masked language modeling is reframed as predicting the parameters of an explicit phylogenetic substitution model.

Core Methodology:

  • Input: A one-hot encoded DNA window $x^{(i)}$ (4 × 481 bp) from the human genome, with aligned columns $y^{(i)}$ from up to 447 placental mammals and the corresponding phylogenetic tree $T^{(i)}$.
  • Objective: Predict F81 model parameters $\theta = (\theta_A, \theta_C, \theta_G, \theta_T)$ governing nucleotide substitution rates and the stationary distribution $\pi_a(\theta) = \operatorname{softmax}(\theta)_a$.
  • Phylogenetic Likelihood: The model maximizes the conditional log-likelihood $\log P_{\mathrm{F81}}(y^{(i)} \mid \theta, T^{(i)})$, using Felsenstein’s pruning algorithm for efficient marginalization over ancestral states (a toy implementation follows this list).
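For intuition, here is a toy implementation of the pruning recursion under an unnormalized F81 kernel on a hypothetical three-leaf tree with assumed branch lengths; PhyloGPN's production likelihood is vectorized over alignment columns and hundreds of species, with $\theta$ emitted by the network $f_W$:

```python
import numpy as np

BASES = "ACGT"

def f81_transition(t, pi):
    """F81 transition matrix P(a -> b | t): with probability exp(-t) no
    substitution event occurs along the branch; otherwise the new base is
    drawn from the stationary distribution pi. (Rate normalization omitted.)"""
    p = np.exp(-t)
    return p * np.eye(4) + (1.0 - p) * pi[None, :]

def pruning_log_likelihood(node, pi, leaf_obs):
    """Felsenstein's pruning algorithm. A node is either a leaf name (str)
    or a list of (child, branch_length) pairs. leaf_obs maps leaf name to
    its observed base. Returns log P(observed leaves | pi, tree)."""
    def partial(n):
        if isinstance(n, str):              # leaf: one-hot partial likelihood
            L = np.zeros(4)
            L[BASES.index(leaf_obs[n])] = 1.0
            return L
        L = np.ones(4)
        for child, t in n:                  # combine messages from children
            L *= f81_transition(t, pi) @ partial(child)
        return L
    return float(np.log(pi @ partial(node)))

# Toy column: ((human:0.1, chimp:0.1):0.2, mouse:0.4), with pi = softmax(theta)
theta = np.array([1.0, 0.2, 0.2, 1.0])
pi = np.exp(theta) / np.exp(theta).sum()
tree = [([("human", 0.1), ("chimp", 0.1)], 0.2), ("mouse", 0.4)]
print(pruning_log_likelihood(tree, pi, {"human": "A", "chimp": "A", "mouse": "G"}))
```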

Loss and Regularization:

  • The main loss combines negative conditional log-likelihood and a root-state regularizer to avoid degeneracy:

$$\mathcal{L}_0(W) = -\frac{1}{n} \sum_{i=1}^n \log P_{\mathrm{F81}}\!\left(y^{(i)} \mid f_W(x^{(i)}),\, T^{(i)}\right), \qquad \mathcal{L}(W) = \mathcal{L}_0(W) + \frac{1}{n}\sum_{i=1}^n \log \pi_{\mathrm{ref}}^{(i)}$$

  • To ensure numerical stability for large branch lengths, the probability of substitution $\alpha(t;\theta)$ is lower-bounded by $\operatorname{sigmoid}(\log t + \sum_a \theta_a)$ (see the sketch below).
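One way to realize this clamp; whether the authors apply a hard maximum, as here, or a smooth surrogate is a detail not reproduced in this summary:

```python
import numpy as np

def stabilized_alpha(alpha_raw, t, theta):
    """Floor the substitution probability alpha(t; theta) at
    sigmoid(log t + sum_a theta_a) so its gradient does not underflow
    on very long branches. alpha_raw is the unclamped value."""
    floor = 1.0 / (1.0 + np.exp(-(np.log(t) + np.sum(theta))))
    return np.maximum(alpha_raw, floor)
```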

Architecture:

  • Adaptation of the ByteNet/CARP backbone to 1D DNA sequence: 40 dilated residual convolutional blocks (kernel size 7, exponentially increasing dilation), embedding dimension 960, with reverse-complement equivariance enforced by weight tying (sketched below).
  • $\approx$83M parameters (41M free after reverse-complement weight tying).
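A PyTorch sketch of one such block and a short stack; the internal ordering of activations and convolutions and the exact dilation schedule are assumptions loosely following CARP-style blocks, and reverse-complement weight tying is omitted for brevity:

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """One dilated residual conv block over (batch, channels, length) DNA
    embeddings; padding is chosen so the 481 bp window length is preserved."""
    def __init__(self, dim=960, kernel_size=7, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.net = nn.Sequential(
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=1),  # pointwise channel mixing
        )

    def forward(self, x):
        return x + self.net(x)                   # residual connection

# The full model stacks 40 such blocks; a short stack keeps the demo light.
blocks = nn.Sequential(*[
    DilatedResidualBlock(dim=960, kernel_size=7, dilation=2 ** i)
    for i in range(5)
])
x = torch.randn(1, 960, 481)                     # one embedded 481 bp window
print(blocks(x).shape)                           # torch.Size([1, 960, 481])
```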

Training Regime:

  • Data: 447-mammal Zoonomia whole-genome alignment (WGA) to the human reference; windows sampled to balance sex chromosomes and autosomes.
  • Optimizer: AdamW, learning rate $10^{-5}$, batch size 12 windows per GPU, trained on 4×A100 GPUs for 18 epochs.

Inference:

  • Once trained, PhyloGPN requires only the single-sequence window for prediction, enabling broad applicability without multi-species alignments at prediction time.

Performance Benchmarks:

| Task | Metric | PhyloGPN | Best Baseline |
|---|---|---|---|
| ClinVar (pathogenicity) | AUROC | 0.94–0.97 | 0.55–0.72 |
| OMIM regulatory (pathogenic vs. common) | AUPRC | 0.25–0.45 | 0.10–0.30 |
| DMS protein assays (average) | Spearman ρ | 0.25 | ~0.05 |
| BEND (gene finding) | MCC | 0.68 | 0.68 |
| Chromatin accessibility | AUROC | 0.86 | ≤ 0.86 |
| Disease VEP (embeddings) | AUROC | 0.96 | 0.77 (Nucleotide Transformer) |

Ablations indicate robust generalization, with half-genome training sufficient to approach full-data performance.

Limitations: Short receptive fields (481 bp) limit long-range context utility, though pooling over longer spans (PhyloGPN-X) partially mitigates this. The F81 model, though tractable, is less expressive than more complex nucleotide substitution models (e.g., GTR), suggesting that further expressivity could capture richer context-dependent biases (Albors et al., 4 Mar 2025).

4. Comparative Table of PhyloLM Methodologies

| Domain | Structure Modeled | Phylogenetic Mechanism | Output/Insight |
|---|---|---|---|
| LLM benchmarking (Yax et al., 6 Apr 2024) | Model relationships | Nei-style genetic distance; dendrogram | Lineage, clusters, performance prediction |
| Network latent space (Pavone et al., 17 Feb 2025) | Node features, modularity | Branching Brownian motion on a tree; latent feature evolution | Hierarchies, uncertainty quantification |
| Genomic modeling (Albors et al., 4 Mar 2025) | Sequence evolution | Markov substitution on a known phylogeny | Variant pathogenicity, functional annotation |

5. Extensions, Limitations, and Outlook

PhyloLM, as reflected in these frameworks, demonstrates the power of phylogenetic abstraction: treating heterogeneous objects—be they neural models, network nodes, or DNA loci—as units with shared ancestry or behavior. Limitations are domain-specific: metric sensitivity and tree assumptions for model comparison (Yax et al., 6 Apr 2024); tractability and expressivity for the network and genomic settings (Pavone et al., 17 Feb 2025, Albors et al., 4 Mar 2025). Extensions include incorporating broader or cross-domain data, more expressive or multitrait substitution processes, dynamic network and sequence evolution, and non-treelike or non-Euclidean relationships.

A plausible implication is that as phylogenetic methods broaden into machine learning and data science, they will provide principled frameworks for interpretability, zero-cost benchmarking, hierarchical modularity, and transfer learning. These advances continue to blur the boundaries between evolutionary theory and statistical learning.
