Evolutionary Tree of LLMs

Updated 5 October 2025

Evolutionary Tree of LLMs is a framework that maps the genealogy of large language models using concepts from genetics, phylogenetics, and clustering.
Methodologies extract key features through token probabilities, semantic embeddings, and TF-IDF, which are then used to construct phylogenetic trees and reveal hidden model lineages.
Practical applications include benchmarking model performance, guiding model selection, and uncovering undocumented relationships to enhance trust and management in dynamic model ecosystems.

LLMs have proliferated rapidly since late 2022, resulting in a diverse and dynamic landscape that is both technically complex and evolutionarily opaque. Tracing the evolutionary relationships among thousands of LLMs—spanning different architectures, training regimes, and adaptation strategies—has become a central research challenge. Multiple methodologies have been introduced, inspired by genetics, phylogenetics, and clustering theory, to map, analyze, and understand these model lineages and their functional divergence.

1. Definitions and Formal Analogies

The notion of an "evolutionary tree" for LLMs captures the genealogy and relatedness between models based on function, architecture, and adaptation. The analogy to genetics is central in several recent works.

In PhyloLM (Yax et al., 6 Apr 2024), each test context plays the role of a “gene” and each token the model can emit in that context plays the role of an allele. A model is thus a “population” described by $P(t|c)$, the conditional probability of token $t$ under context $c$. Model relatedness is computed using Nei’s genetic distance, a standard in population genetics, reformulated for LLMs:

$S(P_1, P_2) = \frac{\sum_{g \in G} \sum_{a \in A_g} P_1(a|g) P_2(a|g)}{\sqrt{\left[\sum_{g \in G} \sum_{a \in A_g} P_1(a|g)^2 \right]\left[\sum_{g \in G} \sum_{a \in A_g} P_2(a|g)^2\right]}}$

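Here $G$ is the set of genes (contexts) and $A_g$ is the set of alleles (tokens) for gene $g$. The phylogenetic distance is then $D(P_1, P_2) = -\log S(P_1, P_2)$, so identical output distributions give $S = 1$ and $D = 0$.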
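The similarity and distance above can be computed directly from per-context token distributions. The following is a minimal sketch, not PhyloLM's implementation: the nested-dictionary layout, the function names, and the toy probabilities are all assumptions made for illustration, and both models are assumed to be probed on the same set of contexts.

```python
import math


def nei_similarity(p1, p2):
    """Normalized overlap S between two models.

    p1, p2: dict mapping gene (context) -> dict mapping allele (token) -> probability.
    Assumption: both models were probed on the same set of genes.
    """
    overlap = sq1 = sq2 = 0.0
    for gene in p1:
        a1, a2 = p1[gene], p2[gene]
        # Tokens missing from one model's distribution count as probability 0.
        for allele in a1.keys() | a2.keys():
            x = a1.get(allele, 0.0)
            y = a2.get(allele, 0.0)
            overlap += x * y
            sq1 += x * x
            sq2 += y * y
    return overlap / math.sqrt(sq1 * sq2)


def nei_distance(p1, p2):
    """D = -log(S); infinite when the two models share no probability mass."""
    s = nei_similarity(p1, p2)
    return math.inf if s == 0.0 else -math.log(s)


# Toy distributions (made up for illustration, not measured from real models).
model_a = {"2 + 2 =": {"4": 0.9, "5": 0.1}, "The sky is": {"blue": 0.8, "grey": 0.2}}
model_b = {"2 + 2 =": {"4": 0.8, "5": 0.2}, "The sky is": {"blue": 0.7, "grey": 0.3}}
model_c = {"2 + 2 =": {"four": 1.0}, "The sky is": {"blue": 0.1, "green": 0.9}}

d_ab = nei_distance(model_a, model_b)
d_ac = nei_distance(model_a, model_c)
```

In this toy setup `d_ab` is smaller than `d_ac`, because `model_a` and `model_b` place their probability mass on the same tokens while `model_c` mostly does not.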

LLM DNA (Wu et al., 29 Sep 2025) formalizes a model's functional behavior as a vector $\tau_f$ in a low-dimensional space, obtained through a bi-Lipschitz mapping:

$c_1 d_H(f_1, f_2) \le \|\tau_{f_1} - \tau_{f_2}\|_2 \le c_2 d_H(f_1, f_2)$

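Here $d_H$ is the Hilbert-space metric between models' outputs over a sampled distribution of prompts, and $c_1, c_2 > 0$ are constants. Because distances between DNA vectors are bounded above and below by functional distances, the mapping supports phylogenetic analysis in high-throughput, training-free settings.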
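One standard way to obtain an approximately distance-preserving map of this kind is a random Gaussian projection, as in the Johnson–Lindenstrauss lemma. The sketch below is an illustration under stated assumptions rather than the LLM DNA pipeline: random vectors stand in for high-dimensional model representations, plain Euclidean distance stands in for $d_H$, and the dimensions (512 to 128) and helper names are arbitrary choices. It measures the smallest and largest ratio of projected to original distance, which play the role of empirical $c_1$ and $c_2$ on the sampled points.

```python
import math
import random


def gaussian_projection(dim_in, dim_out, rng):
    """Random matrix with N(0, 1/dim_out) entries (dim_out rows, dim_in columns)."""
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vec):
    """Matrix-vector product: maps a dim_in vector to a dim_out vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distortion_bounds(points, matrix):
    """Min and max of (projected distance / original distance) over all pairs."""
    projected = [project(matrix, p) for p in points]
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratios.append(euclidean(projected[i], projected[j])
                          / euclidean(points[i], points[j]))
    return min(ratios), max(ratios)


rng = random.Random(0)
# Stand-ins for high-dimensional functional representations of 8 models.
points = [[rng.gauss(0.0, 1.0) for _ in range(512)] for _ in range(8)]
matrix = gaussian_projection(512, 128, rng)
c1_hat, c2_hat = distortion_bounds(points, matrix)
```

Both ratios stay close to 1 for a sufficiently large output dimension, which is the empirical counterpart of the two-sided bound above.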

2. Data Extraction and Feature Construction

Various techniques are employed for extracting meaningful features from LLMs:

In On the Origin of LLMs (Gao et al., 2023), feature extraction from 15,821 Hugging Face models proceeds via n-gram tokenization of standardized model names. TF-IDF weighting enhances discriminative substrings, enabling subsequent clustering.
LLM DNA (Wu et al., 29 Sep 2025) builds on semantic embeddings using sentence-transformers, concatenating a set of prompt-response pairs per model to obtain high-dimensional functional representations, then projecting down via random Gaussian matrices justified by the Johnson–Lindenstrauss lemma.
PhyloLM (Yax et al., 6 Apr 2024) ensures that tasks (genes) are extracted from non-contaminated datasets (such as open-web-math, MBXP) so outputs do not leak training-set correlations.

This stage is crucial for all methodologies, as the quality and scope of extracted features directly impact the eventual grouping, lineage inference, and functional predictions.
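To make the name-based feature extraction concrete, the sketch below builds TF-IDF vectors over character n-grams of model names and compares them by cosine similarity. It is an assumption-laden illustration, not the procedure of Gao et al.: lowercasing is the only "standardization" applied, trigrams and the smoothed IDF formula are arbitrary choices, and the three name strings are illustrative inputs rather than data from the study.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Overlapping character n-grams of a lowercased model name."""
    s = name.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf_vectors(names, n=3):
    """One L2-normalized sparse TF-IDF vector (dict) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    total = len(names)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumed variant): rare substrings get larger weights.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
               for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["llama-2-7b-chat", "llama-2-13b-chat", "bert-base-uncased"]
vecs = tfidf_vectors(names)
sim_same_family = cosine(vecs[0], vecs[1])
sim_other_family = cosine(vecs[0], vecs[2])
```

The two names sharing a backbone prefix and suffix come out far more similar than the unrelated pair, which is the signal that the downstream clustering relies on.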

3. Clustering and Phylogenetic Tree Construction

Clustering and tree-construction algorithms transform extracted features into concrete genealogical relations:

Methodology	Feature type	Clustering / tree construction
TF-IDF on model names	N-gram substrings	Hierarchical, agglomerative, Louvain communities
Output genome (PhyloLM)	Token probabilities	Neighbor-Joining (NJ) dendrogram on Nei distance
Functional embedding (LLM DNA)	Semantic vectors	Neighbor-Joining (NJ) tree on Euclidean DNA distance

Hierarchical clustering (single-linkage, cosine distance) in Gao et al. (2023) produces dendrograms that reveal model subgroups by backbone, size, or training convention.
The Louvain method, also used in Gao et al. (2023), detects communities in a graph whose edges connect pairs of models with high similarity.
PhyloLM (Yax et al., 6 Apr 2024) and LLM DNA (Wu et al., 29 Sep 2025) construct trees with the Neighbor-Joining (NJ) algorithm, which repeatedly joins the pair of nodes that most reduces the estimated total branch length, so that the tree reflects “evolutionary effort.”
Agglomerative methods and spectral clustering are used for semantic decomposition, as in SELT (Wu et al., 9 Jun 2025), to break reasoning tasks into meaningful atomic groups for further analysis.
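The Neighbor-Joining step can be sketched in a few lines once a pairwise distance matrix is available. The following is a textbook-style implementation written for illustration, not code from PhyloLM or LLM DNA; the function name, the `node0`, `node1`, ... labels for internal nodes, and the five-taxon toy matrix are assumptions (the toy distances are additive, so the tree is recovered exactly).

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Build an unrooted tree from symmetric pairwise distances.

    labels: list of leaf names (assumed not to clash with 'node<k>').
    dist:   dict of dicts with dist[a][b] == dist[b][a] for a != b.
    Returns a list of edges (node_a, node_b, branch_length).
    """
    d = {a: dict(dist[a]) for a in labels}
    nodes = list(labels)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other active nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing the NJ criterion Q(i, j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Branch lengths from i and j to the new internal node u.
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        u = f"node{counter}"
        counter += 1
        edges.append((i, u, len_i))
        edges.append((j, u, len_j))
        # Distances from u to every remaining node.
        d[u] = {}
        for k in nodes:
            if k in (i, j):
                continue
            d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Toy additive distance matrix over five hypothetical models.
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
toy = {x: {} for x in "abcde"}
for (x, y), value in pairs.items():
    toy[x][y] = toy[y][x] = float(value)

edges = neighbor_joining(list("abcde"), toy)
total_length = sum(length for _, _, length in edges)
```

The same function accepts Nei distances or Euclidean distances between DNA vectors; only the input matrix changes.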

4. Evolutionary Insights: Speed, Lineage, and Undocumented Relationships

These evolutionary trees yield diverse, nontrivial insights about LLM development:

In LLM DNA (Wu et al., 29 Sep 2025), branch lengths quantify “evolutionary speed”: the Qwen and Gemma families exhibit longer branches (faster functional change) than Llama. The tree also separates early encoder–decoder models from recent decoder-only models, consistent with known architectural shifts.
PhyloLM (Yax et al., 6 Apr 2024) recovers known training and finetuning lineages, clustering Llama sub-branches by version and finetuning set.
DNA-based distances uncover previously undocumented relationships (e.g., close fine-tuning proximity of models that Hugging Face documentation does not explicitly connect).
On the Origin of LLMs (Gao et al., 2023) reveals that systematic naming conventions correlate strongly with architectural and training similarities, even though the method uses no model internals.

A plausible implication is that evolutionary tree analysis can identify “hidden lineages” and verify claimed ancestry or functional inheritance—crucial for robust model management and trust.

5. Practical Applications and Benchmark Prediction

Phylogenetic and DNA-based representations have operational significance:

PhyloLM (Yax et al., 6 Apr 2024) demonstrates that the location of a model in phylogenetic space correlates strongly with standardized benchmark performance (MMLU, ARC). Using Multidimensional Scaling (MDS) and an MLP regressor, benchmark scores can be predicted quantitatively from output-genome distances.
LLM DNA (Wu et al., 29 Sep 2025) achieves competitive or superior task clustering and predictive power compared to prior lineage-based studies, even on large diverse model sets.
Constellation (Gao et al., 2023) enables real-time search, indexing, and exploration of model relationships for practitioners (e.g., model selection guided by evolutionary proximity).

This suggests that functional lineage extraction may serve as a cost-effective proxy for exhaustive benchmarking and as a tool for performance forecasting.
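The idea of using lineage distance as a proxy for benchmarking can be illustrated with a much simpler predictor than the MDS-plus-MLP pipeline described above. The sketch below swaps in a distance-weighted nearest-neighbor estimate: a new model's score is the average of its closest relatives' scores, weighted by inverse distance. The function name, model names, distances, and scores are all hypothetical toy values, not results from any of the cited papers.

```python
def predict_score(target, known_scores, dist, k=2, eps=1e-9):
    """Estimate a benchmark score from the k phylogenetically closest models.

    known_scores: dict model -> measured score.
    dist:         dict of dicts, dist[target][model] is a lineage distance.
    """
    # Closest benchmarked relatives of the target model.
    neighbors = sorted(known_scores, key=lambda m: dist[target][m])[:k]
    # Inverse-distance weights; eps avoids division by zero for identical models.
    weights = [1.0 / (dist[target][m] + eps) for m in neighbors]
    weighted = sum(w * known_scores[m] for w, m in zip(weights, neighbors))
    return weighted / sum(weights)


# Toy values (made up for illustration only).
toy_dist = {"new_model": {"model_a": 0.1, "model_b": 0.2, "model_c": 2.0}}
toy_scores = {"model_a": 0.60, "model_b": 0.70, "model_c": 0.30}

estimate = predict_score("new_model", toy_scores, toy_dist, k=2)
```

With `k=2` the distant `model_c` is ignored and the estimate falls between the scores of the two close relatives, nearer to the closer one.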

6. Extensions: Agent-Based Models and Evolutionary Algorithms

Recent works have extended the evolutionary tree paradigm to domains beyond model selection:

Evolutionary thoughts (Yepes et al., 9 May 2025) integrates LLMs with evolutionary algorithms (EAs), using LLMs for task-specific seed generation and guided mutation. This improves convergence, yields efficient candidate programs, and allows for adaptive exploration in complex search spaces via LLM feedback.
Evolutionary ecology of words (Suzuki et al., 9 May 2025) models word evolution in agent-based spatial ecosystems, with agents competing, mutating, and adapting words/phrases via LLM mediation. The evolutionary dynamics produce open-ended diversity and punctuated equilibria similar to natural ecosystems, offering new paradigms for studying the evolution of linguistic behaviors and model strategies.

A plausible implication is that viewing LLMs as agents or meta-search heuristics opens avenues for exploring their adaptability, resilience, and emergent behaviors in synthetic evolutionary contexts.
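The coupling between an evolutionary algorithm and an LLM can be pictured as an ordinary evolutionary loop whose seed and mutation operators are pluggable. The sketch below is a generic illustration, not the method of Yepes et al.: `evolve`, `random_seed`, and `random_flip` are hypothetical names, the operators are plain random stubs where an LLM call would sit, and the task (maximizing the number of ones in a bit string) is a toy stand-in for program search.

```python
import random


def evolve(seed_fn, mutate_fn, fitness_fn, pop_size=20, generations=50, rng=None):
    """Elitist evolutionary loop with pluggable seed and mutation operators.

    seed_fn(rng)           -> new individual (an LLM could propose seeds here).
    mutate_fn(parent, rng) -> child (an LLM could propose guided edits here).
    Returns the best individual and the best fitness seen per generation.
    """
    rng = rng or random.Random(0)
    population = [seed_fn(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness_fn, reverse=True)
        history.append(fitness_fn(population[0]))
        # Keep the top half unchanged, refill the rest with mutated copies.
        parents = population[: pop_size // 2]
        children = [mutate_fn(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness_fn), history


# Stub operators standing in for LLM calls (assumptions for illustration).
def random_seed(rng, length=16):
    return [rng.randint(0, 1) for _ in range(length)]


def random_flip(parent, rng):
    child = list(parent)
    index = rng.randrange(len(child))
    child[index] ^= 1  # flip one bit
    return child


best, history = evolve(random_seed, random_flip, fitness_fn=sum)
```

Because the top half of each generation is carried over unchanged, the best fitness never decreases; replacing the stubs with LLM-backed operators changes how candidates are proposed but not the structure of the loop.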

7. Limitations and Future Directions

Each methodological approach exhibits specific constraints:

Analysis based on model names (e.g., Gao et al., 2023) depends on systematic, non-colliding nomenclature; it may fail where naming conventions are arbitrary or underspecified.
PhyloLM’s output-based genome approach (Yax et al., 6 Apr 2024) avoids model internals but requires substantial query budgets and careful task selection to avoid contamination.
LLM DNA (Wu et al., 29 Sep 2025) assumes the embedding method captures functional similarity robustly, a nontrivial technical requirement given the diversity of LLM outputs.

Directions for refinement include integrating training metadata, expanding functional benchmarks, enhancing embedding techniques, and deploying trees as live “atlases” that update in concert with the fast-moving LLM field. Broadly, the transfer of genetic and phylogenetic methodologies to LLMs is facilitating rigorous, scalable approaches for understanding and managing the evolutionary dynamics of large-scale model ecosystems.