
Weisfeiler-Lehman Kernel Overview

Updated 23 February 2026
  • Weisfeiler-Lehman Kernel is a graph similarity measure based on iterative color refinement that encodes local and global graph patterns.
  • It extends basic subtree counting to incorporate optimal assignment, soft similarity, and contextual information for improved expressiveness.
  • Efficient implementations leverage hashing and explicit feature maps, while advanced variants address continuous attributes and scalability challenges.

The Weisfeiler-Lehman (WL) kernel is a central paradigm in graph learning, originating from graph isomorphism heuristics and broadly employed as a graph similarity measure in kernel-based machine learning. This family of kernels leverages the iterative Weisfeiler-Lehman color refinement procedure to construct high-dimensional, count-based, and often explicitly computable feature maps that capture both local and global graph structure. Modern developments generalize this approach along several axes, moving from strict subtree pattern counting to optimal assignment, optimal transport, and soft-similarity regimes, while addressing limitations in expressiveness, scalability, and handling continuous or attributed graphs.

1. Algorithmic Foundation: WL Color Refinement and Subtree Kernels

At its core, the WL kernel is built upon the 1-dimensional WL (1-WL) color refinement process. For a labeled graph $G = (V, E, \ell_0)$, the algorithm conducts $h$ rounds of node label updates, each round replacing every node's label with a compressed representation that incorporates its current label and the multiset of its neighbors' latest labels:

$$\ell_{i+1}(v) = \mathrm{hash}\bigl(\ell_i(v),\ \{\!\!\{\ell_i(u) : u \in N(v)\}\!\!\}\bigr)$$

with a suitable injective hashing or relabeling function ensuring label uniqueness.

After $h$ iterations, for each node $v \in V$, the labels $\ell_0(v), \ldots, \ell_h(v)$ encode increasingly large rooted subtree neighborhoods. The standard WL feature map $\varphi(G)$ is constructed by collecting, for every label at every iteration, how many nodes in $G$ exhibit that label. The classical WL kernel between graphs $G$ and $G'$ is

$$K_{\mathrm{WL}}(G, G') = \langle \varphi(G), \varphi(G') \rangle = \sum_{i=0}^{h} \sum_{\sigma \in \Sigma^i} \#\{v : \ell_i(v) = \sigma\} \cdot \#\{v' : \ell_i(v') = \sigma\}$$

where $\Sigma^i$ is the set of all labels encountered at iteration $i$ (Narayanan et al., 2016, Kriege, 2022, Togninalli et al., 2019).

This approach is computationally efficient: each iteration costs $O(|E|)$ (more precisely, $O(|E| \log d_{\max})$ if neighbor multisets are sorted), and the explicit feature map allows for rapid kernel matrix construction (Narayanan et al., 2016).
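
The refinement and kernel above can be sketched in a few lines of Python. Nested tuples stand in for the injective hash function, and `wl_relabel`/`wl_kernel` are illustrative names for this sketch, not a library API:

```python
from collections import Counter

def wl_relabel(adj, labels, h):
    """Run h rounds of 1-WL color refinement on one graph.

    adj    : dict node -> list of neighbor nodes
    labels : dict node -> initial (hashable) label
    Returns a Counter over all labels seen at iterations 0..h,
    i.e. the explicit WL feature map as a sparse histogram.
    """
    hist = Counter(labels.values())
    for _ in range(h):
        # New label = (old label, sorted multiset of neighbor labels);
        # the tuple itself acts as an injective "hash" of the pair.
        labels = {
            v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            for v in adj
        }
        hist.update(labels.values())
    return hist

def wl_kernel(adj1, lab1, adj2, lab2, h=3):
    """K_WL(G, G') as the inner product of the two WL histograms."""
    h1 = wl_relabel(adj1, lab1, h)
    h2 = wl_relabel(adj2, lab2, h)
    return sum(h1[s] * h2[s] for s in h1.keys() & h2.keys())
```

Production implementations compress the nested tuples back into small integers after each round, which keeps every iteration linear in the number of edges.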

2. Extensions: Expressiveness, Context and Assignments

The structural features encoded by 1-WL are limited in that some non-isomorphic graphs (notably regular and certain symmetric graphs) are not distinguishable. To overcome this, several approaches extend the core WL kernel framework:

  • Contextual WL Kernel (CWLK): Integrates additional context information (such as user-awareness in malware detection graphs) into the label updates by prefixing or embedding context tags, yielding more expressive kernels particularly suited for attributed graphs with side-information. The CWLK does not significantly increase the computational overhead (Narayanan et al., 2016).
  • Higher-Order and Generalized WL: The $k$-WL framework updates labels of $k$-tuples rather than single nodes, thereby capturing higher-order interactions among nodes, at the price of exponential growth in computational cost. To mitigate this, local formulations (e.g., the δ-2-LWL⁺) restrict neighborhood consideration to actual edges rather than all possible replacements, retaining much of the higher-order expressivity at tractable complexity (Morris et al., 2019).
  • Optimal Assignment Kernels: Rather than using simple bag-of-labels inner products, these kernels compute the maximal matching or assignment of vertex features between graphs, possibly incorporating label hierarchies and learned weights, as in the Deep WL Optimal Assignment kernel. These formulations are made positive semidefinite (PSD) by restricting base kernels to those induced by hierarchies (“strong” kernels), and may be efficiently realized via histogram intersections across the hierarchy (Kriege, 2019, Bause et al., 2022).
  • Soft and Similarity-Aware Comparisons: Generalizations replace the hard equality in subtree comparison with measures such as tree edit-distances, quantized Wasserstein distances, or filtered existence intervals, yielding a continuum from highly rigid to fully soft similarity measures and strictly increasing the expressive capacity on difficult benchmarks (Schulz et al., 2021, Schulz et al., 2021, Togninalli et al., 2019).
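
For the optimal assignment variant above, hierarchy-induced ("strong") base kernels make the optimal vertex matching computable without solving an assignment problem: it reduces to a histogram intersection. A minimal sketch, assuming the WL label histograms (pooled over all refinement rounds) have already been computed as `collections.Counter` objects:

```python
from collections import Counter

def wl_assignment_kernel(hist1, hist2):
    """Optimal-assignment WL kernel via histogram intersection.

    hist1, hist2 : Counters of WL labels pooled over all refinement
    rounds.  For hierarchy-induced ("strong") base kernels the value
    of the optimal vertex assignment equals the sum of per-label
    minima, which keeps the kernel positive semidefinite.
    """
    return sum(min(hist1[s], hist2[s]) for s in hist1.keys() & hist2.keys())
```

Compared with the inner product of histograms, the minimum saturates the contribution of frequent labels, which is precisely the matching behavior of an assignment.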

3. Handling Continuous Attributes, Filtration, and Subgraphs

Classical WL is agnostic to continuous node attributes and edge weights. Several recent approaches resolve this:

  • Wasserstein and Sliced-Wasserstein WL Kernels: The Wasserstein WL (WWL) family generalizes node label propagation to node feature diffusion with linear updates, concatenates multiple iterations, and compares distributions of final embeddings via Wasserstein or Sliced Wasserstein metrics. The Sliced-Wasserstein variant (SWWL) reduces the computational complexity from cubic (for full Wasserstein) to near linear (in the number of nodes and projections), rigorously preserving positive-definiteness and enabling scalability to extremely large graphs (Togninalli et al., 2019, Perez et al., 2024).
  • Graph Filtration Kernels: Filtration-based kernels generate a sequence of nested subgraphs (for example, by edge-weight or connectivity thresholds), track the birth/death intervals of each feature (e.g., a WL label), and compare graphs on the basis of “persistence” of labels across filtrations using stable metrics such as 1-Wasserstein distances on feature histograms. This procedure can be proven to strictly increase isomorphism-discriminating power over vanilla WL and can be computed in $O(hkm)$ for $k$ filtration steps (Schulz et al., 2021).
  • WL Kernels for Subgraph Analysis: The WLKS framework extends WL to subgraph-level tasks by applying color refinement to subgraph-centric induced neighborhoods up to $k$ hops, and then combines kernels for multiple radii to close expressiveness gaps. Selection of appropriate radii (e.g., $\{0, D\}$, with $D$ the graph diameter) provides a favorable expressiveness-efficiency tradeoff, outperforming competing graph neural network-based baselines in subgraph classification tasks (Kim et al., 2024).
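
The Sliced-Wasserstein comparison of node-embedding distributions mentioned above can be sketched as follows. This is a simplified, unweighted version: `sliced_wasserstein` is an illustrative helper, and discretizing the quantile functions on a fixed grid is an assumption of this sketch rather than part of the published method:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Sliced 1-Wasserstein distance between two node-embedding sets.

    X, Y : (n, d) and (m, d) arrays of per-node embeddings (e.g. the
    concatenated WL/diffusion features of each node).  Each random
    direction reduces both point clouds to 1-D, where W1 is the mean
    absolute difference of the empirical quantile functions.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    qs = np.linspace(0.0, 1.0, 101)                      # common quantile grid
    total = 0.0
    for w in dirs:
        qx = np.quantile(X @ w, qs)
        qy = np.quantile(Y @ w, qs)
        total += np.abs(qx - qy).mean()
    return total / n_proj
```

A valid graph kernel is then obtained by plugging this distance into, for example, a Laplacian kernel `exp(-gamma * d)`; the key computational point is that each projection only requires sorting, giving near-linear cost per graph pair.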

4. Adaptive and Learned Feature Weighting

A common theme is the refinement of feature importances beyond uniform label counts. Weighted WL kernels incorporate learnable weights $w_\sigma$ for each subtree pattern, either via direct convex metric-learning objectives (stochastic projected gradient descent with provable generalization bounds) or multiple kernel learning over hierarchical label groupings. Learned weights are typically sparse, favoring generalizable and interpretable models while down-weighting over-specific rare patterns that may overfit (Nguyen et al., 2021, Kriege, 2019).
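
Once the weights $w_\sigma$ are learned, evaluating the weighted kernel is a one-line change to the plain inner product. A minimal sketch, assuming WL label histograms are available as `collections.Counter` objects and `weights` is the learned (possibly sparse) weight map:

```python
from collections import Counter

def weighted_wl_kernel(hist1, hist2, weights, default=0.0):
    """Weighted WL subtree kernel: sum_sigma w_sigma * phi_sigma(G) * phi_sigma(G').

    hist1, hist2 : Counters of WL labels (the feature maps phi).
    weights      : dict label -> learned non-negative weight w_sigma.
    Labels absent from `weights` get `default`; with default=0.0 a
    sparse weight vector drops over-specific rare patterns entirely.
    """
    return sum(
        weights.get(s, default) * hist1[s] * hist2[s]
        for s in hist1.keys() & hist2.keys()
    )
```

With all weights equal to 1 this recovers the classical WL kernel, so the learned model is a strict generalization of the uniform-count baseline.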

Empirical studies demonstrate on molecular, social network, and synthetic benchmarks that adaptive weighting of subtree features yields consistent, statistically significant improvements in graph classification accuracy, robust interpretability, and scalability to thousands of graphs (Nguyen et al., 2021, Kriege, 2019).

5. Connections to Other Kernel Paradigms

The WL kernel is intricately connected to other prominent graph kernel motifs:

  • Random Walk Kernels: The classical random walk kernel class counts matching walks in direct product graphs. By incorporating fine-grained node-centric walk profile comparison and appropriate strictness parameters, these walk-based kernels can match or surpass WL expressiveness while admitting a continuum from soft to strict matching. Efficient computation is achieved via kernel-trick/factorized product graph approaches (Kriege, 2022).
  • Neighborhood-Preserving and Product-Graph Kernels: Some variants (e.g., Neighborhood-Preserving kernels) recover the WL subtree kernel exactly as a special case in a broader R-convolution setting when restricted to self-loop “edges” and constant attribute kernels, embedding the WL kernel firmly in Haussler’s R-convolution framework. The recursive product graph formulation also offers a compact perspective on the iterative color refinement of WL (Salim et al., 2020).
  • Edit and Tree Kernel Connections: Generalized WL kernels match subtree patterns via tree edit distances or other structure-aware similarity metrics, providing robustness to noise, continuous features, and high-density motifs that often defeat strict label-based matching (Schulz et al., 2021, Schulz et al., 2021).
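
The direct product graph construction underlying the random walk kernels above has a very compact realization: walks of length $i$ in the Kronecker product of the two adjacency matrices correspond one-to-one to pairs of matching walks in the original graphs. A minimal unlabeled sketch (ignoring node labels and decay weights, which the full kernel includes):

```python
import numpy as np

def k_step_walk_kernel(A1, A2, k=3):
    """k-step random walk kernel via the direct product graph.

    A1, A2 : adjacency matrices (dense numpy arrays).  Summing the
    entries of the powers of the Kronecker product counts, for each
    length 0..k, the pairs of matching walks in the two graphs.
    """
    Ax = np.kron(A1, A2)          # adjacency of the direct product graph
    val = 0.0
    P = np.eye(Ax.shape[0])       # Ax**0
    for _ in range(k + 1):
        val += P.sum()            # number of matching walks of this length
        P = P @ Ax
    return val
```

In practice the explicit Kronecker product is avoided via factorized linear-algebra tricks, since `Ax` has $|V_1||V_2|$ rows.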

6. Computational Complexity and Practical Performance

All major WL kernel variants share core efficiency properties: the base 1-WL kernel admits per-iteration costs linear or near-linear in the number of edges. Efficient implementations leverage hashing, sorting, and explicit feature representations, with extensions for attributed graphs (continuous attributes, edge weights) incurring only moderate polynomial overhead.

More complex kernels (e.g., those using optimal transport, assignment, or generalized edit distances) have cost quadratic or cubic in the number of nodes or patterns, but can be made tractable via histogram-based shortcuts, Sliced-Wasserstein projections, or product-graph sparsification (Perez et al., 2024, Kriege, 2019, Schulz et al., 2021).

Empirical results on standard benchmarks consistently show that WL-based kernels set the baseline for accuracy, scalability, and interpretability in graph classification and are frequently competitive or dominant relative to more complex neural architectures, especially for small to medium-sized graphs (Morris et al., 2019, Togninalli et al., 2019, Schulz et al., 2021, Kim et al., 2024).

7. Theoretical Expressiveness and Limitations

The expressiveness of WL kernels is formally characterized in terms of their indistinguishability with respect to graph isomorphism. The vanilla 1-WL is known to fail on certain graph pairs (e.g., regular or CFI graphs), but higher-order generalizations, graph filtrations, and assignment/proximity-based kernels can match or exceed 1-WL’s distinguishing power. Theoretical analyses (e.g., completeness via filtrations, hierarchy-weighted assignment kernels) establish which extensions are true strict generalizations and under what structural conditions expressiveness is increased (Schulz et al., 2021, Morris et al., 2019, Kriege, 2019).
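
The standard failure case on regular graphs is easy to reproduce: a 6-cycle and two disjoint triangles are both 2-regular on six nodes, so every refinement round assigns all nodes the same color in both graphs and their WL histograms coincide, although the graphs are not isomorphic. A self-contained demonstration (nested tuples again play the role of the injective relabeling):

```python
from collections import Counter

def wl_histogram(adj, h=3):
    """Pooled 1-WL label histogram of an unlabeled graph: every node
    starts with the same color, then h refinement rounds."""
    lab = {v: 'x' for v in adj}
    hist = Counter(lab.values())
    for _ in range(h):
        lab = {v: (lab[v], tuple(sorted(lab[u] for u in adj[v]))) for v in adj}
        hist.update(lab.values())
    return hist

# C6 (one 6-cycle) vs. two disjoint triangles: non-isomorphic,
# but indistinguishable by 1-WL.
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_c3 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

Any kernel whose feature map factors through these histograms assigns this pair the same representation, which is exactly the gap the higher-order and filtration-based extensions close.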

Limitations persist in computational cost for higher-order or global variants and in overfitting for very large feature spaces; sparsification, feature grouping, and adaptive weighting schemes have been proposed to address these issues.


The works cited throughout the sections above provide a comprehensive theoretical and empirical foundation for the Weisfeiler-Lehman kernel and its ongoing evolution within graph representation learning.
