Nested Group Kernels in Hierarchical Models
- Nested group kernels are structured families of positive-definite functions that partition categorical levels into groups to model hierarchical interactions.
- They impose a block covariance structure that enables parsimonious parameterization, scalable inference, and improved statistical stability in mixed-input regression.
- They are applied in hierarchical Gaussian models and deep kernel architectures by seamlessly combining continuous and categorical kernels via block matrix formulations.
Nested group kernels are structured families of positive-definite kernel functions designed to model hierarchical interactions among input variables with group structure, primarily categorical variables with many levels. These kernel architectures impose a block structure on the covariance matrix, enabling parsimonious parameterization, scalable inference, and improved statistical stability. They have become central in Gaussian process (GP) regression for mixed continuous-categorical inputs, in hierarchical deep kernel learning, and as exact correspondents to group-pooling mechanisms in deep convolutional architectures.
1. Block-Matrix Formulation of Nested Group Kernels
A nested group kernel for a single categorical variable with $L$ levels partitions those levels into $G$ disjoint groups $\mathcal{G}_1, \dots, \mathcal{G}_G$ of sizes $n_1, \dots, n_G$, where $\mathcal{G}_g \cap \mathcal{G}_{g'} = \emptyset$ for $g \neq g'$ and $\bigcup_{g=1}^{G} \mathcal{G}_g = \{1, \dots, L\}$ (Roustant et al., 2018, Perez et al., 2 Oct 2025). The covariance matrix $T \in \mathbb{R}^{L \times L}$ corresponding to the kernel is constructed as a block matrix:

$$T = \begin{pmatrix} W_1 & B_{1,2} & \cdots & B_{1,G} \\ B_{2,1} & W_2 & \cdots & B_{2,G} \\ \vdots & \vdots & \ddots & \vdots \\ B_{G,1} & B_{G,2} & \cdots & W_G \end{pmatrix},$$

where $B_{g,g'} = c_{g,g'}\,\mathbf{J}_{n_g \times n_{g'}}$ for $g \neq g'$, and $\mathbf{J}$ is the all-ones matrix. Each diagonal block $W_g$ encodes within-group covariances. In the minimal compound-symmetry (CS) case, $W_g = (v_g - c_g)\,I_{n_g} + c_g\,\mathbf{J}_{n_g}$, reflecting constant within-group variance ($v_g$) and covariance ($c_g$). The resulting induced kernel may be combined with continuous-input kernels by product or sum.
For several categorical factors or mixed continuous-categorical inputs $(x, u_1, \dots, u_J)$, the full kernel is

$$k\big((x, u), (x', u')\big) = k_{\mathrm{cont}}(x, x') \prod_{j=1}^{J} T^{(j)}_{u_j,\, u'_j},$$

where each $T^{(j)}$ is constructed as above for the categorical component (Perez et al., 2 Oct 2025).
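As an illustrative sketch (not code from the cited papers; the function names, the RBF continuous kernel, and the single-categorical-factor setup are assumptions), the following Python snippet assembles the CS block covariance $T$ from a group partition and combines it with a continuous kernel by product:

```python
import numpy as np

def nested_group_cov(groups, v, c_within, C_between):
    """CS nested-group covariance T for a categorical variable.

    groups     : list of lists, levels (0..L-1) belonging to each of the G groups
    v          : length-G array of within-group variances v_g
    c_within   : length-G array of within-group covariances c_g
    C_between  : G x G symmetric matrix of between-group covariances c_{g,g'}
    """
    L = sum(len(g) for g in groups)
    T = np.empty((L, L))
    for g, Ig in enumerate(groups):
        for gp, Igp in enumerate(groups):
            if g == gp:
                # Diagonal block: (v_g - c_g) I + c_g J  (compound symmetry)
                block = c_within[g] * np.ones((len(Ig), len(Ig)))
                block[np.diag_indices(len(Ig))] = v[g]
            else:
                # Off-diagonal block: constant c_{g,g'} J
                block = C_between[g, gp] * np.ones((len(Ig), len(Igp)))
            T[np.ix_(Ig, Igp)] = block
    return T

def mixed_kernel(x, xp, u, up, T, lengthscale=1.0):
    """Product kernel: RBF on the continuous part times T[u, u'] on the categorical part."""
    diff = np.atleast_1d(x) - np.atleast_1d(xp)
    k_cont = np.exp(-0.5 * np.sum(diff ** 2) / lengthscale ** 2)
    return k_cont * T[u, up]

# Example: 5 levels split into 2 groups.
groups = [[0, 1, 2], [3, 4]]
T = nested_group_cov(groups,
                     v=np.array([1.0, 1.0]),
                     c_within=np.array([0.6, 0.4]),
                     C_between=np.array([[0.0, 0.2], [0.2, 0.0]]))
print(mixed_kernel(x=0.3, xp=0.5, u=1, up=4, T=T))
```

The explicit block loop keeps the structure visible; in practice the construction would be vectorized and $v_g$, $c_g$, $c_{g,g'}$ exposed as trainable hyperparameters.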
2. Hierarchical Statistical Interpretation and Generalization
Nested group kernels correspond exactly to hierarchical (multi-level) Gaussian models. The group/level structure arises as the covariance of $\eta_u = \mu_{g(u)} + \delta_u$, where the $\mu_g$ are group means and the $\delta_u$ are centered level effects (their covariance within each group satisfies $\operatorname{Cov}(\delta)\,\mathbf{1}_{n_g} = 0$) (Roustant et al., 2018):
- Between-group covariance: $\operatorname{Cov}(\eta_u, \eta_{u'}) = \operatorname{Cov}(\mu_g, \mu_{g'}) = c_{g,g'}$ for $u \in \mathcal{G}_g$, $u' \in \mathcal{G}_{g'}$ with $g \neq g'$.
- Within-group: $W_g = c_{g,g}\,\mathbf{J}_{n_g} + \Gamma_g$, with $\Gamma_g = \operatorname{Cov}(\delta)$ parameterized by a covariance expressed on the Helmert basis, or more generally by any positive-semidefinite, centered matrix.
Relaxing the CS assumption, each diagonal block $W_g$ may be any positive semi-definite block satisfying $W_g \mathbf{1}_{n_g} = \lambda_g \mathbf{1}_{n_g}$ for some $\lambda_g \geq 0$, where $\mathbf{1}_{n_g}$ denotes the all-ones vector of length $n_g$. This yields the Generalized Compound-Symmetry (GCS) block structure (Roustant et al., 2018, Perez et al., 2 Oct 2025):

$$W_g = c_{g,g}\,\mathbf{J}_{n_g} + A_g M_g A_g^{\top},$$

where $A_g$ is any orthonormal basis for the hyperplane orthogonal to $\mathbf{1}_{n_g}$, $M_g$ is positive semi-definite, and $\Gamma_g = A_g M_g A_g^{\top}$ is centered (its rows and columns sum to zero).
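A minimal sketch of the GCS construction (illustrative; the basis-building helper is an assumption, and any orthonormal contrast basis such as Helmert contrasts works equally well): build $A_g$, choose a PSD $M_g$, and assemble $W_g$; the centered part then has zero row and column sums by construction.

```python
import numpy as np

def contrast_basis(n):
    """Orthonormal basis A (n x (n-1)) of the hyperplane orthogonal to the all-ones vector.

    Obtained by QR-decomposing [1 | I] and dropping the first column;
    a Helmert basis is one standard alternative.
    """
    M = np.column_stack([np.ones(n), np.eye(n)])
    Q, _ = np.linalg.qr(M)
    return Q[:, 1:n]

def gcs_block(n, c_gg, M):
    """GCS diagonal block W_g = c_{g,g} J + A M A^T with M positive semi-definite."""
    A = contrast_basis(n)
    return c_gg * np.ones((n, n)) + A @ M @ A.T

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n - 1, n - 1))
M = B @ B.T                                   # arbitrary PSD matrix of size (n-1) x (n-1)
W = gcs_block(n, c_gg=0.5, M=M)

Gamma = W - 0.5 * np.ones((n, n))             # centered part A M A^T
print(np.allclose(Gamma.sum(axis=0), 0.0))    # rows/columns of the centered part sum to zero
print(np.all(np.linalg.eigvalsh(W) >= -1e-10))  # W_g is PSD whenever c_{g,g} >= 0
```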
3. Positive-Definiteness via Averaged Block Matrix
The crucial positive-definiteness test for $T$ relies on the associated "block-mean" matrix $\bar{T} \in \mathbb{R}^{G \times G}$ (Roustant et al., 2018, Perez et al., 2 Oct 2025):

$$\bar{T}_{g,g'} = \frac{1}{n_g n_{g'}}\,\mathbf{1}_{n_g}^{\top}\, T_{[g,g']}\, \mathbf{1}_{n_{g'}}, \qquad 1 \le g, g' \le G,$$

where $T_{[g,g']}$ denotes the $(g,g')$ block of $T$; each entry of $\bar{T}$ is thus the average of the entries of the corresponding block.
Theorem (Roustant et al.): for a GCS block matrix, $T \succeq 0$ iff $\bar{T} \succeq 0$; $T \succ 0$ iff $\bar{T} \succ 0$ and each diagonal block $W_g$ is positive definite on the hyperplane orthogonal to $\mathbf{1}_{n_g}$.
This result ensures scalability: inference and positivity checks reduce to matrices of size $G$ (the number of groups), not $L$ (the number of levels), critically enabling use with variables of dozens or hundreds of levels grouped into relatively few blocks.
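A minimal sketch of that test (illustrative; the helper names are not from the papers): compute the $G \times G$ block-mean matrix and check its eigenvalues, together with the projected within-group blocks, instead of factorizing the full $L \times L$ matrix.

```python
import numpy as np

def block_mean_matrix(T, groups):
    """G x G matrix of block averages of T, given the level indices of each group."""
    G = len(groups)
    T_bar = np.empty((G, G))
    for g, Ig in enumerate(groups):
        for gp, Igp in enumerate(groups):
            T_bar[g, gp] = T[np.ix_(Ig, Igp)].mean()
    return T_bar

def is_valid_nested_group_cov(T, groups, tol=1e-10):
    """PSD test at the group level (size G), plus PSD of each diagonal block
    on the hyperplane orthogonal to the all-ones vector."""
    T_bar = block_mean_matrix(T, groups)
    if np.min(np.linalg.eigvalsh(T_bar)) < -tol:
        return False
    for Ig in groups:
        W = T[np.ix_(Ig, Ig)]
        n = len(Ig)
        P = np.eye(n) - np.ones((n, n)) / n      # projector onto the complement of span(1)
        if np.min(np.linalg.eigvalsh(P @ W @ P)) < -tol:
            return False
    return True
```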
4. Clustering-Based Nested Group Kernels for Unknown Structures
When group structure is unknown, levels are partitioned using automatic clustering (Perez et al., 2 Oct 2025). Each level $\ell$ is encoded by its observed conditional mean and standard deviation (MSD encoding):

$$\phi(\ell) = \big(\hat{\mu}_\ell,\; \hat{\sigma}_\ell\big),$$

where $\hat{\mu}_\ell$ and $\hat{\sigma}_\ell$ are the empirical mean and standard deviation of the responses observed at level $\ell$.
A distance between levels is then defined in this encoding space, and hierarchical clustering is applied, with the number of clusters selected by maximizing the mean Silhouette score over the candidate numbers of groups. The resulting partition is used to instantiate the block kernel in nested structure.
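A sketch of this clustering step using scikit-learn (an assumed tooling choice; the cited work may implement it differently, and at least three distinct levels are assumed so that several candidate values of $G$ exist):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def msd_encode(levels, y):
    """Encode each categorical level by the mean and std of its observed responses."""
    levels, y = np.asarray(levels), np.asarray(y)
    uniq = np.unique(levels)
    enc = np.array([[y[levels == u].mean(), y[levels == u].std()] for u in uniq])
    return uniq, enc

def cluster_levels(levels, y, g_max=8):
    """Hierarchical clustering of MSD-encoded levels; G chosen by mean Silhouette score."""
    uniq, enc = msd_encode(levels, y)
    best_G, best_score, best_labels = None, -np.inf, None
    for G in range(2, min(g_max, len(uniq) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=G).fit_predict(enc)
        score = silhouette_score(enc, labels)
        if score > best_score:
            best_G, best_score, best_labels = G, score, labels
    # groups hold positions into uniq; they define the partition used by the block kernel
    groups = [list(np.where(best_labels == g)[0]) for g in range(best_G)]
    return uniq, groups
```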
Alternatively, pseudo-distances from a pilot kernel (e.g., LVGP) may inform clustering, provided the kernel supplied is positive-definite.
5. Algorithmic Implementation and Hyperparameter Estimation
Parameter estimation for nested group kernels involves learning both continuous and categorical kernel components. For the categorical blocks:
- CS parameterization: 2 parameters per group (within-group variance $v_g$ and covariance $c_g$), plus $G(G-1)/2$ parameters (between-group covariances $c_{g,g'}$), or more generally, the entries of each centered within-group covariance $\Gamma_g$ and of the block-mean matrix $\bar{T}$.
- Optimization: Maximization of log-marginal likelihood using L-BFGS-B (SciPy), with extensive multi-start restarts (up to 96), either in "long" or "short" settings depending on computational constraints (Perez et al., 2 Oct 2025).
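A compact sketch of this estimation loop (illustrative; the exact parameterization, bounds, and restart schedule in the cited work may differ). Here `K_fn` is a user-supplied function mapping a hyperparameter vector to the full covariance matrix (e.g., a continuous length-scale plus the CS parameters $v_g$, $c_g$, $c_{g,g'}$); validity of the nested-group block should be enforced inside it, e.g., via the block-mean test of Section 3.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marglik(theta, K_fn, y):
    """Negative log-marginal likelihood of a zero-mean GP with covariance K(theta)."""
    K = K_fn(theta) + 1e-8 * np.eye(len(y))        # jitter for numerical stability
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e10                                # invalid hyperparameters: K not PD
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

def fit_multistart(K_fn, y, bounds, n_restarts=96, seed=0):
    """Multi-start L-BFGS-B maximization of the log-marginal likelihood."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        res = minimize(neg_log_marglik, theta0, args=(K_fn, y),
                       method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best
```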
For deep kernel architectures (SVM context), nested grouping is achieved by stacking several layers, each linearly combining groupwise kernels from the previous layer:

$$k^{(l+1)}_{j}(x, x') = \sum_{m \in \mathcal{G}^{(l)}_{j}} \theta^{(l)}_{j,m}\, k^{(l)}_{m}(x, x'),$$
with nonnegativity constraints and optional normalization. The optimization objective minimizes a smoothed span-bound, an upper bound on SVM leave-one-out error, offering superior generalization on limited data (Strobl et al., 2013).
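An illustrative sketch of the layered combination (not the original implementation of Strobl et al.; the smoothed span-bound optimization is omitted and only the forward composition of Gram matrices is shown):

```python
import numpy as np

def rbf_gram(X, gamma):
    """RBF Gram matrix, used here as a base kernel."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def deep_group_kernel(base_grams, layers):
    """Compose layers of nonnegative, groupwise linear combinations of kernels.

    base_grams : list of Gram matrices (layer 0)
    layers     : list of layers; each layer is a list of (group_indices, weights)
                 pairs, producing one output kernel per pair.
    """
    grams = base_grams
    for layer in layers:
        new_grams = []
        for group_idx, weights in layer:
            w = np.clip(np.asarray(weights, dtype=float), 0.0, None)   # nonnegativity
            K = sum(wi * grams[i] for wi, i in zip(w, group_idx))
            new_grams.append(K / np.trace(K) * len(K))                 # optional trace normalization
        grams = new_grams
    return grams

# Example: 4 base RBF kernels, grouped 2+2 in layer 1, merged in layer 2.
X = np.random.default_rng(0).standard_normal((20, 3))
base = [rbf_gram(X, g) for g in (0.1, 0.5, 1.0, 2.0)]
layers = [[((0, 1), (0.7, 0.3)), ((2, 3), (0.5, 0.5))],
          [((0, 1), (1.0, 1.0))]]
K_top = deep_group_kernel(base, layers)[0]
print(K_top.shape)
```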
For DCN-style group kernels, each layer pools responses over group transformations, invoking hierarchical composition of group-averaged arc-cosine kernels. The layer-$\ell$ kernel recursively composes group averages and rectified nonlinearities (Anselmi et al., 2015).
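A hedged sketch of such a layer, assuming a finite transformation group of cyclic shifts and the order-1 arc-cosine form of Cho and Saul as the rectified nonlinearity (both are illustrative assumptions, not the exact construction of Anselmi et al.):

```python
import numpy as np

def arccos_compose(kxx, kyy, kxy):
    """Order-1 arc-cosine composition: the kernel induced by a ReLU layer on top of
    a previous kernel with values k(x,x), k(y,y), k(x,y)."""
    theta = np.arccos(np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0))
    return (1.0 / np.pi) * np.sqrt(kxx * kyy) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def group_average(k, x, y, transforms):
    """Average a kernel over a finite transformation group acting on the second argument."""
    return np.mean([k(x, t(y)) for t in transforms])

def layered_group_kernel(x, y, depth, transforms):
    """Each layer: group-average the previous kernel, then apply the ReLU (arc-cosine) step."""
    k = lambda a, b: a @ b                      # layer 0: linear kernel
    for _ in range(depth):
        prev = k
        k = lambda a, b, prev=prev: arccos_compose(
            group_average(prev, a, a, transforms),
            group_average(prev, b, b, transforms),
            group_average(prev, a, b, transforms),
        )
    return k(x, y)

# Example: the group of cyclic shifts acting on 1-D signals.
shifts = [lambda v, s=s: np.roll(v, s) for s in range(4)]
x = np.array([1.0, -0.5, 0.3, 0.8])
y = np.array([0.2, 0.9, -0.4, 0.1])
print(layered_group_kernel(x, y, depth=2, transforms=shifts))
```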
6. Empirical Performance and Applications
Empirical studies demonstrate robust advantages for nested group kernels across tasks with categorical variables exhibiting group structure (Roustant et al., 2018, Perez et al., 2 Oct 2025). Notable results include:
- Outperformance of one-hot and CS kernels in both accuracy (median RRMSE) and efficiency, with parsimonious parameter counts (e.g., ~30 nested-group parameters vs. ~4000 for a full covariance over a 94-level factor) (Roustant et al., 2018).
- Effective automatic group extraction for unknown structures, placing Nested He/He + MSD encoding, or alternative clustering-based nested kernels, on the Pareto front in accuracy/training time tradeoffs.
- Consistent empirical dominance in Area Under Curve (AUC) of performance profiles across diverse datasets.
- In the nuclear waste inverse problem, block-structured modeling yielded +10% relative improvement on test points, with greater stability across training splits.
For deep multiple kernel learning, increasing nesting layers yields incremental accuracy improvement, with diminishing but non-negligible returns, and well-controlled generalization even with only a few base kernels per layer (Strobl et al., 2013).
In hierarchical convolutional networks, the nested group kernel formalism provides an exact algebraic equivalence for invariance/selectivity and memory minimization conjectured for DCN pooling layers (Anselmi et al., 2015).
7. Significance and Theoretical Implications
Nested group kernels unify statistical, algebraic, and learning-theoretic perspectives:
- They allow for both positive and negative within-group correlations, supporting richer random-effects and prior structures.
- Positivity conditions and kernel validity are efficiently enforced at the group-mean (block-average) level, ensuring practicality for high-cardinality categorical features.
- Hierarchical nesting, whether in block matrices for GP regression or layers in deep kernel networks, preserves positive definiteness and enables selective/invariant representations with minimal memory cost.
- The approach facilitates integration with any continuous-input kernel via product, sum, or ANOVA mechanisms.
A plausible implication is that such kernels, by leveraging block-structured regularization and efficient parameterization, provide a principled route to scalable, interpretable, and statistically efficient model construction for both GP and deep kernel learners in settings with complex categorical data.