
Nested Group Kernels in Hierarchical Models

Updated 29 January 2026
  • Nested group kernels are structured families of positive-definite functions that partition categorical levels into groups to model hierarchical interactions.
  • They impose a block covariance structure that enables parsimonious parameterization, scalable inference, and improved statistical stability in mixed-input regression.
  • They are applied in hierarchical Gaussian models and deep kernel architectures by seamlessly combining continuous and categorical kernels via block matrix formulations.

Nested group kernels are structured families of positive-definite kernel functions designed to model hierarchical interactions among input variables with group structure, primarily categorical variables with many levels. These kernel architectures impose a block structure on the covariance matrix, enabling parsimonious parameterization, scalable inference, and improved statistical stability. They have become central in Gaussian process (GP) regression for mixed continuous-categorical inputs, in hierarchical deep kernel learning, and as exact correspondents to group-pooling mechanisms in deep convolutional architectures.

1. Block-Matrix Formulation of Nested Group Kernels

A nested group kernel for a single categorical variable partitions the variable's $L$ levels into $G \ll L$ disjoint groups $G_1,\dots,G_G$, where $|G_g| = n_g$ and $\sum_{g=1}^G n_g = L$ (Roustant et al., 2018, Perez et al., 2 Oct 2025). The covariance matrix $T$ corresponding to the kernel is constructed as a $G \times G$ block matrix:

$$T = \begin{pmatrix} W_1 & B_{1,2} & \cdots & B_{1,G} \\ B_{2,1} & W_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & B_{G-1,G} \\ B_{G,1} & \cdots & B_{G,G-1} & W_G \end{pmatrix}$$

where for $g \neq h$, $B_{g,h} = c_{g,h} J_{n_g, n_h}$ and $J_{n_g, n_h}$ is the $n_g \times n_h$ all-ones matrix. Each diagonal block $W_g$ encodes within-group covariances. In the minimal compound-symmetry (CS) case, $W_g = v_g I_{n_g} + c_{g,g} J_{n_g}$, reflecting constant within-group variance ($v_g$) and covariance ($c_{g,g}$). The resulting induced kernel $k_{\mathrm{cat}}(z,z') = T_{z,z'}$ may be combined with continuous-input kernels by product or sum.
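
To make the block structure concrete, the following minimal sketch assembles $T$ in the CS case from group sizes, within-group variances $v_g$, and the symmetric parameter matrix of $c_{g,h}$ values; the function and variable names are illustrative, not taken from the cited implementations.

```python
import numpy as np

def cs_block_covariance(group_sizes, v, C):
    """Assemble the nested-group covariance T in the compound-symmetry case.

    group_sizes : list of n_g, one per group
    v           : array of within-group variances v_g (length G)
    C           : symmetric G x G array holding c_{g,h} (between groups)
                  and c_{g,g} (within groups)
    """
    G = len(group_sizes)
    L = sum(group_sizes)
    starts = np.cumsum([0] + list(group_sizes))
    T = np.zeros((L, L))
    for g in range(G):
        sg = slice(starts[g], starts[g + 1])
        # diagonal block: W_g = v_g I + c_{g,g} J
        T[sg, sg] = v[g] * np.eye(group_sizes[g]) + C[g, g]
        for h in range(g + 1, G):
            sh = slice(starts[h], starts[h + 1])
            # off-diagonal block: B_{g,h} = c_{g,h} J
            T[sg, sh] = C[g, h]
            T[sh, sg] = C[g, h]
    return T

# Example: 3 groups of sizes 2, 3, 2 (7 levels total)
sizes = [2, 3, 2]
v = np.array([1.0, 0.8, 1.2])
C = np.array([[0.5, 0.2, 0.1],
              [0.2, 0.4, 0.0],
              [0.1, 0.0, 0.6]])
T = cs_block_covariance(sizes, v, C)
print(T.shape)  # (7, 7)
```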

For $M$ categorical factors or mixed inputs, the full kernel is

$$k\bigl((x,z),(x',z')\bigr) = k_{\mathrm{cont}}(x,x') \times k_{\mathrm{cat}}(z,z')$$

where $k_{\mathrm{cat}}$ is constructed as above for the categorical component (Perez et al., 2 Oct 2025).
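
A minimal sketch of this product combination, assuming a squared-exponential kernel on the continuous inputs and a toy categorical covariance $T$ indexed by integer-coded levels (names and values here are illustrative):

```python
import numpy as np

def rbf(x, x2, lengthscale=1.0):
    """Squared-exponential kernel on the continuous inputs."""
    d = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x2))
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def mixed_kernel(x, z, x2, z2, T, lengthscale=1.0):
    """k((x,z),(x',z')) = k_cont(x,x') * k_cat(z,z'), with k_cat(z,z') = T[z,z']."""
    return rbf(x, x2, lengthscale) * T[z, z2]

# Toy categorical covariance over 3 levels (levels 0 and 1 share a group)
T = np.array([[1.0, 0.6, 0.1],
              [0.6, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

print(mixed_kernel(np.array([0.3, 1.2]), 0, np.array([0.1, 1.0]), 1, T))
```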

2. Hierarchical Statistical Interpretation and Generalization

Nested group kernels correspond exactly to hierarchical (multi-level) Gaussian models. The group/level structure arises as the covariance of $\eta_{g,i} = \delta_g + \gamma_{g,i}$, where $\delta_g$ are group means and $\gamma_{g,i}$ are centered level effects ($\sum_i \gamma_{g,i} = 0$) (Roustant et al., 2018):

  • Between-group covariance: $\mathrm{Cov}(\delta_g, \delta_h) = \Sigma_g[g,h]$
  • Within-group covariance: $\mathrm{Cov}(\gamma_{g,i}, \gamma_{g,j})$, parameterized by $M_g$ (on the Helmert basis), or more generally by any positive-semidefinite, centered matrix.

Relaxing the CS assumption, each $W_g$ may be any positive semi-definite $n_g \times n_g$ block satisfying $W_g - \bar W_g J_{n_g} \succeq 0$, where $\bar W_g = (n_g)^{-2} \mathbf{1}^T W_g \mathbf{1}$. This yields the Generalized Compound-Symmetry (GCS) block structure (Roustant et al., 2018, Perez et al., 2 Oct 2025):

$$W_g = B^*_{g,g} J_{n_g, n_g} + A_g M_g A_g^T$$

where $A_g$ is any orthonormal basis for the hyperplane orthogonal to $\mathbf{1}_{n_g}$ and $M_g$ is centered.
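
As an illustration, the sketch below builds a GCS diagonal block from an orthonormal basis $A_g$ of the hyperplane orthogonal to $\mathbf{1}_{n_g}$ (obtained here via a QR factorization, one of several valid choices) and an arbitrary positive-semidefinite $M_g$; the helper names are illustrative.

```python
import numpy as np

def ones_complement_basis(n):
    """Orthonormal basis A (n x (n-1)) of the hyperplane orthogonal to the all-ones vector."""
    # QR of [1 | e_1 ... e_{n-1}] gives an orthonormal frame whose first column
    # is proportional to 1; the remaining columns span its orthogonal complement.
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n), np.eye(n)[:, : n - 1]]))
    return Q[:, 1:]

def gcs_block(n, b_star, M):
    """W_g = B*_{g,g} J + A_g M_g A_g^T, with M_g positive semi-definite."""
    A = ones_complement_basis(n)
    return b_star * np.ones((n, n)) + A @ M @ A.T

n = 4
rng = np.random.default_rng(0)
S = rng.standard_normal((n - 1, n - 1))
M = S @ S.T                       # any PSD (n-1) x (n-1) matrix
W = gcs_block(n, b_star=0.3, M=M)

# The centered part A M A^T has zero row sums, so every row mean of W equals b*
print(np.allclose(W.sum(axis=1) / n, W.mean()))
```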

3. Positive-Definiteness via Averaged Block Matrix

The crucial positive-definiteness test for $T$ relies on the associated $G \times G$ "block-mean" matrix $\widetilde{T}$ (Roustant et al., 2018, Perez et al., 2 Oct 2025):

$$\widetilde T_{g,h} = \frac{1}{n_g n_h} \sum_{i \in G_g} \sum_{j \in G_h} T_{(g,i),(h,j)}$$

Theorem (Roustant et al.): $T \succeq 0$ iff $\widetilde T \succeq 0$; $T \succ 0$ iff $\widetilde T \succ 0$ and each diagonal block $W_g \succ 0$.

This result ensures scalability: inference and positivity checks reduce to matrices of size $G$ (number of groups), not $L$ (number of levels), critically enabling use with variables of dozens or hundreds of levels grouped into relatively few blocks.
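
A minimal sketch of the check implied by the theorem: compute the $G \times G$ block-mean matrix $\widetilde T$ and test its eigenvalues along with those of each diagonal block. This is illustrative code under the grouping convention of the earlier sketch, not the authors' implementation.

```python
import numpy as np

def block_mean_matrix(T, group_sizes):
    """G x G matrix of block means of T, following the grouping of levels."""
    starts = np.cumsum([0] + list(group_sizes))
    G = len(group_sizes)
    Tm = np.zeros((G, G))
    for g in range(G):
        for h in range(G):
            block = T[starts[g]:starts[g + 1], starts[h]:starts[h + 1]]
            Tm[g, h] = block.mean()
    return Tm

def is_positive_definite_nested(T, group_sizes, tol=1e-10):
    """T > 0 iff the block-mean matrix > 0 and every diagonal block W_g > 0."""
    Tm = block_mean_matrix(T, group_sizes)
    if np.linalg.eigvalsh(Tm).min() <= tol:
        return False
    starts = np.cumsum([0] + list(group_sizes))
    for g in range(len(group_sizes)):
        Wg = T[starts[g]:starts[g + 1], starts[g]:starts[g + 1]]
        if np.linalg.eigvalsh(Wg).min() <= tol:
            return False
    return True

# Example with 2 groups of sizes 2 and 3
sizes = [2, 3]
T = np.block([[1.0 * np.eye(2) + 0.5, 0.2 * np.ones((2, 3))],
              [0.2 * np.ones((3, 2)), 0.8 * np.eye(3) + 0.4]])
print(is_positive_definite_nested(T, sizes))  # True
```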

4. Clustering-Based Nested Group Kernels for Unknown Structures

When group structure is unknown, levels are partitioned using automatic clustering (Perez et al., 2 Oct 2025). Each level $c$ is encoded by its observed conditional mean and standard deviation (MSD encoding):

$$\psi(c) = (\mu_c, \sigma_c), \qquad \mu_c = \frac{1}{N_c}\sum_{i\,:\,z^{(i)}=c} y^{(i)}, \qquad \sigma_c = \sqrt{\frac{1}{N_c}\sum_{i\,:\,z^{(i)}=c}\bigl(y^{(i)} - \mu_c\bigr)^2}$$

A distance is defined between levels, and hierarchical clustering is applied, with the number of clusters $Q$ selected by maximizing the mean Silhouette score over $2 \le Q \le C-1$. The resulting partition $(\hat{\mathcal{G}}_1, \dots, \hat{\mathcal{G}}_Q)$ is then used to instantiate the nested block kernel $T$.
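
A plausible implementation of this procedure with scikit-learn is sketched below; the Euclidean distance on MSD features and the default Ward linkage are illustrative choices and may differ from the cited papers' exact distance and linkage.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def msd_encode(z, y):
    """Encode each categorical level by the mean and std of its observed responses."""
    levels = np.unique(z)
    feats = np.array([[y[z == c].mean(), y[z == c].std()] for c in levels])
    return levels, feats

def cluster_levels(z, y):
    """Hierarchical clustering of MSD-encoded levels; Q chosen by mean silhouette."""
    levels, feats = msd_encode(z, y)
    C = len(levels)
    best_q, best_score, best_labels = None, -np.inf, None
    for q in range(2, C):                      # 2 <= Q <= C - 1
        labels = AgglomerativeClustering(n_clusters=q).fit_predict(feats)
        score = silhouette_score(feats, labels)
        if score > best_score:
            best_q, best_score, best_labels = q, score, labels
    return levels, best_labels, best_q

# Toy data: 6 levels whose responses fall into two latent groups
rng = np.random.default_rng(0)
z = rng.integers(0, 6, size=300)
y = np.where(z < 3, 1.0, 5.0) + 0.3 * rng.standard_normal(300)
levels, labels, q = cluster_levels(z, y)
print(q, dict(zip(levels, labels)))
```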

Alternatively, pseudo-distances from a pilot kernel (e.g., LVGP) may inform clustering, provided the kernel supplied is positive-definite.

5. Algorithmic Implementation and Hyperparameter Estimation

Parameter estimation for nested group kernels involves learning both continuous and categorical kernel components. For the categorical blocks:

  • CS parameterization: 2 parameters per group (within-group), plus $G(G-1)/2$ parameters (between-group); or, more generally, the entries of $M_g$ (centered covariance) and $B^*$.
  • Optimization: maximization of the log-marginal likelihood using L-BFGS-B (SciPy), with extensive multi-start restarts (up to 96), in either "long" or "short" settings depending on computational constraints (Perez et al., 2 Oct 2025); see the sketch after this list.
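
A minimal sketch of the multi-start estimation loop, assuming a generic, hypothetical covariance builder `build_cov(theta, X)` and a zero-mean Gaussian likelihood; this illustrates the strategy described above rather than the authors' exact code.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, y, build_cov, noise=1e-6):
    """Negative log-marginal likelihood of a zero-mean GP with covariance build_cov(theta, X)."""
    K = build_cov(theta, X) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def fit_multistart(X, y, build_cov, bounds, n_restarts=96, seed=0):
    """L-BFGS-B with random restarts, keeping the best optimum found."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        try:
            res = minimize(neg_log_marginal_likelihood, theta0,
                           args=(X, y, build_cov), method="L-BFGS-B", bounds=bounds)
        except np.linalg.LinAlgError:
            continue  # skip restarts where the proposed covariance is not PD
        if best is None or res.fun < best.fun:
            best = res
    return best
```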

For deep kernel architectures (SVM context), nested grouping is achieved by stacking several layers, each linearly combining groupwise kernels from the previous layer:

$$K^{(\ell)}(x,y) = \sum_{h=1}^H \sum_{j=1}^m \theta_{h,j}^{(\ell)} K_{h,j}^{(\ell-1)}(x,y)$$

with nonnegativity constraints and optional normalization. The optimization objective minimizes a smoothed span-bound, an upper bound on SVM leave-one-out error, offering superior generalization on limited data (Strobl et al., 2013).
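
A sketch of one such layer under nonnegativity constraints and optional normalization, flattening the double sum over groups and kernels into a single list of previous-layer Gram matrices; the span-bound optimization of the weights is omitted, and all names are illustrative.

```python
import numpy as np

def combine_layer(prev_kernels, theta, normalize=True):
    """K^(l) = sum_j theta_j * K_j^(l-1), with theta_j >= 0 and optional normalization.

    prev_kernels : list of Gram matrices from the previous layer (all n x n)
    theta        : nonnegative weights, one per previous-layer kernel
    """
    theta = np.clip(np.asarray(theta, dtype=float), 0.0, None)   # enforce nonnegativity
    K = sum(t * Kj for t, Kj in zip(theta, prev_kernels))
    if normalize:
        d = np.sqrt(np.diag(K))
        K = K / np.outer(d, d)          # unit-diagonal (cosine) normalization
    return K

# Two toy base kernels on 4 points
X = np.array([[0.0], [1.0], [2.0], [3.0]])
K_lin = X @ X.T + 1.0                              # linear kernel (+ bias)
K_rbf = np.exp(-0.5 * (X - X.T) ** 2)              # RBF kernel
K1 = combine_layer([K_lin, K_rbf], theta=[0.7, 0.3])
print(K1.shape)
```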

For DCN-style group kernels, each layer pools responses over group transformations, invoking hierarchical composition of group-averaged arc-cosine kernels. The layer-$\ell$ kernel recursively composes group averages and rectified nonlinearities (Anselmi et al., 2015).
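
To illustrate the pooling idea, the sketch below averages an order-1 arc-cosine kernel over a finite transformation group, using cyclic shifts as a stand-in group; this is an interpretation of the construction under those assumptions, not the paper's code.

```python
import numpy as np

def arccos_kernel_1(x, y):
    """Order-1 arc-cosine kernel (rectified-linear nonlinearity)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return (nx * ny / np.pi) * (np.sin(theta) + (np.pi - theta) * cos_t)

def group_averaged_kernel(x, y, transforms):
    """Pool the base kernel over a finite group of transformations of one argument."""
    return np.mean([arccos_kernel_1(x, g(y)) for g in transforms])

# Finite group: all cyclic shifts of a length-d vector
d = 4
shifts = [lambda v, s=s: np.roll(v, s) for s in range(d)]

x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0, 0.0])
print(group_averaged_kernel(x, y, shifts))   # invariant to shifting y
```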

6. Empirical Performance and Applications

Empirical studies demonstrate robust advantages for nested group kernels across tasks with categorical variables exhibiting group structure (Roustant et al., 2018, Perez et al., 2 Oct 2025). Notable results include:

  • Outperformance of one-hot and CS kernels in both accuracy (median RRMSE) and efficiency (parsimonious parameter counts: e.g., ~30 nested-group parameters vs. ~4000 in a full covariance for a 94-level factor) (Roustant et al., 2018).
  • Effective automatic group extraction for unknown structures, placing Nested He/He + MSD encoding, or alternative clustering-based nested kernels, on the Pareto front of accuracy/training-time tradeoffs.
  • Consistent empirical dominance in Area Under Curve (AUC) of performance profiles across diverse datasets.
  • In the nuclear waste inverse problem, block-structured modeling yielded +10% relative $Q^2$ improvement on test points, with greater stability across training splits.

For deep multiple kernel learning, increasing nesting layers yields incremental accuracy improvement, with diminishing but non-negligible returns, and well-controlled generalization even with only a few base kernels per layer (Strobl et al., 2013).

In hierarchical convolutional networks, the nested group kernel formalism provides an exact algebraic equivalence for invariance/selectivity and memory minimization conjectured for DCN pooling layers (Anselmi et al., 2015).

7. Significance and Theoretical Implications

Nested group kernels unify statistical, algebraic, and learning-theoretic perspectives:

  • They allow for both positive and negative within-group correlations, supporting richer random-effects and prior structures.
  • Positivity conditions and kernel validity are efficiently enforced at the group-mean (block-average) level, ensuring practicality for high-cardinality categorical features.
  • Hierarchical nesting, whether in block matrices for GP regression or layers in deep kernel networks, preserves positive definiteness and enables selective/invariant representations with minimal memory cost.
  • The approach facilitates integration with any continuous-input kernel via product, sum, or ANOVA mechanisms.

A plausible implication is that such kernels, by leveraging block-structured regularization and efficient parameterization, provide a principled route to scalable, interpretable, and statistically efficient model construction for both GP and deep kernel learners in settings with complex categorical data.
