Hierarchical Decomposition of Kernel Matrices

Updated 1 April 2026

Hierarchical decomposition of kernel matrices is a method that recursively partitions large matrices into blocks and approximates far-field blocks with low-rank factorizations.
This approach achieves near-linear storage and computational complexity through adaptive refinement and strict error control while preserving key spectral properties.
It underpins scalable algorithms for applications in numerical PDEs, Gaussian processes, boundary element methods, and neural tangent kernel compression using techniques like ACA, ID, and skeletonization.

The hierarchical decomposition of kernel matrices is a class of matrix approximation techniques that exploit the blockwise, data-dependent low-rank structure often present in large kernel matrices arising from numerical partial differential equations, integral equations, and machine learning. By recursively partitioning the matrix into smaller blocks based on geometric or data-driven criteria and efficiently approximating off-diagonal (far-field) blocks with low-rank factorizations, hierarchical matrix (H-matrix), H²-matrix, and related frameworks enable near-linear storage and computational complexity. These methods provide error guarantees, adaptive refinement, and can preserve key spectral or structural properties of the original matrices, making them foundational in scalable algorithms for dense numerical linear algebra, Gaussian processes, kernel regression, boundary element methods, and neural tangent kernel compression.

1. Hierarchical Partitioning and Admissibility

Hierarchical decomposition starts with a recursive partition of the index set associated with matrix rows and columns. Typically, a binary or multiway (e.g., k-d) tree (cluster tree) is constructed, organizing data points or degrees of freedom into clusters at progressively finer scales. The fundamental concept is a block-cluster tree: every node in the cluster tree represents a set of indices, and the Cartesian product of parent/child clusters defines matrix blocks (submatrices).

Blocks are classified as admissible (far-field) or inadmissible (near-field) by an admissibility condition. For two clusters $t, s \subset I$ , a standard geometric condition is

$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$

where $B_t$ denotes the bounding box of cluster $t$ , with $\eta > 0$ a parameter controlling separation (Mango et al., 2024, Zechner et al., 2014, Nguyen et al., 2022, Gaddameedi et al., 2023). Admissible blocks are expected to be numerically low-rank and thus suitable for compression.

The decomposition produces a hierarchical matrix structure where:

Near-field (inadmissible) blocks are stored in dense form.
Far-field (admissible) blocks are approximated by low-rank factorizations.

This partitioning continues recursively until leaf clusters reach a preset size or admissibility is achieved.

2. Low-Rank Approximation Strategies for Admissible Blocks

For every admissible block $K_{I,J}$ , the hierarchical method seeks a low-rank decomposition:

$K_{I,J} \approx U_{I,J} V_{I,J}^T, \quad U_{I,J} \in \mathbb{R}^{|I| \times r}, \quad V_{I,J} \in \mathbb{R}^{|J| \times r}$

with modest $r \ll \min(|I|,|J|)$ (Mango et al., 2024, Zechner et al., 2014, Rebrova et al., 2018, Nguyen et al., 2022).

Common techniques include:

Analytic/interpolation-based: Tensorized Chebyshev nodes and Lagrange polynomials (Mango et al., 2024, Zechner et al., 2014).
Algebraic: Adaptive cross approximation (ACA), strong rank-revealing QR (sRRQR), randomized sampling (Mango et al., 2024, Rebrova et al., 2018, Xing et al., 2018, Cai et al., 2022).
Interpolative decomposition (ID): For nested-basis H²-matrices, block-wise ID with proxy points or representative sets (Xing et al., 2018, Cai et al., 2022).
Skeletonization: Selection of skeleton rows/columns to capture the essential range of the block (Gaddameedi et al., 2023).

The choice between these techniques depends on kernel regularity, availability of analytic derivatives, and computational context. For kernels with analytic structure, polynomial interpolation is powerful; otherwise, ACA and data-driven methods are preferred.

Hierarchical decompositions provide local and global control over the approximation error. For an admissible block:

$\varepsilon_{IJ} = \|K_{IJ} - U_{IJ}V_{IJ}^T\|_F$

A block-wise tolerance $\tau_{IJ}$ is imposed, with adaptive mechanisms:

If $\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 0, recursively increase the rank or split the block into children to achieve the target accuracy (Mango et al., 2024).
The global error is bounded by summing over local errors:

$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 1

This ensures convergence of the hierarchical scheme to a user-prescribed error, preserving essential spectral properties needed for applications involving iterative solvers, neural kernel optimization, or eigenanalysis (Mango et al., 2024, Rebrova et al., 2018, Gaddameedi et al., 2023).

4. Computational Complexity and Storage Efficiency

Hierarchical matrix variants exhibit superior storage and arithmetic complexity compared to their dense counterparts.

Format	Storage	Mat-Vec	Direct Solve (LU, Cholesky, ULV)
Dense	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 2	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 3	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 4
H-matrix	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 5	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 6	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 7
HSS (nested)	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 8	$\text{dist}(B_t, B_s) \geq \eta \cdot \max\{\text{diam}(B_t), \text{diam}(B_s)\},$ 9	$B_t$ 0
H²-matrix	$B_t$ 1	$B_t$ 2	$B_t$ 3

Here, $B_t$ 4 denotes the maximal blockwise numerical rank, and $B_t$ 5 is the matrix dimension (Mango et al., 2024, Boukaram et al., 2019, Rebrova et al., 2018, Cai et al., 2022, Zechner et al., 2014). Techniques such as GPU acceleration, tree flattening, and batched linear algebra are employed for practical efficiency on modern hardware (Boukaram et al., 2019).

5. Spectral and Structural Preservation

In applications demanding accuracy in spectral or conditioning properties—e.g., compression of neural tangent kernels in PINNs—the hierarchical methods guarantee perturbation bounds:

$B_t$ 6

where $B_t$ 7 is the matrix approximation tolerance, with $B_t$ 8 dependent on network architecture and kernel smoothness. Matrix perturbation theory asserts that the condition numbers and leading eigenspaces of the compressed matrices are preserved up to $B_t$ 9 (Mango et al., 2024).

Generalization and convergence behavior of the original learning or physics-informed model is maintained, as the training dynamics are only mildly perturbed.

6. Specialized Algorithms and Applications

Various application-driven hierarchical decompositions have been developed:

Physics-Informed Neural Network (PINN) Compression: Error-bounded H-matrix approximation of NTK yields substantial accuracy and speed benefits over SVD, pruning, and quantization across a suite of PDE benchmarks. Inference is accelerated by up to $t$ 0 with sub- $t$ 1 accuracy loss (Mango et al., 2024).
Boundary Element Method (BEM): Isogeometric collocation, geometric partitioning, and kernel interpolation achieve $t$ 2 scaling with exponential error decay as order increases (Zechner et al., 2014).
Gaussian Processes (GP): HODLR and H-matrix decompositions enable scalable $t$ 3 computation of likelihoods, derivatives, and Fisher information (Geoga et al., 2018).
Integral Equation Solvers: Multi-level substructuring and H-matrices realize $t$ 4 complexity for large-scale eigenvalue computations (Gerds, 2017).
Machine Learning Kernels: In kernel ridge regression, hierarchical compression with clustering and blockwise low-rank approximation allows scaling to datasets with millions of instances without loss in accuracy (Rebrova et al., 2018).
Diffusion Maps and Spectral Embeddings: Hierarchical approximations incorporated within iterative Lanczos eigensolvers reduce complexity from $t$ 5 to $t$ 6 per eigenpair computation while preserving spectral accuracy (Gaddameedi et al., 2023).

7. Extensions and Comparisons

Parametric hierarchical matrices further decouple kernel hyperparameter dependence, supporting offline/online workflows with rapid instantiation of H-matrix structures for any parameter value, crucial for applications such as Gaussian process hyperparameter optimization (Khan et al., 5 Nov 2025). Nested basis H² and data-driven HiDR methods achieve true $t$ 7 complexity by constructing global bases via local data reduction strategies, outperforming polynomial interpolation approaches in storage and accuracy for challenging kernels (Cai et al., 2022).

Hierarchically compositional kernels, defined at the function level, not only guarantee strict positive-definiteness but offer natural out-of-sample extensions and efficient $t$ 8 training (Chen et al., 2016).

Alternative techniques, such as SVD-based global low-rank approximation, Nystrom methods, and randomized SVD, may provide similar or improved performance in regimes where the full matrix is globally compressible, but frequently exhibit inferior memory-accuracy tradeoffs for inhomogeneous or high-dimensional datasets (Mango et al., 2024, Gaddameedi et al., 2023).

Hierarchical decomposition of kernel matrices underpins a spectrum of scalable numerical, statistical, and machine-learning algorithms, providing a blend of provable accuracy and computational efficiency across numerous domains. The continued development of adaptive, parametric, and hardware-optimized variants ensures its centrality in the solution of large-scale problems involving kernel matrices.