Exact Generalisation Error for GNNs

Updated 15 September 2025
  • The paper rigorously characterizes the exact generalisation error of one-hidden-layer GNNs by linking prediction accuracy to graph structure, feature space, and architecture.
  • It employs tensor initialization and accelerated gradient descent to achieve linear convergence for regression and statistically consistent recovery for classification.
  • The analysis explicitly relates sample complexity to graph properties, ensuring actionable insights for parameter recovery and practical performance across diverse structures.

Graph neural networks (GNNs) provide a framework for learning representations from graph-structured data. The exact generalisation error for GNNs quantifies their ability to make accurate predictions on unseen data, directly linking GNN performance to properties of the graph, the feature space, the chosen architecture, and the learning algorithm. Recent advances have moved beyond classical loose upper bounds to precise, model- and data-dependent characterisations. In particular, exact generalisation error analysis for GNNs with one hidden layer—under conditions where a ground-truth model exists—offers the first rigorous and practically meaningful theoretical guarantees for parameter recovery and prediction.

1. Theoretical Setting and Model Assumptions

The framework focuses on one-hidden-layer GNNs for both regression and binary classification, assuming the existence of a ground-truth model such that the optimal parameters W^* yield zero generalisation error in the population risk for regression. The key assumptions are:

  • Node features are i.i.d. standard Gaussian vectors.
  • Labels are generated via a ground-truth GNN, aggregating node features using a normalized adjacency matrix A reflecting graph structure (with maximum degree δ, average degree δ_ave, and largest singular value σ_1(A)).
  • The GNN consists of K filters, with nonlinear activations: ReLU for regression, sigmoid for classification.
  • The risk functions considered are the empirical and population risks over the training sample Ω and the feature-label generating distribution; for regression, f(W) = (1/(2|Ω|)) Σ_{n∈Ω} |y_n − g(W; x_n)|^2.

This setup emphasizes the joint statistical coupling between node features, the aggregation structure imposed by the graph, and the task-specific generation of outputs. Importantly, the analysis is local to a neighborhood of the optimum W^* in which the population risk is strongly convex, and a suitably accurate initialization is guaranteed to land within this neighborhood.
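
To make the setup concrete, the following NumPy sketch implements one plausible form of the one-hidden-layer GNN output and the empirical squared-loss risk above; the averaging over the K filters and the ReLU pooling are assumptions about the output map, not taken verbatim from the paper.

```python
import numpy as np

def gnn_output(W, X, A):
    """Illustrative one-hidden-layer GNN output for every node:
    g(W; X)_n = (1/K) * sum_k ReLU(w_k^T (A X)_n).

    W: (K, d) filter weights, X: (N, d) node features,
    A: (N, N) normalised adjacency used for neighbour aggregation.
    """
    Z = A @ X                      # aggregated features, shape (N, d)
    H = np.maximum(Z @ W.T, 0.0)   # ReLU activations, shape (N, K)
    return H.mean(axis=1)          # average over the K filters -> (N,)

def empirical_risk(W, X, A, y, idx):
    """Squared-loss empirical risk over the labelled sample Omega (idx):
    f(W) = (1 / (2|Omega|)) * sum_{n in Omega} (y_n - g(W; x_n))^2."""
    pred = gnn_output(W, X, A)[idx]
    return 0.5 * np.mean((y[idx] - pred) ** 2)
```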

2. Learning Algorithm: Tensor Initialization and Accelerated Optimization

The learning algorithm addressing the exact generalisation error problem is a two-stage procedure:

  • Tensor Initialization: Initial parameter estimates are constructed via tensor methods. Specifically, tensors M_1 (for scaling) and M_3 (for direction) are computed by taking expectations of combinations of node features, labels, and the nonlinearity, reflecting the GNN's neighbor-aggregation structure. The third-order tensor M_3 is used to recover the directions of the true weights via tensor decomposition, after a projection informed by M_2 (a second-order statistic). Once directions and magnitudes are recovered, the initial weights W^(0) are formed.
  • Accelerated Gradient Descent (AGD): With a well-initialized W^(0), accelerated updates using the heavy-ball method (with step size η and momentum β) are performed:

W^{(t+1)} = W^{(t)} - \eta \nabla \hat{f}_{\Omega_t}(W^{(t)}) + \beta (W^{(t)} - W^{(t-1)}).

Here, the gradient is computed over a fresh subsample Ω_t at each iteration. Setting β = 0 recovers standard (vanilla) gradient descent.
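
A minimal sketch of this heavy-ball loop, assuming a caller-supplied gradient oracle `grad_risk` and an iterable of fresh subsamples (both names are illustrative placeholders, not the paper's code):

```python
import numpy as np

def heavy_ball(W0, grad_risk, sample_batches, eta=0.1, beta=0.5):
    """Heavy-ball (accelerated) gradient descent:
    W^{t+1} = W^t - eta * grad f_{Omega_t}(W^t) + beta * (W^t - W^{t-1}).
    Setting beta = 0 recovers vanilla gradient descent.
    """
    W_prev, W = W0.copy(), W0.copy()
    for omega_t in sample_batches:          # fresh subsample each iteration
        grad = grad_risk(W, omega_t)        # gradient of the empirical risk on Omega_t
        W_next = W - eta * grad + beta * (W - W_prev)
        W_prev, W = W, W_next
    return W
```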

For regression, these procedures guarantee exact recovery of W^*; for binary classification, the algorithm converges to a statistically consistent estimator within O(√(1/|Ω|)) of W^*.

3. Convergence Guarantees and Generalisation Error

Rigorous convergence results are established under the aforementioned assumptions. For regression:

  • Linear convergence to W^* is guaranteed, with a rate depending on algorithmic and graph parameters:

\|W^{(t)} - W^*\|_2 \leq \nu(\beta)^t \|W^{(0)} - W^*\|_2.

The contraction factor for vanilla GD satisfies ν(0) ≥ 1 − (1 − ε_0)/(88κ^2γK), with κ = σ_1(W^*)/σ_K(W^*) the condition number, γ a product of singular values, and K the number of filters. For optimal acceleration, ν(β^*) = 1 − (1 − ε_0)/√(88κ^2γK), which is strictly smaller, confirming genuine acceleration.
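
For intuition, the short snippet below evaluates the two contraction factors for arbitrary illustrative values of ε_0, κ, γ, and K (not taken from the paper), showing that the accelerated factor is strictly smaller whenever 88κ^2γK > 1:

```python
kappa, gamma, K, eps0 = 2.0, 1.5, 4, 0.1     # illustrative values only
denom = 88 * kappa**2 * gamma * K
nu_gd  = 1 - (1 - eps0) / denom              # lower bound on the vanilla-GD factor
nu_agd = 1 - (1 - eps0) / denom**0.5         # factor with optimal momentum
print(f"GD contraction >= {nu_gd:.4f}, AGD contraction = {nu_agd:.4f}")
```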

For binary classification:

  • The estimator converges to a critical point Ŵ satisfying

\|\widehat{W} - W^*\|_2 \leq C_3 (1 - \varepsilon_0)^{-1} \kappa^2 \gamma K \sqrt{\frac{(1+\delta^2) d \log N}{|\Omega|}}.

Thus, by enlarging the training sample size |Ω|, the statistical error becomes arbitrarily small.
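
As a quick illustration of this scaling, the snippet below evaluates the right-hand side of the bound for a few sample sizes, with every constant set to a placeholder value (none come from the paper); it merely shows the O(√(1/|Ω|)) decay, i.e. a fourfold increase in |Ω| halves the bound.

```python
import numpy as np

C3, eps0, kappa, gamma, K = 1.0, 0.1, 2.0, 1.5, 4   # placeholder constants
delta, d, N = 4, 16, 1000                            # max degree, feature dim, #nodes

def classification_error_bound(n_samples):
    """Right-hand side of the classification bound for a given |Omega|."""
    return (C3 / (1 - eps0)) * kappa**2 * gamma * K * np.sqrt(
        (1 + delta**2) * d * np.log(N) / n_samples)

for n in (1_000, 4_000, 16_000):
    print(n, classification_error_bound(n))
```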

The generalisation error is therefore precisely quantified—not as an abstract bound, but as an explicit function of the initialization accuracy, graph properties, and optimization hyperparameters.

4. Sample Complexity and Graph Structural Dependencies

A salient feature is the explicit sample complexity required for exact or near-exact recovery of the ground-truth GNN parameters. For regression with a guaranteed convergence neighborhood, it suffices to take

|\Omega| \geq C_1 \varepsilon_0^{-2} \kappa^9 \gamma^2 (1+\delta^2) \sigma_1^4(A) K^8 d \log N \log(1/\varepsilon),

where d is the input feature dimension, N is the total number of nodes, and ε is the risk accuracy.
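
To make the scaling concrete, the sketch below evaluates this lower bound with placeholder constants (C_1, ε_0, κ, and γ are illustrative, not values from the paper), for instance to see the linear growth in d:

```python
import numpy as np

def sample_complexity(d, N, K, delta, sigma1_A, eps,
                      C1=1.0, eps0=0.1, kappa=2.0, gamma=1.5):
    """Illustrative evaluation of the regression sample-size lower bound."""
    return (C1 * eps0**-2 * kappa**9 * gamma**2 * (1 + delta**2)
            * sigma1_A**4 * K**8 * d * np.log(N) * np.log(1 / eps))

print(sample_complexity(d=16, N=1_000, K=2, delta=4, sigma1_A=1.0, eps=1e-3))
print(sample_complexity(d=32, N=1_000, K=2, delta=4, sigma1_A=1.0, eps=1e-3))  # ~2x larger
```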

Key consequences:

  • Required samples scale linearly with d, polynomially with K, and only logarithmically with N.
  • The dependence on (1 + δ^2)σ_1^4(A) highlights the role of the graph: denser graphs (large δ or large σ_1(A)) increase sample complexity, reflecting more challenging neighbor-aggregation dependencies.

This structural dependence precisely quantifies the inherent difficulty of GNN learning as a function of graph connectivity, filling an important theoretical gap left open by prior analyses; the short sketch below illustrates how the relevant graph quantities behave for concrete topologies.
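
The sketch computes the two graph quantities driving this dependence, the maximum degree δ and the largest singular value σ_1, for a sparse cycle and a denser random regular graph. It uses the raw adjacency matrix purely for intuition; the paper works with a normalised adjacency, so the absolute numbers differ, but the trend (denser graph, larger δ and σ_1) is the same.

```python
import networkx as nx
import numpy as np

def graph_quantities(G):
    """Maximum degree and largest singular value of the (raw) adjacency matrix."""
    Adj = nx.adjacency_matrix(G).toarray().astype(float)
    delta = int(Adj.sum(axis=1).max())   # maximum degree
    sigma1 = np.linalg.norm(Adj, 2)      # largest singular value
    return delta, sigma1

print(graph_quantities(nx.cycle_graph(100)))               # sparse: delta = 2, sigma1 = 2
print(graph_quantities(nx.random_regular_graph(6, 100)))   # denser: delta = 6, sigma1 = 6
```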

5. Numerical Validation and Performance Assessment

Empirical studies are conducted on synthetic graphs of varying topology (cycles, grids, random regular graphs, and graphs with bounded degree) and feature dimensionalities. Key observations include:

  • For both regression and classification, convergence is linear as predicted. AGD consistently requires fewer iterations to achieve a specified error threshold than vanilla GD, confirming theoretical acceleration.
  • The empirical success rate for exact recovery aligns with the predicted sample complexity: as the maximum degree δ or the feature dimension d increases, more samples are needed to recover W^* accurately.
  • In classification, the empirical distance to W^* decays as O(√(1/|Ω|)), in line with statistical theory and indicating that generalisation improves with sample size even if W^* is not a global minimizer of the (nonconvex) cross-entropy loss.

These findings show that the derived guarantees not only hold in theory but are also borne out empirically across a variety of graph structures and GNN tasks.

6. Implementation Considerations and Practical Trade-offs

Implementing the exact generalisation error guarantees involves several considerations:

  • Computational complexity: Tensor initialization requires constructing and decomposing high-order moment tensors, with computational cost depending on d and K. For moderate graph and feature sizes, algorithms such as those proposed in the referenced tensor decomposition literature (e.g., KCL15) are tractable.
  • Algorithm robustness: The AGD update (especially with a large momentum parameter) is sensitive to the conditioning of the local loss landscape; accurate tensor initialization is essential to remain within the strongly convex neighborhood of W^*.
  • Sample size: In practice, exact recovery is feasible only when the sample size |Ω| is large enough to dominate graph-induced dependencies (i.e., high δ or large σ_1(A) requires more data); otherwise convergence is restricted or the statistical error dominates.
  • Choice of nonlinearity: While the analysis accommodates nonsmooth activations (e.g., ReLU), further generalizations to deeper or more complex nonlinear architectures may require additional conditions or alternative initialization strategies.

A practical implementation of the reported algorithmic scheme in a modern machine learning framework would involve batch computation of statistics for tensor initialization, followed by AGD updates, potentially leveraging standard acceleration techniques.
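
A self-contained sketch of such a two-stage loop is given below. For brevity, the tensor-method initializer is replaced by a small random initialization (a stand-in, not the paper's procedure), and the gradient is the analytic one for the squared loss of the ReLU model used in the earlier sketch; all names and hyperparameter values are illustrative.

```python
import numpy as np

def train_gnn(X, A, y, batches, K, eta=0.05, beta=0.5, seed=0):
    """Two-stage sketch: initialize W, then run heavy-ball AGD on the squared loss
    of a one-hidden-layer ReLU GNN with mean pooling over K filters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Z = A @ X                                        # neighbour-aggregated features
    W = W_prev = 0.1 * rng.standard_normal((K, d))   # stand-in for tensor initialization
    for idx in batches:                              # fresh subsample per iteration
        H = Z[idx] @ W.T                             # pre-activations, shape (|Omega_t|, K)
        pred = np.maximum(H, 0.0).mean(axis=1)       # GNN output on the batch
        resid = pred - y[idx]
        # gradient of 0.5 * mean((y - g)^2) w.r.t. W, using ReLU'(h) = 1{h > 0}
        grad = ((resid[:, None] * (H > 0)) / K).T @ Z[idx] / len(idx)
        W, W_prev = W - eta * grad + beta * (W - W_prev), W
    return W
```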

7. Summary and Impact

This line of analysis provides the first theoretically precise and practically relevant characterisation of the exact generalisation error for one-hidden-layer GNNs in both regression and binary classification. The performance guarantees—linear convergence and explicit generalisation error as a function of graph and model parameters—are obtained using tensor-based initialization and accelerated optimization, with sample complexity explicitly tied to graph structure. Numerical verification supports the theoretical predictions, reinforcing the utility of the derived methods for real-world GNN learning tasks where rigorous generalizability is paramount. This framework closes a critical gap in the literature and provides actionable insights for algorithm and architecture design in graph-based learning systems.
