Exact Generalisation Error for GNNs

Updated 15 September 2025
  • The paper rigorously characterizes the exact generalisation error of one-hidden-layer GNNs by linking prediction accuracy to graph structure, feature space, and architecture.
  • It employs tensor initialization and accelerated gradient descent to achieve linear convergence for regression and statistically consistent recovery for classification.
  • The analysis explicitly relates sample complexity to graph properties, ensuring actionable insights for parameter recovery and practical performance across diverse structures.

Graph neural networks (GNNs) provide a framework for learning representations from graph-structured data. The exact generalisation error for GNNs quantifies their ability to make accurate predictions on unseen data, directly linking GNN performance to properties of the graph, the feature space, the chosen architecture, and the learning algorithm. Recent advances have moved beyond classical loose upper bounds to precise, model- and data-dependent characterisations. In particular, exact generalisation error analysis for GNNs with one hidden layer—under conditions where a ground-truth model exists—offers the first rigorous and practically meaningful theoretical guarantees for parameter recovery and prediction.

1. Theoretical Setting and Model Assumptions

The framework focuses on one-hidden-layer GNNs for both regression and binary classification, assuming the existence of a ground-truth model such that the optimal parameters W^* yield zero generalisation error in the population risk for regression. The key assumptions are:

  • Node features are i.i.d. standard Gaussian vectors.
  • Labels are generated via a ground-truth GNN, aggregating node features using a normalized adjacency matrix A reflecting graph structure (with maximum degree δ, average degree δ_ave, and largest singular value σ_1(A)).
  • The GNN consists of K filters, with nonlinear activations: ReLU for regression, sigmoid for classification.
  • The risk functions considered are the empirical and population risks over the training sample Ω and the feature-label generating distribution; for regression, f(W) = (1/(2|Ω|)) Σ_{n∈Ω} |y_n − g(W; x_n)|^2.

This setup emphasizes the joint statistical coupling between node features, the aggregation structure imposed by the graph, and the task-specific generation of outputs. Importantly, the analysis is local to a neighborhood of the optimum W^* in which the population risk is strongly convex, and a suitably accurate initialization is guaranteed to land within this neighborhood.
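
To make the setup concrete, the following NumPy sketch implements one plausible form of the one-hidden-layer GNN output and the empirical squared-loss risk above; the averaging over the K filters and the ReLU pooling are assumptions about the output map, not taken verbatim from the paper.

```python
import numpy as np

def gnn_output(W, X, A):
    """Illustrative one-hidden-layer GNN output for every node:
    g(W; X)_n = (1/K) * sum_k ReLU(w_k^T (A X)_n).

    W: (K, d) filter weights, X: (N, d) node features,
    A: (N, N) normalised adjacency used for neighbour aggregation.
    """
    Z = A @ X                      # aggregated features, shape (N, d)
    H = np.maximum(Z @ W.T, 0.0)   # ReLU activations, shape (N, K)
    return H.mean(axis=1)          # average over the K filters -> (N,)

def empirical_risk(W, X, A, y, idx):
    """Squared-loss empirical risk over the labelled sample Omega (idx):
    f(W) = (1 / (2|Omega|)) * sum_{n in Omega} (y_n - g(W; x_n))^2."""
    pred = gnn_output(W, X, A)[idx]
    return 0.5 * np.mean((y[idx] - pred) ** 2)
```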

2. Learning Algorithm: Tensor Initialization and Accelerated Optimization

The learning algorithm addressing the exact generalisation error problem is a two-stage procedure:

  • Tensor Initialization: Initial parameter estimates are constructed via tensor methods. Specifically, tensors M_1 (for scaling) and M_3 (for direction) are computed by taking expectations of combinations of node features, labels, and the nonlinearity, reflecting the GNN's neighbor-aggregation structure. The third-order tensor M_3 is used to recover the directions of the true weights via tensor decomposition, after a projection informed by M_2 (a second-order statistic). Once directions and magnitudes are recovered, the initial weights W^(0) are formed.
  • Accelerated Gradient Descent (AGD): With a well-initialized W^(0), accelerated updates using the heavy-ball method (with step size η and momentum β) are performed:

W^{(t+1)} = W^{(t)} - \eta \nabla \hat{f}_{\Omega_t}(W^{(t)}) + \beta (W^{(t)} - W^{(t-1)}).

Here, the gradient is computed over a fresh subsample Ω_t at each iteration. Setting β = 0 recovers standard (vanilla) gradient descent.
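
A minimal sketch of this heavy-ball loop, assuming a caller-supplied gradient oracle `grad_risk` and an iterable of fresh subsamples (both names are illustrative placeholders, not the paper's code):

```python
import numpy as np

def heavy_ball(W0, grad_risk, sample_batches, eta=0.1, beta=0.5):
    """Heavy-ball (accelerated) gradient descent:
    W^{t+1} = W^t - eta * grad f_{Omega_t}(W^t) + beta * (W^t - W^{t-1}).
    Setting beta = 0 recovers vanilla gradient descent.
    """
    W_prev, W = W0.copy(), W0.copy()
    for omega_t in sample_batches:          # fresh subsample each iteration
        grad = grad_risk(W, omega_t)        # gradient of the empirical risk on Omega_t
        W_next = W - eta * grad + beta * (W - W_prev)
        W_prev, W = W, W_next
    return W
```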

For regression, these procedures guarantee exact recovery of W^*; for binary classification, the algorithm converges to a statistically consistent estimator within O(√(1/|Ω|)) of W^*.

3. Convergence Guarantees and Generalisation Error

Rigorous convergence results are established under the aforementioned assumptions. For regression:

  • Linear convergence to W^* is guaranteed, with a rate depending on algorithmic and graph parameters:

\|W^{(t)} - W^*\|_2 \leq \nu(\beta)^t \|W^{(0)} - W^*\|_2.

The contraction factor for vanilla GD satisfies ν(0) ≥ 1 − (1 − ε_0)/(88κ^2γK), with κ = σ_1(W^*)/σ_K(W^*) the condition number, γ a product of singular values, and K the number of filters. For optimal acceleration, ν(β^*) = 1 − (1 − ε_0)/√(88κ^2γK), which is strictly smaller, confirming genuine acceleration.
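
For intuition, the short snippet below evaluates the two contraction factors for arbitrary illustrative values of ε_0, κ, γ, and K (not taken from the paper), showing that the accelerated factor is strictly smaller whenever 88κ^2γK > 1:

```python
kappa, gamma, K, eps0 = 2.0, 1.5, 4, 0.1     # illustrative values only
denom = 88 * kappa**2 * gamma * K
nu_gd  = 1 - (1 - eps0) / denom              # lower bound on the vanilla-GD factor
nu_agd = 1 - (1 - eps0) / denom**0.5         # factor with optimal momentum
print(f"GD contraction >= {nu_gd:.4f}, AGD contraction = {nu_agd:.4f}")
```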

For binary classification:

  • The estimator converges to a critical point Ŵ satisfying

\|\widehat{W} - W^*\|_2 \leq C_3 (1 - \varepsilon_0)^{-1} \kappa^2 \gamma K \sqrt{\frac{(1+\delta^2) d \log N}{|\Omega|}}.

Thus, by enlarging the training sample size |Ω|, the statistical error becomes arbitrarily small.
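
As a quick illustration of this scaling, the snippet below evaluates the right-hand side of the bound for a few sample sizes, with every constant set to a placeholder value (none come from the paper); it merely shows the O(√(1/|Ω|)) decay, i.e. a fourfold increase in |Ω| halves the bound.

```python
import numpy as np

C3, eps0, kappa, gamma, K = 1.0, 0.1, 2.0, 1.5, 4   # placeholder constants
delta, d, N = 4, 16, 1000                            # max degree, feature dim, #nodes

def classification_error_bound(n_samples):
    """Right-hand side of the classification bound for a given |Omega|."""
    return (C3 / (1 - eps0)) * kappa**2 * gamma * K * np.sqrt(
        (1 + delta**2) * d * np.log(N) / n_samples)

for n in (1_000, 4_000, 16_000):
    print(n, classification_error_bound(n))
```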

The generalisation error is therefore precisely quantified—not as an abstract bound, but as an explicit function of the initialization accuracy, graph properties, and optimization hyperparameters.

4. Sample Complexity and Graph Structural Dependencies

A salient feature is the explicit sample complexity required for exact or near-exact recovery of the ground-truth GNN parameters. For regression with a guaranteed convergence neighborhood, it suffices to take

|\Omega| \geq C_1 \varepsilon_0^{-2} \kappa^9 \gamma^2 (1+\delta^2) \sigma_1^4(A) K^8 d \log N \log(1/\varepsilon),

where d is the input feature dimension, N is the total number of nodes, and ε is the risk accuracy.
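
To make the scaling concrete, the sketch below evaluates this lower bound with placeholder constants (C_1, ε_0, κ, and γ are illustrative, not values from the paper), for instance to see the linear growth in d:

```python
import numpy as np

def sample_complexity(d, N, K, delta, sigma1_A, eps,
                      C1=1.0, eps0=0.1, kappa=2.0, gamma=1.5):
    """Illustrative evaluation of the regression sample-size lower bound."""
    return (C1 * eps0**-2 * kappa**9 * gamma**2 * (1 + delta**2)
            * sigma1_A**4 * K**8 * d * np.log(N) * np.log(1 / eps))

print(sample_complexity(d=16, N=1_000, K=2, delta=4, sigma1_A=1.0, eps=1e-3))
print(sample_complexity(d=32, N=1_000, K=2, delta=4, sigma1_A=1.0, eps=1e-3))  # ~2x larger
```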

Key consequences:

  • Required samples scale linearly with d, polynomially with K, and only logarithmically with N.
  • The dependence on (1 + δ^2)σ_1^4(A) highlights the role of the graph: denser graphs (large δ or large σ_1(A)) increase sample complexity, reflecting more challenging neighbor-aggregation dependencies.

This structural dependence precisely quantifies the inherent difficulty of GNN learning as a function of graph connectivity, filling an important theoretical gap left open by prior analyses; the short sketch below illustrates how the relevant graph quantities behave for concrete topologies.
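
The sketch computes the two graph quantities driving this dependence, the maximum degree δ and the largest singular value σ_1, for a sparse cycle and a denser random regular graph. It uses the raw adjacency matrix purely for intuition; the paper works with a normalised adjacency, so the absolute numbers differ, but the trend (denser graph, larger δ and σ_1) is the same.

```python
import networkx as nx
import numpy as np

def graph_quantities(G):
    """Maximum degree and largest singular value of the (raw) adjacency matrix."""
    Adj = nx.adjacency_matrix(G).toarray().astype(float)
    delta = int(Adj.sum(axis=1).max())   # maximum degree
    sigma1 = np.linalg.norm(Adj, 2)      # largest singular value
    return delta, sigma1

print(graph_quantities(nx.cycle_graph(100)))               # sparse: delta = 2, sigma1 = 2
print(graph_quantities(nx.random_regular_graph(6, 100)))   # denser: delta = 6, sigma1 = 6
```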

5. Numerical Validation and Performance Assessment

Empirical studies are conducted on synthetic graphs of varying topology (cycles, grids, random regular graphs, and graphs with bounded degree) and feature dimensionalities. Key observations include:

  • For both regression and classification, convergence is linear as predicted. AGD consistently requires fewer iterations to achieve a specified error threshold than vanilla GD, confirming theoretical acceleration.
  • The empirical success rate for exact recovery aligns with the predicted sample complexity: as the maximum degree δ or the feature dimension d increases, more samples are needed to recover W^* accurately.
  • In classification, the empirical distance to W^* decays as O(√(1/|Ω|)), in line with statistical theory and indicating that generalisation improves with sample size even if W^* is not a global minimizer of the (nonconvex) cross-entropy loss.

These findings show that the derived guarantees not only hold in theory but are also borne out empirically across a variety of graph structures and GNN tasks.

6. Implementation Considerations and Practical Trade-offs

Implementing the exact generalisation error guarantees involves several considerations:

  • Computational complexity: Tensor initialization requires constructing and decomposing high-order moment tensors, with computational cost depending on d and K. For moderate graph and feature sizes, algorithms such as those proposed in the referenced tensor decomposition literature (e.g., KCL15) are tractable.
  • Algorithm robustness: The AGD update (especially with a large momentum parameter) is sensitive to the conditioning of the local loss landscape; accurate tensor initialization is essential to remain within the strongly convex neighborhood of W^*.
  • Sample size: In practice, exact recovery is feasible only when the sample size |Ω| is large enough to dominate graph-induced dependencies (i.e., high δ or large σ_1(A) requires more data); otherwise convergence is restricted or the statistical error dominates.
  • Choice of nonlinearity: While the analysis accommodates nonsmooth activations (e.g., ReLU), further generalizations to deeper or more complex nonlinear architectures may require additional conditions or alternative initialization strategies.

A practical implementation of the reported algorithmic scheme in a modern machine learning framework would involve batch computation of statistics for tensor initialization, followed by AGD updates, potentially leveraging standard acceleration techniques.
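
A self-contained sketch of such a two-stage loop is given below. For brevity, the tensor-method initializer is replaced by a small random initialization (a stand-in, not the paper's procedure), and the gradient is the analytic one for the squared loss of the ReLU model used in the earlier sketch; all names and hyperparameter values are illustrative.

```python
import numpy as np

def train_gnn(X, A, y, batches, K, eta=0.05, beta=0.5, seed=0):
    """Two-stage sketch: initialize W, then run heavy-ball AGD on the squared loss
    of a one-hidden-layer ReLU GNN with mean pooling over K filters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Z = A @ X                                        # neighbour-aggregated features
    W = W_prev = 0.1 * rng.standard_normal((K, d))   # stand-in for tensor initialization
    for idx in batches:                              # fresh subsample per iteration
        H = Z[idx] @ W.T                             # pre-activations, shape (|Omega_t|, K)
        pred = np.maximum(H, 0.0).mean(axis=1)       # GNN output on the batch
        resid = pred - y[idx]
        # gradient of 0.5 * mean((y - g)^2) w.r.t. W, using ReLU'(h) = 1{h > 0}
        grad = ((resid[:, None] * (H > 0)) / K).T @ Z[idx] / len(idx)
        W, W_prev = W - eta * grad + beta * (W - W_prev), W
    return W
```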

7. Summary and Impact

This line of analysis provides the first theoretically precise and practically relevant characterisation of the exact generalisation error for one-hidden-layer GNNs in both regression and binary classification. The performance guarantees—linear convergence and explicit generalisation error as a function of graph and model parameters—are obtained using tensor-based initialization and accelerated optimization, with sample complexity explicitly tied to graph structure. Numerical verification supports the theoretical predictions, reinforcing the utility of the derived methods for real-world GNN learning tasks where rigorous generalizability is paramount. This framework closes a critical gap in the literature and provides actionable insights for algorithm and architecture design in graph-based learning systems.
