
Graph Isomorphism Networks (GINs)

Updated 24 November 2025
  • Graph Isomorphism Networks (GINs) are graph neural networks that use injective sum aggregation and MLP updates to match the 1-WL isomorphism test in distinguishing non-isomorphic graphs.
  • They operate under a message-passing framework with self-loop correction and sum pooling, ensuring unique multiset representations through injective mappings.
  • Empirical evaluations show GINs achieve state-of-the-art performance on graph and molecular classification tasks, while requiring careful hyperparameter tuning to mitigate overfitting.

Graph Isomorphism Networks (GINs) are a prominent class of message-passing graph neural networks designed to achieve maximal expressive power within the conventional neighborhood aggregation paradigm. By employing injective aggregation and update functions, GINs match the ability of the 1-dimensional Weisfeiler–Lehman (1-WL) isomorphism test to distinguish non-isomorphic graphs, overcoming key representational limitations found in earlier GNN variants. GINs have demonstrated strong empirical performance across diverse graph classification benchmarks, particularly in data-rich environments where their expressive capacity is fully leveraged (Xu et al., 2018, Sato, 2020, Kalian et al., 22 Jul 2025).

1. Architectural Definition and Mathematical Formalism

GINs operate within the “aggregate-and-combine” (message-passing) GNN framework. In a GIN, node features are updated recursively across $K$ layers: each node’s own (self-loop weighted) features are added to a simple sum over its neighbors’ features, and the result is passed through a multilayer perceptron (MLP). The per-layer update is given by

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Bigl((1+\varepsilon^{(k)})\,h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Bigr)$$

where $h_v^{(k)}$ is the feature vector of node $v$ at layer $k$, $\varepsilon^{(k)}$ is a scalar (either fixed or learnable), and $\mathcal{N}(v)$ denotes the neighbors of $v$ (Xu et al., 2018, Dablander, 20 Nov 2024, Kim et al., 2020).

Key roles:

  • Sum as Aggregator: The sum ensures that different multisets of input vectors result in different aggregate values, crucial for injectivity.
  • Self-loop Term $(1+\varepsilon^{(k)})\,h_v^{(k-1)}$: Allows the model to control the influence of self-information versus neighbors at each layer.
  • Expressive MLP: A sufficiently expressive MLP (at least two layers with ReLU) enables injective mappings after aggregation; a one-layer perceptron provably lacks this injectivity (Xu et al., 2018).

For a graph-level embedding suitable for classification, GIN applies a permutation-invariant pooling (sum or mean) at each layer and concatenates the pooled node features across all layers: $h_G = \mathrm{CONCAT}_{k=0,\dots,K}\,\mathrm{READOUT}\bigl(\{h_v^{(k)} \mid v \in G\}\bigr)$. Using sum pooling at each layer preserves the injectivity guarantee (Xu et al., 2018, Kim et al., 2020).
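To make the update and readout concrete, the following is a minimal PyTorch sketch rather than a reference implementation: it assumes a dense adjacency matrix, a single graph, and no batching, edge features, or normalization, and the names `GINLayer` and `gin_graph_embedding` are illustrative.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN layer: h_v <- MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)."""
    def __init__(self, in_dim, out_dim, train_eps=True):
        super().__init__()
        # At least two linear layers with a nonlinearity, as required for injectivity.
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))
        if train_eps:
            self.eps = nn.Parameter(torch.zeros(1))       # learnable epsilon (GIN-eps)
        else:
            self.register_buffer("eps", torch.zeros(1))   # fixed epsilon = 0 (GIN-0)

    def forward(self, adj, h):
        # adj: (N, N) dense adjacency without self-loops; h: (N, in_dim) node features.
        neighbor_sum = adj @ h                            # injective sum aggregation
        return self.mlp((1 + self.eps) * h + neighbor_sum)

def gin_graph_embedding(layers, adj, x):
    """Graph-level readout: sum-pool node features at every layer (including the
    raw input) and concatenate the pooled vectors across layers 0..K."""
    h, pooled = x, [x.sum(dim=0)]
    for layer in layers:
        h = layer(adj, h)
        pooled.append(h.sum(dim=0))
    return torch.cat(pooled, dim=-1)
```

Stacking, for example, five such layers with a hidden dimension of 64 corresponds to the typical configurations discussed in Section 3.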

2. Theoretical Expressiveness: Connection to the 1-WL Test

GIN is theoretically as powerful as the 1-WL graph isomorphism test. The 1-WL test distinguishes graphs by iteratively refining node labels: at each round, a node’s new label is a hash of its current label (“color”) together with the multiset of its neighbors’ colors. GIN reproduces this refinement at each layer by:

  • Employing an injective neighborhood aggregator (sum),
  • Augmenting with a (learnable) self-weight,
  • Applying an injective MLP.

The formal result is two-sided: any neighborhood-aggregation GNN is at most as discriminative as the 1-WL test, and a GNN whose aggregation, combination, and readout functions are all injective attains this upper bound; GIN is designed to be such an architecture (Xu et al., 2018, Sato, 2020). The construction rests on the universality of the DeepSets representation for functions on multisets: any function on finite multisets can be expressed as $\rho\bigl(\sum_{x\in X} \varphi(x)\bigr)$ for suitable $\varphi$ and $\rho$. In GIN, $\varphi$ is the identity and $\rho$ is the MLP (Sato, 2020).
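To make the 1-WL refinement itself concrete, here is a small self-contained Python sketch; the `wl_colors` function and the dictionary-based graph encoding are illustrative assumptions, not code from the cited papers.

```python
from collections import Counter

def wl_colors(adj, num_iters=3):
    """1-WL color refinement on a graph given as {node: set_of_neighbors}.
    Colors are nested tuples, so histograms are directly comparable across graphs."""
    colors = {v: () for v in adj}                     # uniform initial coloring
    for _ in range(num_iters):
        colors = {
            # New color = (own color, sorted multiset of neighbor colors), i.e. a
            # "hash" of exactly the information 1-WL is allowed to use.
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())                   # multiset of final node colors

# A path and a star on 4 nodes are non-isomorphic and are separated by 1-WL:
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(wl_colors(path) == wl_colors(star))             # False: different color histograms
```

GIN replaces the tuple “hash” with an injective sum followed by an MLP, which is why its distinguishing power coincides with this procedure.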

GINs thus strictly surpass earlier GNNs in distinguishing power: mean and max aggregation (as in GCN/GraphSAGE) are not injective for multisets and can collapse non-isomorphic structures into identical embeddings (Xu et al., 2018, Sato, 2020).
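A minimal numeric illustration of this failure mode (values chosen purely for illustration): mean and max cannot tell the neighbor multiset {1, 1} from {1, 1, 1}, whereas sum can.

```python
import torch

a = torch.tensor([[1.0], [1.0]])          # neighbor multiset {1, 1}
b = torch.tensor([[1.0], [1.0], [1.0]])   # neighbor multiset {1, 1, 1}

print(a.mean(dim=0), b.mean(dim=0))                 # tensor([1.]) tensor([1.])  -> collapsed
print(a.max(dim=0).values, b.max(dim=0).values)     # tensor([1.]) tensor([1.])  -> collapsed
print(a.sum(dim=0), b.sum(dim=0))                   # tensor([2.]) tensor([3.])  -> distinguished
```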

3. Algorithmic Details, Hyperparameters, and Training Sensitivities

A forward pass of GIN consists of initializing node features, then performing $K$ rounds of sum-aggregation and MLP updates. The graph-level readout is constructed via pooling and concatenation. The per-layer computational complexity is $O(|E|\,d + |V|\,d^2)$, where $d$ is the feature dimension (Xu et al., 2018).

Standard hyperparameters include:

  • Number of GIN layers: typically $K = 2$–$7$;
  • MLP depth: $2$–$5$ layers per GIN layer, hidden dimension $16$–$500$ (Kalian et al., 22 Jul 2025, Dablander, 20 Nov 2024);
  • Activation: ReLU or LeakyReLU (the latter yields small but consistent gains);
  • Aggregator: SUM is recommended for maximal expressivity; MEAN or MAX only when large node degrees make sum aggregation impractical;
  • Optimizer: Adagrad outperforms Adam across standard benchmarks, with a recommended learning rate around $0.01$ (Rahman, 2020);
  • Regularization: batch normalization and dropout stabilize training;
  • Embedding dimension: $32$–$128$ for typical applications.

Empirical studies show that deeper MLPs within each GIN layer yield more benefit than simply stacking more GIN layers (Rahman, 2020). Overfitting can occur with excess capacity in small-data regimes.

Table: Illustration of sensitivity to hyperparameters (Rahman, 2020)

| Component     | Default | Observed Best Practice          |
|---------------|---------|---------------------------------|
| Optimizer     | Adam    | Adagrad                         |
| Activation    | ReLU    | LeakyReLU                       |
| Aggregator    | SUM     | SUM                             |
| Learning rate | 0.01    | 0.01 (or 0.02 for bio datasets) |
| Emb. dim.     | 64      | 32–64                           |
| MLP depth     | 2       | 2–3                             |
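The following is a hedged sketch of a training loop consistent with the table above (Adagrad, learning rate 0.01, dropout before the classifier). It reuses the illustrative `GINLayer` and `gin_graph_embedding` sketches from Section 1; the toy one-graph “dataset”, the input dimension of 7, and the other specifics are assumptions made to keep the example runnable, not settings from the cited studies.

```python
import torch
import torch.nn as nn

# Toy synthetic "dataset": one 4-node path graph, random 7-dim node features, label 1.
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
graphs = [(adj, torch.randn(4, 7), torch.tensor(1))] * 10

hidden, num_layers, num_classes = 64, 5, 2
layers = nn.ModuleList(
    [GINLayer(7, hidden)] + [GINLayer(hidden, hidden) for _ in range(num_layers - 1)]
)
# Readout dim = input dim + hidden per layer, because layer-0 features are pooled too.
classifier = nn.Sequential(nn.Dropout(0.5), nn.Linear(7 + num_layers * hidden, num_classes))

params = list(layers.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adagrad(params, lr=0.01)      # Adagrad, lr 0.01 (Rahman, 2020)
loss_fn = nn.CrossEntropyLoss()

for adj_g, x, label in graphs:                        # one graph per step for simplicity
    logits = classifier(gin_graph_embedding(layers, adj_g, x))
    loss = loss_fn(logits.unsqueeze(0), label.view(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```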

4. Empirical Performance and Benchmarking

GINs achieve state-of-the-art or tied results on a wide range of graph classification datasets, both in bioinformatics (MUTAG, PROTEINS, NCI1, etc.) and social networks (IMDB-BINARY, REDDIT-BINARY, COLLAB, etc.) (Xu et al., 2018, Sato, 2020). On training data, both GIN-ε (learned self-weight) and GIN-0 (fixed $\varepsilon = 0$) fit nearly 100% of the labels. On test sets, accuracies are consistently at or above previous GNN baselines (e.g., IMDB-BINARY ≈75%, PROTEINS ≈76%, MUTAG ≈89%).

In cheminformatics and molecular property prediction, GINs have been shown to outperform GCNs and GATs on most data-abundant binary toxicological assays, achieving AUCs between $0.793$ and $0.849$ (averaged across folds) (Kalian et al., 22 Jul 2025). The advantage diminishes or reverses in extremely data-scarce regimes, where attention-based GATs sometimes generalize better.

For quantitative structure–activity relationship (QSAR) regression, GIN-based models were outperformed by classical ECFP-MLP baselines (mean absolute error $\approx 0.44$ vs. $0.42$ for GIN and ECFP, respectively) (Dablander, 20 Nov 2024). However, GINs outperform ECFPs in “activity cliff” detection, i.e., tasks specifically requiring sensitivity to subtle graph variations.

As an additional application, GINs have been successfully applied to large-scale resting-state fMRI functional connectivity analysis, producing competitive sex-classification accuracy (84.6%, versus 68–80% for prior non-GNN baselines) and yielding neuroscientifically interpretable saliency maps via tools adapted from CNN interpretability methods (Kim et al., 2020).

5. Unique Properties and Theoretical Distinctions

GINs’ use of sum aggregation distinguishes them as “injective multiset function approximators”—that is, for any finite multiset input, GIN’s aggregate is unique (up to the capacity of the MLP). This matches the theoretical discriminative ceiling of any message-passing GNN under the 1-WL framework (Xu et al., 2018, Sato, 2020).

  • MLP universality ensures that, under mild conditions, GIN can realize any injective function over node and neighborhood aggregates, provided the input feature space is sufficiently rich and the MLP is wide enough.
  • One-hot input encoding is recommended for maximal expressivity when input node features are discrete, as it ensures no two multisets sum to the same value (Kim et al., 2020).
  • Graph-level feature pooling: Concatenation (“Jumping Knowledge” aggregation) from all layers captures multi-radius features for input graphs (Xu et al., 2018).

A novel perspective notes that the GIN layer can be interpreted as a two-tap convolutional filter in graph space, where the adjacency matrix acts as a “shift” operator, in analogy with classical 1D CNNs (Kim et al., 2020).
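In matrix form (a standard algebraic rewriting of the layer update above, not an additional result from the cited work), with adjacency matrix $A$ and node-feature matrix $H^{(k-1)}$, the GIN layer reads

$$H^{(k)} = \mathrm{MLP}^{(k)}\Bigl(\bigl((1+\varepsilon^{(k)})\,I + A\bigr)\,H^{(k-1)}\Bigr),$$

so $(1+\varepsilon^{(k)})\,I + A$ acts as a length-two filter whose “shift” is multiplication by $A$, mirroring how a 1D convolution combines a signal with its shifted copy.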

6. Limitations and Open Directions

GINs, while maximally expressive within the neighborhood-aggregation class, cannot distinguish graphs that 1-WL fails to separate—such as certain regular graphs and the Cai–Fürer–Immerman (CFI) gadget. Extensions beyond 1-WL include higher-order GNNs and permutation-invariant/equivariant networks (Sato, 2020). Randomized features (rGIN) can probabilistically push expressivity slightly beyond 1-WL (Sato, 2020).
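As a concrete instance of this limitation, a 6-cycle and the disjoint union of two triangles are both 2-regular, so 1-WL assigns them identical color histograms and a GIN with uniform node features embeds them identically; the snippet below checks this with the illustrative `wl_colors` sketch from Section 2.

```python
# Both graphs are 2-regular on six nodes, so 1-WL (and hence GIN with uniform
# initial node features) cannot tell them apart, despite being non-isomorphic.
cycle_6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(wl_colors(cycle_6) == wl_colors(two_triangles))   # True: indistinguishable
```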

In practice, GINs require careful regularization in small-data regimes due to high parameter count in their MLPs, and are susceptible to over-smoothing if layers are stacked excessively without attention to regularization or architectural enhancements (e.g., Jumping Knowledge, adapted self-weights) (Xu et al., 2018, Rahman, 2020).

Current limitations and research frontiers include:

  • Tightness of MLP width/depth requirements for injectivity on unbounded or continuous domains;
  • Expressivity and robustness on noisy input features;
  • Efficient, scalable approximations to higher-order GNNs and their links to logic and algorithmic complexity (Sato, 2020);
  • The need for advanced pooling, self-supervised pre-training, and domain-specific feature embeddings for tasks such as molecular property prediction (Dablander, 20 Nov 2024).

7. Practical Guidelines and Comparative Insights

Empirical studies recommend the following for GIN implementations:

  • Use the Adagrad optimizer for better adaptation to irregular graph structures (Rahman, 2020).
  • Employ LeakyReLU activations to avoid dead units and preserve gradient flow.
  • Prefer SUM aggregation for tasks requiring fine-grained structural discrimination; MEAN/MAX should be considered only when the task or data scale necessitates it.
  • Invest in deeper MLP blocks per GIN layer rather than stacking excessive GIN layers; this is typically the more compute-efficient way to add capacity (see the sketch after this list).
  • Adapt hyperparameter search to the unique structure and scale of GINs, as their optima typically diverge substantially from mean/max GNNs (GCN/GAT) (Kalian et al., 22 Jul 2025).
  • In molecular applications, inclusion of chemically rational node and edge features, careful data splitting by substructure, and advanced pooling can significantly impact performance (Dablander, 20 Nov 2024, Kalian et al., 22 Jul 2025).
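As a sketch of how these guidelines might be combined in practice, the following assumes the PyTorch Geometric `GINConv` interface (with `train_eps=True` for a learnable self-weight) together with LeakyReLU MLPs, sum pooling, and an Adagrad optimizer; dataset loading, batching, and evaluation are omitted, and the specific dimensions are illustrative rather than taken from the cited studies.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, global_add_pool

class GINClassifier(nn.Module):
    """GIN following the practical guidelines: SUM aggregation, LeakyReLU MLPs,
    learnable epsilon, per-layer sum pooling, concatenated readout."""
    def __init__(self, in_dim=32, hidden=64, num_layers=5, num_classes=2):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        self.convs = nn.ModuleList([
            GINConv(
                nn.Sequential(
                    nn.Linear(dims[i], hidden), nn.LeakyReLU(),
                    nn.Linear(hidden, hidden), nn.LeakyReLU(),
                    nn.BatchNorm1d(hidden),
                ),
                train_eps=True,                       # learnable (1 + eps) self-weight
            )
            for i in range(num_layers)
        ])
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(hidden * num_layers, num_classes))

    def forward(self, x, edge_index, batch):
        pooled = []
        for conv in self.convs:
            x = conv(x, edge_index)
            pooled.append(global_add_pool(x, batch))  # per-layer sum readout
        return self.head(torch.cat(pooled, dim=-1))   # concatenate across layers

model = GINClassifier()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # per (Rahman, 2020)
```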

Table: Comparative AUC of GINs vs. Other GNNs in Molecular Assays (Kalian et al., 22 Jul 2025)

| Dataset              | GIN AUC (mean) | GAT AUC (mean) | GCN AUC (mean) |
|----------------------|----------------|----------------|----------------|
| ATG_PXRE_CIS         | 0.849          | ~0.83          | ~0.83          |
| LTEA_HepaRG_UGT1A1   | 0.842          | ~0.83          | ~0.82          |
| NVS_ENZ_hBACE (rare) | 0.784          | 0.829          | ~0.80          |

GINs require significantly different hyperparameter configurations than GCNs or GATs and offer higher expressivity at the cost of increased risk of overfitting in small data regimes.


Graph Isomorphism Networks thus represent the apex of discriminative power for message-passing GNNs. They leverage injective sum aggregation followed by expressive nonlinear updates to exactly match the 1-WL test’s ability to distinguish graphs, which translates into empirical advantages in graph and molecular classification tasks, especially when large labeled datasets are available. Open research continues into surpassing the 1-WL barrier, improving scalability and expressivity, and tailoring architecture choices to domain- and task-specific requirements (Xu et al., 2018, Sato, 2020, Dablander, 20 Nov 2024, Kalian et al., 22 Jul 2025, Rahman, 2020, Kim et al., 2020).
