Graph of Molecule Substructures (GoMS)

Updated 20 December 2025

Graph of Molecule Substructures (GoMS) is a representation that encodes chemically meaningful fragments as nodes and their interactions as edges.
It employs substructure extraction and advanced graph neural networks to maintain connectivity and enhance property prediction accuracy.
GoMS enables efficient similarity analysis, interpretable generative pathways, and scalable modeling for complex chemical systems.

A Graph of Molecule Substructures (GoMS) is a structured representation in which each node corresponds to a chemically meaningful substructure (cycle, functional group, motif, fragment, or cluster) extracted from the full molecular graph, and edges encode mutual relationships such as overlap, connectivity, spatial arrangement, or merge-compatibility. GoMS constructions have emerged as a high-impact formalism for molecular similarity, property prediction, generative modeling, and transfer learning. They enable scalable reasoning about molecular structure at the fragment or motif level while retaining the connectivity information that bag-of-substructures models lose (Qu et al., 13 Dec 2025).

1. Formal Definitions and Substructure Extraction

The molecular graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ comprises non-hydrogen atoms (vertices) and covalent bonds (edges). GoMS construction begins by mapping $\mathcal{G}$ to a set of induced subgraphs $S_\pi(\mathcal{G}) = \{s_1, \dots, s_k\}$, with each $s_i = (\mathcal{V}_i, \mathcal{E}_i)$ corresponding to a substructure, according to a chemically informed fragmentation rule $\pi$ (e.g., RECAP, BRICS, RGB, motif enumeration, cycle detection, functional-group extraction, frequent subgraph mining, or ring perception algorithms).

Nodes in GoMS represent these substructures. The edge set $\mathcal{E}_s$ is defined by explicit topological rules. For example, in (Qu et al., 13 Dec 2025), edges $(v_i,v_j)$ are added if $\mathcal{V}_i \cap \mathcal{V}_j \neq \varnothing$ (overlap) or if there exists $(u,v)\in\mathcal{E}$ with $u \in \mathcal{V}_i, v \in \mathcal{V}_j$ (bond-bridged). Other works use cycle overlap (Nouleho et al., 2018), expansion adjacency (Yesiltepe et al., 2021), motif dictionaries (Xu et al., 24 Oct 2025), ring-clique trees (Yamada et al., 2022), or merge-compatibility relations for generative assembly (Yamada et al., 2022).
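
The overlap and bond-bridged rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from any cited paper: the function name `build_goms_edges`, the edge-kind strings, and the toy atom indexing are assumptions introduced here.

```python
from itertools import combinations


def build_goms_edges(substructures, bonds):
    """Return GoMS edges as {(i, j): kind} with i < j.

    substructures: list of sets of atom indices (the vertex sets V_i).
    bonds: iterable of (u, v) atom-index pairs (the molecular edge set E).
    """
    bond_set = {frozenset(b) for b in bonds}
    edges = {}
    for i, j in combinations(range(len(substructures)), 2):
        vi, vj = substructures[i], substructures[j]
        if vi & vj:
            # Rule 1: the two substructures share at least one atom.
            edges[(i, j)] = "overlap"
        elif any(frozenset((u, v)) in bond_set for u in vi for v in vj):
            # Rule 2: some covalent bond links an atom of V_i to an atom of V_j.
            edges[(i, j)] = "bond-bridged"
    return edges


# Toy example (hypothetical indexing): a six-membered ring (atoms 0-5)
# with one substituent atom (6) bonded to ring atom 0.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
ring, substituent = {0, 1, 2, 3, 4, 5}, {6}
edges = build_goms_edges([ring, substituent], bonds)
print(edges)  # {(0, 1): 'bond-bridged'}
```

Checking overlap before bond bridging means that substructures sharing atoms are always labeled `overlap`, even if they are also connected by a bond; a real pipeline might keep both flags as separate edge features.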

Substructure extraction algorithms typically involve:

Cycle generators using Horton’s algorithm to efficiently enumerate elementary cycles up to length $j$ (Nouleho et al., 2018)
Graph traversal-based enumeration (BFS/DFS) with canonicalization, as in SPECTRe (Yesiltepe et al., 2021)
Motif extraction via partitioning on bridge bonds and ring detection (Xu et al., 24 Oct 2025)
Frequent subgraph mining with support thresholds using gSpan on junction tree–decomposed graphs (Yamada et al., 2022)
Functional group identification and “hyper-atom” aggregation (Lukashina et al., 2020)
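
As one concrete instance of the extraction strategies listed above, partitioning on bridge bonds can be sketched with a standard low-link depth-first search. The sketch below is a simplification under stated assumptions: it cuts every bridge bond, so ring systems stay intact and all acyclic atoms become singletons, whereas the cited motif extractors apply additional chemical rules. The function names and the toy molecule are hypothetical.

```python
from collections import defaultdict


def find_bridges(n_atoms, bonds):
    """Return the bridge bonds, i.e. bonds that lie on no ring."""
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(bonds):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    disc = [-1] * n_atoms   # DFS discovery time per atom
    low = [0] * n_atoms     # lowest discovery time reachable from the subtree
    bridges = set()
    timer = 0

    def dfs(u, parent_edge):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        for v, idx in adj[u]:
            if idx == parent_edge:
                continue
            if disc[v] == -1:
                dfs(v, idx)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:  # no back edge bypasses this bond
                    bridges.add(frozenset(bonds[idx]))
            else:
                low[u] = min(low[u], disc[v])

    for start in range(n_atoms):
        if disc[start] == -1:
            dfs(start, -1)
    return bridges


def partition_on_bridges(n_atoms, bonds):
    """Cut all bridge bonds and return the remaining connected atom sets."""
    bridges = find_bridges(n_atoms, bonds)
    adj = defaultdict(set)
    for u, v in bonds:
        if frozenset((u, v)) not in bridges:
            adj[u].add(v)
            adj[v].add(u)
    seen, parts = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        parts.append(comp)
    return parts


# Toy molecule (hypothetical indexing): ring 0-5 with a two-atom chain 6-7.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7)]
parts = partition_on_bridges(8, bonds)
print(parts)  # [{0, 1, 2, 3, 4, 5}, {6}, {7}]
```

The recursive search is adequate for molecule-sized graphs; very large systems would call for an iterative variant to avoid Python's recursion limit.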

2. Node and Edge Feature Construction

Nodes in GoMS bear feature encodings derived from the underlying substructures:

Atom/bond-type frequency vectors, summing or averaging over constituent elements (Qu et al., 13 Dec 2025, Lukashina et al., 2020)
$E(n)$-equivariant GNN embeddings for 3D-aware representation (Qu et al., 13 Dec 2025)
Ring type, aromaticity, charge, external valence, sum of atomic masses, and one-hot motif identifiers (Lukashina et al., 2020, Xu et al., 24 Oct 2025)
Cycle length, number of shared bonds for cycle overlaps (Nouleho et al., 2018)
Canonical SMILES for uniqueness and subsequent hashing (Yesiltepe et al., 2021)
Message-passing–based encodings for cluster and atom nodes in multi-resolution hierarchies (Jin et al., 2019, Jin et al., 2018)
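
The simplest of these encodings, the atom/bond-type frequency vector, might look as follows. The vocabularies `ATOM_VOCAB` and `BOND_VOCAB`, the function name, and the input layout are illustrative assumptions; the cited works define their own feature sets.

```python
from collections import Counter

# Hypothetical vocabularies; a real system would cover more elements and bond types.
ATOM_VOCAB = ["C", "N", "O"]
BOND_VOCAB = ["single", "double", "aromatic"]


def substructure_features(atom_ids, atom_symbols, bond_types, average=False):
    """Count atom and bond types inside one substructure.

    atom_ids: iterable of atom indices in the substructure.
    atom_symbols: dict mapping atom index -> element symbol.
    bond_types: dict mapping (u, v) -> bond-type string for the whole molecule.
    average: if True, return fractions instead of raw counts.
    """
    atom_ids = set(atom_ids)
    atom_counts = Counter(atom_symbols[a] for a in atom_ids)
    # Only bonds with both endpoints inside the substructure are counted.
    bond_counts = Counter(
        t for pair, t in bond_types.items() if set(pair) <= atom_ids
    )
    atoms = [float(atom_counts[t]) for t in ATOM_VOCAB]
    bonds = [float(bond_counts[t]) for t in BOND_VOCAB]
    if average:
        n_atoms = max(len(atom_ids), 1)
        n_bonds = max(sum(bond_counts.values()), 1)
        atoms = [x / n_atoms for x in atoms]
        bonds = [x / n_bonds for x in bonds]
    return atoms + bonds


# Toy fragment: a carbon (0) double-bonded to one oxygen (1) and single-bonded
# to another (2); atom 3 lies outside the fragment.
atom_symbols = {0: "C", 1: "O", 2: "O", 3: "C"}
bond_types = {(0, 1): "double", (0, 2): "single", (0, 3): "single"}
feats = substructure_features({0, 1, 2}, atom_symbols, bond_types)
avg_feats = substructure_features({0, 1, 2}, atom_symbols, bond_types, average=True)
print(feats)  # [1.0, 0.0, 2.0, 1.0, 1.0, 0.0]
```

Summed counts preserve fragment size, while averaged counts make fragments of different sizes comparable; which is preferable depends on the downstream model.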

Edge features encode interactions between substructures:

Overlap ratios $|\mathcal{V}_i \cap \mathcal{V}_j| / \min(|\mathcal{V}_i|,|\mathcal{V}_j|)$ (Qu et al., 13 Dec 2025)
Chemistry-based similarity metrics, notably Tanimoto similarity of ECFP4 fingerprints (Qu et al., 13 Dec 2025)
Geometry, including centroid distance (RBF expansion), inter-motif orientation, dihedral angles (Qu et al., 13 Dec 2025)
Expansion/Contraction adjacency flags (subset/superset relations) (Yesiltepe et al., 2021)
Merge compatibility for node and edge overlay (used for generative assembly) (Yamada et al., 2022)
Motif-molecule links in global context graphs (Xu et al., 24 Oct 2025)
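
Three of the edge features above (overlap ratio, Tanimoto similarity, and an RBF-expanded centroid distance) reduce to short formulas. The sketch below assumes fingerprints are already available as sets of on-bit indices, since computing ECFP4 itself requires a cheminformatics toolkit; the function names, the RBF centers, and the `gamma` width are hypothetical choices.

```python
import math


def overlap_ratio(vi, vj):
    """|V_i ∩ V_j| / min(|V_i|, |V_j|) for two atom-index sets."""
    return len(vi & vj) / min(len(vi), len(vj))


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def centroid(coords):
    """Mean position of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def rbf_expand(distance, centers, gamma=1.0):
    """Gaussian radial-basis expansion of a scalar distance."""
    return [math.exp(-gamma * (distance - c) ** 2) for c in centers]


# Toy values (all hypothetical).
ratio = overlap_ratio({0, 1, 2, 3}, {3, 4})            # one shared atom out of two
sim = tanimoto({1, 2, 3}, {2, 3, 4})                   # two shared bits out of four
dist = math.dist(centroid([(0, 0, 0), (2, 0, 0)]), centroid([(4, 0, 0)]))
rbf = rbf_expand(dist, centers=[1.0, 3.0, 5.0])
print(ratio, sim, dist)  # 0.5 0.5 3.0
```

Normalizing the overlap by the smaller substructure means that a fragment fully contained in another receives a ratio of 1 regardless of how large the containing fragment is.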

3. Graph Neural Architectures for GoMS

GoMS representations feed advanced GNN architectures:

Graph Transformers over the motif subgraph, employing multi-headed attention on node and relational features (Qu et al., 13 Dec 2025)
$E(n)$-equivariant GNNs to handle 3D coordinates during substructure encoding (Qu et al., 13 Dec 2025)
Standard MPNN and D-MPNN frameworks, with concatenated substructure and atom-level encodings (Lukashina et al., 2020)
Tri-partite heterogeneous context graphs for few-shot learning, linking motif, molecule, and property nodes, followed by structure-aware normalization and local-global encoding (Xu et al., 24 Oct 2025)
Hierarchical message-passing in multi-layer graphs with atom, attachment, and substructure nodes (Jin et al., 2019)
Junction tree encoders/decoders for coarse-to-fine generative modeling, enforcing chemical validity via multilevel scaffolding (Jin et al., 2018)

Readout strategies encompass graph-level pooling of substructure embeddings, fusion of global and local contexts (Xu et al., 24 Oct 2025), or downstream property prediction via fully connected layers.
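
To make the attention-plus-readout pattern concrete, the following sketch runs one round of dot-product attention restricted to GoMS neighbours and then mean-pools the result. It is deliberately stripped down and is not any cited architecture: it uses a single head, identity query/key/value projections instead of learned weights, and ignores edge features. All names are hypothetical.

```python
import math


def attention_layer(h, edges):
    """One attention round over GoMS neighbours (each node also attends to itself).

    h: list of node embeddings (lists of floats, equal length).
    edges: iterable of undirected (i, j) node-index pairs.
    """
    n, d = len(h), len(h[0])
    nbrs = {i: {i} for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    out = []
    for i in range(n):
        idx = sorted(nbrs[i])
        # Scaled dot-product scores against each neighbour.
        scores = [
            sum(a * b for a, b in zip(h[i], h[j])) / math.sqrt(d) for j in idx
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append(
            [sum(wk * h[j][c] for wk, j in zip(w, idx)) / z for c in range(d)]
        )
    return out


def mean_pool(h):
    """Graph-level readout: average the node embeddings."""
    n = len(h)
    return [sum(row[c] for row in h) / n for c in range(len(h[0]))]


# Three substructure nodes on a path 0 - 1 - 2 with toy 2-d embeddings.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = attention_layer(h, edges=[(0, 1), (1, 2)])
graph_embedding = mean_pool(updated)
```

The resulting `graph_embedding` would then be passed to fully connected layers for property prediction. Because each updated embedding is a convex combination of neighbour embeddings, every coordinate stays within the range of the inputs.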

Table: Comparison of Representative GoMS Algorithms

Paper	Node Definition	Edge Rule	Main Graph Model
(Qu et al., 13 Dec 2025)	Chemically-extracted fragments	Overlap/bond-bridge	Graph Transformer
(Nouleho et al., 2018)	Elementary cycles (length $\leq j$)	Shared vertex/isthmus	MCES for similarity
(Yesiltepe et al., 2021)	SMILES fragments (BFS/DFS)	Growth/overlap lattice	Fragment lattice
(Yamada et al., 2022)	Frequent subgraphs (gSpan)	Merge compatibility	RL-guided reassembly
(Lukashina et al., 2020)	Functional groups/hyper-atoms	N/A (no edges)	D-MPNN + FFNN
(Xu et al., 24 Oct 2025)	Top-K motifs from corpus	Molecule-motif links	Global-local GNN

4. Theoretical Properties and Discriminative Guarantees

GoMS architectures address a central limitation of bag-of-fragment models by preserving arrangement: two molecules with the same multiset of substructures but different connectivity yield non-isomorphic GoMS graphs (see Theorem 3.1.1 in (Qu et al., 13 Dec 2025)). This injectivity is attributed to multi-view edge features and hierarchical consistency with substructure overlap thresholds.
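
A toy example illustrates the difference between a bag of fragments and a graph of fragments. The fragment labels and the two arrangements below are hypothetical, and the labeled-edge multiset used here is only a simple invariant (if it differs, the labeled graphs cannot be isomorphic); it is not the construction used in the cited theorem.

```python
from collections import Counter


def bag_signature(labels):
    """Multiset of fragment labels: all that a bag-of-fragments model sees."""
    return Counter(labels)


def edge_signature(labels, edges):
    """Multiset of unordered label pairs across GoMS edges."""
    return Counter(tuple(sorted((labels[i], labels[j]))) for i, j in edges)


# Two hypothetical molecules built from the same three fragments.
labels = ["phenyl", "carbonyl", "amine"]
edges_a = [(0, 1), (1, 2)]   # phenyl - carbonyl - amine
edges_b = [(0, 2), (2, 1)]   # phenyl - amine - carbonyl

same_bag = bag_signature(labels) == bag_signature(labels)
same_graph_invariant = edge_signature(labels, edges_a) == edge_signature(labels, edges_b)
print(same_bag, same_graph_invariant)  # True False
```

Both molecules have identical fragment multisets, yet only the first contains a phenyl-carbonyl edge, so any model that reads the edge structure can tell them apart.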

Hierarchical GoMS decompositions (junction-tree frameworks (Jin et al., 2018, Jin et al., 2019)) guarantee strict chemical validity in generative processes, as every coarse node explicitly corresponds to a chemically sound motif or cycle, with attachment rules enforced at decode time. Arrangement-aware models can distinguish molecules with identical subgraph content but distinct functional group placements, which is essential for accurate materials and drug property modeling (Qu et al., 13 Dec 2025).

5. Experimental Benchmarks and Empirical Outcomes

The cited studies report gains in computational tractability and accuracy for GoMS models across diverse chemical datasets:

On ChEBI (90K molecules), similarity via MCES on GoMS cycle graphs completed all comparisons in $\ll 1$ s, compared to 20–40 s for full molecular graph MCES; GoMS robustly identified isomers and analogues, outperforming baseline atom-level similarity (Nouleho et al., 2018).
For property prediction in large molecules (OLEDs, 100–500 atoms), the GoMS Graph Transformer achieved MAE = 0.25 eV for $S_1$, an improvement of more than 60% over previous ESAN and GIN methods. The performance gap increases with molecule size (Qu et al., 13 Dec 2025).
SPECTRe, by exhaustively enumerating substructures (up to $10^5$ per molecule) and forming lattice graphs, supports virtual screening and complexity analysis for molecules up to 26 heavy atoms (Yesiltepe et al., 2021).
GoMS-based multitask networks yield state-of-the-art RMSE for logP/logD prediction tasks, with pronounced gains on symmetric molecule subsets due to functional group embeddings (Lukashina et al., 2020).
GoMS reinforcement-learning assembly over frequent subgraph libraries produces 100% validity, near-perfect uniqueness/novelty, and strong property optimization scores in molecular generation (Yamada et al., 2022).
Global-local motif-context graphs enable few-shot learning on molecular property tasks, transferring motif knowledge across molecule-property pairs for enhanced generalization (Xu et al., 24 Oct 2025).

6. Interpretability, Scalability, and Applications

GoMS builds interpretable representations by explicitly relating chemical subunits, supporting retrosynthetic analysis (Nouleho et al., 2018), structure-based similarity search (Yesiltepe et al., 2021), modular generative assembly (Yamada et al., 2022), transfer learning (Xu et al., 24 Oct 2025), multitask regression (Lukashina et al., 2020), and hierarchical molecular design (Jin et al., 2018, Jin et al., 2019).

Scalability is achieved by reducing the number of nodes in the coarse graph relative to atomic graphs, exploiting chemical knowledge for motif selection, enumerating substructures efficiently, and applying arrangement-aware pooling. GoMS outperforms prior subgraph-bag approaches such as ESAN, particularly on industrial molecules, where the spatial organization of functional groups governs emergent properties (Qu et al., 13 Dec 2025).

Chemically valid generative pathways in GoMS frameworks avoid nonphysical intermediates via fragment-wise masking, motif preservation, score-based assembly, and multi-resolution validation, as seen in RL, energy-based, and junction-tree approaches (Yamada et al., 2022, Hataya et al., 2021, Jin et al., 2018).
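
Fragment-wise masking can be illustrated with a masked softmax over candidate attachments. This is a generic action-masking sketch, not the procedure of any cited method: the validity rule (a candidate needs no more free valence than the attachment site offers), the candidate list, and the scores are all assumptions made for illustration.

```python
import math


def valid_mask(free_valence, candidates):
    """Mark candidates whose required valence fits the attachment site.

    candidates: list of (fragment_name, required_valence) pairs.
    """
    return [required <= free_valence for _, required in candidates]


def masked_softmax(scores, mask):
    """Softmax that assigns exactly zero probability to masked-out entries."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, mask)]
    m = max(masked)
    if m == -math.inf:
        raise ValueError("no valid candidate fragment")
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical candidates with the bond order each needs at the site.
candidates = [("methyl", 1), ("carbonyl", 2), ("nitrile", 3)]
scores = [0.0, 0.0, 5.0]  # e.g. outputs of a learned scoring model

mask = valid_mask(free_valence=2, candidates=candidates)
probs = masked_softmax(scores, mask)
print(mask)   # [True, True, False]
print(probs)  # [0.5, 0.5, 0.0]
```

Even though the third candidate has the highest raw score, it can never be sampled, so every intermediate structure satisfies the validity rule by construction.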

7. Outlook and Open Challenges

The use of GoMS is well established in quantum property prediction, generative chemistry, few-shot learning, complexity analysis, and similarity metrics. Current research is extending GoMS toward:

Enhanced hierarchical design to capture polydisperse or crosslinked architectures.
Integration of 3D spatial features, dihedral angles, and long-range inter-motif interactions (Qu et al., 13 Dec 2025).
Application to materials science, polymers, and structure–function inference for complex industrial compounds.
Improved motif enumeration and functional group extraction, leveraging large-scale chemical corpora (Xu et al., 24 Oct 2025).
Theoretical foundation for isomorphism and hierarchical embedding stability across multi-level GoMS decompositions.

A plausible implication is that continued development of arrangement-preserving, multi-view, and hierarchical GoMS frameworks will further advance interpretable, scalable modeling of chemical and material systems, to the benefit of synthesis planning, virtual screening, and rational molecule design.