
ALPHAGMUT: A Rationale-Guided Alpha Shape Graph Neural Network to Evaluate Mutation Effects (2406.09159v1)

Published 13 Jun 2024 in q-bio.QM, cs.AI, cs.CG, and q-bio.GN

Abstract: In silico methods evaluating the mutation effects of missense mutations are providing an important approach for understanding mutations in personal genomes and identifying disease-relevant biomarkers. However, existing methods, including deep learning methods, heavily rely on sequence-aware information, and do not fully leverage the potential of available 3D structural information. In addition, these methods may exhibit an inability to predict mutations in domains difficult to formulate sequence-based embeddings. In this study, we introduce a novel rationale-guided graph neural network AlphaGMut to evaluate mutation effects and to distinguish pathogenic mutations from neutral mutations. We compute the alpha shapes of protein structures to obtain atomic-resolution edge connectivities and map them to an accurate residue-level graph representation. We then compute structural-, topological-, biophysical-, and sequence properties of the mutation sites, which are assigned as node attributes in the graph. These node attributes could effectively guide the graph neural network to learn the difference between pathogenic and neutral mutations using k-hop message passing with a short training period. We demonstrate that AlphaGMut outperforms state-of-the-art methods, including DeepMind's AlphaMissense, in many performance metrics. In addition, AlphaGMut has the advantage of performing well in alignment-free settings, which provides broader prediction coverage and better generalization compared to current methods requiring deep sequence-aware information.

Summary

  • The paper introduces ALPHAGMUT, a novel rationale-guided Graph Neural Network that evaluates missense mutation effects by distinguishing pathogenic from neutral mutations using 3D protein structure information.
  • ALPHAGMUT constructs a residue-level graph from 3D protein structures using alpha shapes and incorporates comprehensive structural, topological, biophysical, and sequence features as rationale-informed node attributes.
  • Results show ALPHAGMUT achieves state-of-the-art performance and superior generalization compared to existing methods for predicting mutation effects across a wide range of human proteins, even in MSA-free settings.

The paper introduces ALPHAGMUT, a novel rationale-guided Graph Neural Network (GNN) designed to evaluate the effects of missense mutations by distinguishing pathogenic mutations from neutral mutations. The method addresses limitations in existing approaches that heavily rely on sequence-aware information and may struggle with mutations in domains lacking robust sequence-based embeddings.

ALPHAGMUT constructs a residue-level graph representation of a 3D protein structure derived from atomic-resolution connections using alpha shapes. The method then assigns structural, topological, biophysical, and sequence properties of mutation sites as node attributes within the graph. The GNN leverages k-hop message passing to capture the impact of neighboring residues on the mutation site, thus predicting mutation effects.

Here's a breakdown of the key components and innovations:

  • Alpha-Shape-Based Graph Model: The method generates an accurate residue-level graph representation of the 3D protein structure, mapped from the atomic-resolution connection graph.
  • Rationale-Informed Node Attributes: Structural, topological, biophysical, and sequence-aware features are computed to reflect biological rationales of mutations and their spatially connected wild-type residues, which are then assigned as node attributes.
  • k-hop Message Passing: A graph neural network with k-hop message passing accounts for neighborhood impact and the mutation site simultaneously for predicting mutation effects.
  • Filtration Steps: Multiple filtration steps are implemented to obtain likely cancer passenger mutations from those observed in real-world cancer patients, providing a better ground truth of passenger mutations with functionally neutral effects.

Methods

The ALPHAGMUT method involves several key steps:

  1. Generating Alpha-Shape Graph:
    • The method computes alpha shapes of the protein structure at an alpha value of 1.4 Å.
    • It maps these shapes to a residue-level graph based on the connection information of atoms located among pairwise residues.
    • The edge and node information of the graph is represented as:

      $G = (V, E), \quad V \in R^n, \quad E \in R^m$

      where:

      • $G$ is the graph
      • $V$ is the set of nodes (residues)
      • $n$ is the number of nodes
      • $E$ is the set of edges (residue-level connections)
      • $m$ is the number of edges, with $m \leq 2 \cdot n \cdot (n-1)$

    • A Breadth-First Search (BFS) algorithm defines the Higher-Order Spatial Unit (HOSU) of the mutation site, capturing residues with layered spatial proximity.
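The layered neighborhood described above can be illustrated with a small breadth-first search. This is a minimal sketch rather than the authors' implementation: the function name `hosu_layers`, the dictionary-based adjacency format, the toy residue identifiers, and the choice of three layers (taken from the three-layer contact profiles described in this summary) are all assumptions made for illustration.

```python
from collections import deque

def hosu_layers(adjacency, site, max_layer=3):
    """Group residues by BFS distance (layer) from the mutation site.

    adjacency: dict mapping residue id -> iterable of neighboring residue ids
    Returns {layer: sorted list of residues}; layer 0 is the site itself.
    """
    depth = {site: 0}
    queue = deque([site])
    while queue:
        current = queue.popleft()
        if depth[current] == max_layer:
            continue  # do not expand beyond the outermost layer
        for neighbor in adjacency.get(current, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[current] + 1
                queue.append(neighbor)
    layers = {}
    for residue, d in depth.items():
        layers.setdefault(d, []).append(residue)
    return {d: sorted(members) for d, members in sorted(layers.items())}

# Hypothetical toy residue graph; edges stand in for alpha-shape contacts.
toy_graph = {
    "A10": ["A11", "A45"],
    "A11": ["A10", "A12"],
    "A12": ["A11", "A13"],
    "A13": ["A12", "A14"],
    "A14": ["A13"],
    "A45": ["A10", "A46"],
    "A46": ["A45"],
}
layers = hosu_layers(toy_graph, "A10")
```

In this toy graph, residue `A14` lies four hops from the site and is therefore left out of the three-layer unit.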

  2. Computing Biological Rationale as Graph Attributes:
    • Features indicating atomic interactions, layered residue contact profiles, local geometric shapes, and biophysical properties are computed from the 3D structure.
    • The alpha shapes of the atomic graph provide detailed atomic interactions between the mutation site and its first-layer residues. The counts of each of the 16 types of atomic interactions with donor/acceptor effects are computed as $\{N_{CC}, N_{CN}, N_{CO}, N_{CS}, N_{NC}, N_{NN}, N_{NO}, N_{NS}, N_{OC}, N_{ON}, N_{OO}, N_{OS}, N_{SC}, N_{SN}, N_{SO}, N_{SS}\}$, based on atomic contacts between heavy atoms: carbon (C), nitrogen (N), oxygen (O), and sulfur (S).
    • Solvent accessible surface area, salt bridge, and local geometric shape (buried, pocket, surface) of each residue are also computed.
    • Three layered residue contact profiles based on the newly built HOSU are retrieved, represented as $\{N_{Ala,L1}, N_{Ala,L2}, N_{Ala,L3}, N_{Arg,L1}, N_{Arg,L2}, N_{Arg,L3}, \ldots, N_{Val,L1}, N_{Val,L2}, N_{Val,L3}\}$, totaling 60 features (20 amino acid types × 3 layers).
    • Static biophysical changes upon mutation are computed, including changes in amino acid polarity, changes in side-chain carbon, oxygen, nitrogen, and sulfur atoms, and changes in backbone carbon.
    • The BLOSUM62 substitution score serves as the MSA-free sequence-aware feature.
    • The features are vectorized and concatenated as the node attribute, obtaining the full information of the graph:

      $G = (V, E, H), \quad V \in R^n, \quad E \in R^m, \quad H \in R^{n \times q}$

      where:

      • $H$ is the feature matrix of node attributes
      • $q = 89$ for the MSA-informed model
      • $q = 86$ for the MSA-free model
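To make the 60-feature layered contact profile concrete, the sketch below counts amino acid types in each of three layers around a mutation site and flattens the counts into one vector. It is an illustration under stated assumptions, not the paper's code: the function name `contact_profile`, the input dictionaries, and the example residues are hypothetical, and the feature order (Ala-L1, Ala-L2, Ala-L3, Arg-L1, ...) simply follows the listing in this summary.

```python
AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]

def contact_profile(layers, residue_type, n_layers=3):
    """Count amino acid types per layer; return a flat 20 * n_layers vector.

    layers: {layer index: list of residue ids}; layer 0 is the mutation site
    residue_type: {residue id: three-letter amino acid code}
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    profile = [0] * (len(AMINO_ACIDS) * n_layers)
    for layer in range(1, n_layers + 1):  # layer 0 (the site) is not counted
        for residue in layers.get(layer, []):
            aa = residue_type[residue]
            profile[index[aa] * n_layers + (layer - 1)] += 1
    return profile

# Hypothetical layers and residue types, for illustration only.
example_layers = {0: ["A10"], 1: ["A11", "A45"], 2: ["A12", "A46"], 3: ["A13"]}
example_types = {
    "A10": "Gly", "A11": "Ala", "A45": "Val",
    "A12": "Ala", "A46": "Leu", "A13": "Arg",
}
profile = contact_profile(example_layers, example_types)
```

A vector like this would be concatenated with the other feature groups (atomic interaction counts, geometric and biophysical descriptors, BLOSUM62 score) to form one row of the feature matrix; the exact composition of the 86 or 89 columns is not spelled out in this summary.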
  3. Graph Neural Network:
    • A graph convolution network learns the patterns of mutation effects. Message passing updates the latent representation of feature matrix H, incorporating the neighboring impact of spatially contacted nodes and the mutation node simultaneously.
    • The update rule is defined as:

      $h_i^{(k+1)} = f^{(k+1)} \left( W^{(k+1)} \cdot \sum_{v_j \in N(v_i)} \frac{h_j^{(k)}}{\sqrt{d_j \cdot d_i}} + B^{(k+1)} \cdot h_i^{(k)} \right)$

      where:

      • $N(v_i)$ is the set of neighboring nodes connected to the target (point mutation) node $i$
      • $v_i$ is the target node
      • $v_j$ is the neighboring node
      • $d_i$ is the number of edges incident to the target node $i$ (its degree)
      • $d_j$ is the number of edges incident to the neighboring node $j$ (its degree)
      • $f^{(k)}$, $W^{(k)}$, $B^{(k)}$ are learnable parameters
      • $h_{v_i}$ is the raw input feature vector of target node $i$

    • The model uses $k = 4$ convolution layers, capturing information from nodes up to the fourth layer of neighbors. Tanh is the activation function, and Adam is the optimizer, with a learning rate of 0.0001 and a weight decay of 0.0005.
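The update rule above can be written out in a few lines of NumPy. This is a hedged sketch of the forward pass only: the function name `gcn_layer`, the hidden width of 16, the toy graph, and the randomly initialized weights are assumptions for illustration. Training with Adam and the stated learning rate and weight decay is not shown.

```python
import numpy as np

def gcn_layer(H, A, W, B, activation=np.tanh):
    """One step of h_i' = f(W . sum_j h_j / sqrt(d_j d_i) + B . h_i).

    H: (n, q_in) node features
    A: (n, n) symmetric 0/1 adjacency matrix without self-loops
    W, B: (q_out, q_in) weights for the neighbor term and the self term
    """
    degree = A.sum(axis=1)
    # Isolated nodes (degree 0) get a zero scale instead of a division by zero.
    inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(np.maximum(degree, 1)), 0.0)
    A_norm = A * inv_sqrt[:, None] * inv_sqrt[None, :]  # A_ij / sqrt(d_i d_j)
    neighbor_sum = A_norm @ H
    return activation(neighbor_sum @ W.T + H @ B.T)

rng = np.random.default_rng(0)
n, q, hidden = 5, 86, 16  # q = 86 is the MSA-free feature size; hidden is assumed
A = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 3), (3, 4)]:  # hypothetical toy graph
    A[i, j] = A[j, i] = 1.0
H = rng.normal(size=(n, q))
dims = [q, hidden, hidden, hidden, hidden]  # k = 4 stacked layers
for k in range(4):
    W = rng.normal(scale=0.1, size=(dims[k + 1], dims[k]))  # untrained weights
    B = rng.normal(scale=0.1, size=(dims[k + 1], dims[k]))
    H = gcn_layer(H, A, W, B)
```

Keeping a separate matrix $B$ for the node's own features, instead of adding self-loops to the adjacency matrix, lets the mutation site and its neighborhood be weighted independently, which matches the form of the equation given here.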

Results

The performance of ALPHAGMUT was compared with state-of-the-art methods, including ALPHAMISSENSE, EVE, GMVP, and POLYPHEN-2. The ALPHAGMUT MSA-informed model achieved the best overall performance, as measured by AUROC, AUPRC, MCC, F1 score, specificity, accuracy, and precision. The ALPHAGMUT MSA-free model provided comparable performance, suggesting that non-MSA features offer excellent distinguishing power.

The method demonstrates better generalization in a mutation-saturation experiment covering 2,267 human proteins with high-quality, full-length PDB structures. ALPHAGMUT MSA-free can predict mutation effects for 2,266 proteins with all mutation patterns, while EVE has much lower prediction coverage, providing predictions for only 568 proteins.

The paper concludes that ALPHAGMUT introduces a novel alpha-shape graph that accurately represents residue connections in 3D structures, and that its rationale-driven features, used as graph attributes, efficiently guide the GNN to learn the functional impacts of missense mutations and achieve favorable performance. The method also performs well in MSA-free settings and trains in a short period on accessible computing hardware, leading to better performance, interpretability, and generalization.