SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
Published 19 Apr 2026 in cs.LG and cs.AI | (2604.17324v1)
Abstract: Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in LLMs: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in LLMs, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets new state-of-the-art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets ($p < 0.05$). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4-16 layers), increases attention entropy, and stabilizes training across a $10\times$ learning rate range, with about 1% parameter overhead on OGB.
The paper introduces SigGate-GT, which applies per-head, element-wise sigmoid gating after softmax attention to suppress uninformative outputs and mitigate over-smoothing.
The gating mechanism increases effective output rank and stabilizes gradient flow, yielding statistically significant improvements on molecular and peptide graph benchmarks.
Experimental results show that SigGate-GT outperforms methods like GraphGPS and DropEdge with minimal parameter overhead and enhanced training robustness.
Summary of Contributions and Motivation
The paper introduces SigGate-GT, a Graph Transformer (GT) architecture that addresses critical deficiencies of deep graph transformers: over-smoothing, attention entropy collapse, and training instability. Rooted in the observation that softmax attention’s sum-to-one constraint—well-known to induce attention sinks in LLMs—similarly harms graph transformers by mandating that every node attends somewhere even when no semantically meaningful target exists, the work proposes a mechanism for selective suppression of attention outputs. The SigGate-GT module implements per-head, element-wise sigmoid gating directly after the softmax attention and value aggregation (SDPA), allowing attention heads to mute uninformative connections. This innovation, originally validated in the LLM context, is transplanted to GTs within the GraphGPS backbone.
Technical Approach
SigGate-GT extends the standard GPS layer by inserting a sigmoid-gated attention module. Specifically, for each attention head $k$, after computing
$$\mathrm{head}_k = \mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V,$$
the model applies an element-wise, learned gating vector
$$g_k = \sigma\!\left(H W_k^{g} + b_k^{g}\right),$$
with $H$ the node embeddings, $W_k^{g}, b_k^{g}$ the gate projection and bias, and $\sigma(\cdot)$ the sigmoid function. The output per head thus becomes
$$\mathrm{head}_k^{\mathrm{gated}} = \mathrm{head}_k \odot g_k.$$
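A minimal PyTorch sketch of one gated head under the definitions above (module and variable names here are illustrative, not taken from the paper's released code):

```python
import math
import torch
import torch.nn as nn

class SigmoidGatedHead(nn.Module):
    """One attention head with a post-output sigmoid gate (illustrative sketch)."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.gate = nn.Linear(d_model, d_head)  # W_k^g, b_k^g

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [N, d_model] node embeddings of one graph.
        q, k, v = self.q(h), self.k(h), self.v(h)
        attn = torch.softmax(q @ k.T / math.sqrt(q.size(-1)), dim=-1)  # [N, N], row-stochastic
        head = attn @ v                                                # softmax(QK^T / sqrt(d_k)) V
        g = torch.sigmoid(self.gate(h))                                # g_k = sigma(H W_k^g + b_k^g), in (0, 1)
        return head * g                                                # head_k ⊙ g_k: the gate can drive outputs toward zero
```

Concatenating the gated heads and applying the usual output projection would recover the rest of the GPS attention sub-layer; the parallel MPNN branch and feed-forward block of GraphGPS are left unchanged.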
Key properties:
Suppression of Uninformative Outputs: Heads may output zero (or near-zero) for irrelevant nodes/pairs.
Breaks Softmax Rank Bottleneck: The gate increases the effective output rank per head (measurable via stable rank) beyond the bound imposed by row-stochastic softmax, whose outputs are convex combinations of the rows of $V$.
Gradient Flow Regularization: The bounded sigmoid reduces both vanishing and exploding gradients, improving training robustness across wide hyperparameter settings.
The scheme introduces minimal parameter overhead: an additional projection and bias per head per layer (<1.2% overall), with negligible computational cost (<3% wall-clock overhead).
Experimental Evaluation
SigGate-GT is evaluated on standard molecular and peptide graph benchmarks (ZINC, ogbg-molhiv, ogbg-molpcba, LRGB/Peptides-func, LRGB/Peptides-struct) and compared to GraphGPS, GRIT, Exphormer, and various MPNN/GNN baselines.
Main Results
ZINC (regression): Matches the prior state-of-the-art (0.059 MAE, tied with GRIT) and beats GraphGPS by 15.7% (statistically significant, $p < 0.05$).
ogbg-molhiv: Achieves new SOTA (82.47% ROC-AUC), outperforming all listed baselines.
ogbg-molpcba: Achieves 29.84% AP, modestly surpassing previous SOTA.
LRGB Benchmarks: Competitive with SOTA; statistically significant gains over GraphGPS on both peptide tasks (Peptides-func and Peptides-struct).
Significance tests establish consistent improvements over GraphGPS, with wins or ties against the strongest prior methods depending on the dataset.
Ablation and Analysis
Gate Placement and Sharing
Extensive ablations validate that the optimal position for gating is post-attention output (G1), and per-head parameterization is key for maximal expressivity. Value gating (G2) and pre-softmax gating (G3) lead to degraded or unstable performance.
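The summary describes the variants only at a high level, so the sketch below is an assumption about where the Hadamard product sits: it contrasts output gating (G1) with value gating (G2), and omits pre-softmax gating (G3) because its exact form is not specified here.

```python
import torch

def gated_head(q, k, v, g, placement="G1"):
    # q, k, v, g: [N, d_head] tensors; g holds sigmoid gate values in (0, 1).
    attn = torch.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)
    if placement == "G2":        # value gating: gate V before aggregation (hypothetical form)
        return attn @ (v * g)
    out = attn @ v
    if placement == "G1":        # post-attention output gating, the best-performing variant
        out = out * g
    return out
```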
Over-smoothing Mitigation
SigGate-GT significantly slows over-smoothing: at 16 layers, the Mean Average Distance (MAD) of node representations degrades only to 73% of its initial value (vs. 51% for GraphGPS). This translates into a slower loss of node-level discriminability at depth.
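For reference, one common way to compute a MAD-style over-smoothing probe is sketched below (assumption: cosine distance averaged over all node pairs; the paper's exact MAD variant may restrict the pairs or normalize differently):

```python
import torch
import torch.nn.functional as F

def mean_average_distance(h: torch.Tensor) -> torch.Tensor:
    # h: [N, d] node representations at some layer.
    hn = F.normalize(h, dim=-1)
    dist = 1.0 - hn @ hn.T                      # [N, N] pairwise cosine distances
    n = h.size(0)
    off_diag = dist.sum() - dist.diagonal().sum()  # exclude self-pairs
    return off_diag / (n * (n - 1))

# Usage: track mean_average_distance(h_layer) / mean_average_distance(h_input)
# across depth; a ratio staying near 1 indicates little over-smoothing.
```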
Attention Entropy
SigGate-GT maintains substantially higher attention entropy across layers, indicating that the gating module counteracts attention collapse and prevents heads from converging on uniform or peaked distributions—both detrimental for discriminative capacity.
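A hedged sketch of such an entropy probe (assumption: row-wise Shannon entropy averaged over query nodes and normalized by $\log N$; the paper's exact normalization may differ):

```python
import math
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: [N, N] row-stochastic attention matrix for one head.
    eps = 1e-12
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per query node
    return row_entropy.mean() / math.log(attn.size(-1))     # in [0, 1]: near 0 = peaked, near 1 = uniform
```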
Training Stability
SigGate-GT's results are stable across a $10\times$ learning rate range. The maximum difference in MAE across this range is reduced by a factor of 7 compared to the base model.
Gate Statistics
Analysis of learned gates reveals active use of the full $[0, 1]$ range and head specialization; some gates consistently suppress, others pass through, confirming that the mechanism utilizes its extra degrees of freedom.
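A simple probe of this kind might look as follows (illustrative only; it assumes gate activations have been collected into a tensor of shape [n_layers, n_heads, N, d_head]):

```python
import torch

def gate_summary(gates: torch.Tensor):
    per_head_mean = gates.mean(dim=(-2, -1))          # [n_layers, n_heads]: average gate openness per head
    frac_suppressing = (gates < 0.1).float().mean()   # share of near-closed gate values
    frac_passing = (gates > 0.9).float().mean()       # share of near-open gate values
    return per_head_mean, frac_suppressing, frac_passing
```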
Comparison with Other Remedies
Compared to DropEdge and PairNorm, SigGate-GT yields larger improvements on all core benchmarks. Combination with DropEdge yields only marginal additional gains, indicating gating directly addresses the principal over-smoothing mechanism.
Synthetic Stable Rank Validation
Synthetic experiments verify that introducing gating increases effective (stable) rank of attention output by 5–8% across a range of graph sparsities and attention concentrations.
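A sketch of the stable-rank measurement, under the usual definition $\operatorname{srank}(X) = \lVert X\rVert_F^2 / \lVert X\rVert_2^2$ (the synthetic inputs below are placeholders, not the paper's setup):

```python
import torch

def stable_rank(x: torch.Tensor) -> torch.Tensor:
    # x: [N, d_head] per-head attention output.
    s = torch.linalg.svdvals(x)          # singular values, descending
    return (s ** 2).sum() / (s[0] ** 2)

# Synthetic check: compare the ungated vs. gated output of a random head.
N, d = 128, 16
q, k, v = (torch.randn(N, d) for _ in range(3))
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
g = torch.sigmoid(torch.randn(N, d))
print(stable_rank(attn @ v), stable_rank((attn @ v) * g))
```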
Practical and Theoretical Implications
The SigGate mechanism has immediate value for practitioners working with deep graph transformers. Its ease of integration (sits after SDPA), negligible overhead, and statistically significant, robust improvements across benchmarks affirm its utility as a standard inductive bias for GTs.
Theoretically, the suppression of softmax’s rank bottleneck improves per-head representational capacity, especially in deeper GTs, making it possible to mitigate the spectral degeneracy (over-smoothing) that dominates as layer count increases. The smooth gating operation can be interpreted as a learned bypass for the Laplacian-averaging effect of repeated global attention—an effect previously only partially addressed by architectural normalization or stochastic edge-dropping.
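One way to make the rank argument concrete, using standard linear-algebra facts rather than the paper's own derivation: the ungated head is a matrix product, so
$$\operatorname{rank}(\mathrm{head}_k) = \operatorname{rank}\!\bigl(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\bigr) \le \operatorname{rank}(V),$$
whereas the gated output is a Hadamard product, which obeys only the weaker bound
$$\operatorname{rank}\bigl(\mathrm{head}_k \odot g_k\bigr) \le \operatorname{rank}(\mathrm{head}_k)\,\operatorname{rank}(g_k),$$
so the gate can (though need not) lift the effective rank above what any convex re-weighting of the rows of $V$ permits.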
Furthermore, the mechanism's effectiveness on graph data—networks without sequential structure—demonstrates the transferability of recent innovations from language modeling. It suggests attention-gate separation is broadly beneficial for any domain where the relevance of interactions is sparse or highly context-dependent.
Limitations and Future Directions
The main limitation is scope: all results use a GraphGPS backbone; cross-backbone empirical generalization (e.g., to Graphormer or Exphormer with different positional/structural priors) remains unproven. Additionally, all benchmarks are graph-level prediction tasks—how SigGate-GT behaves for node-level or link-prediction tasks is an open empirical question. The depth analysis is confounded by training separate models at each depth. Further, while gates are learned on top of softmax attention, it is possible that other normalization schemes or nonexpansive attention operators could yield complementary or alternative gains.
Future work should adapt SigGate-GT to a broader class of architectures and tasks, develop theoretical connections to the WL hierarchy, and analyze the mechanism's interplay with positional encodings and local structural priors.
Conclusion
By incorporating per-head, element-wise sigmoid gating at the attention output of graph transformers, SigGate-GT systematically reduces over-smoothing, increases entropy and expressivity of head outputs, and improves training robustness. The mechanism incurs minor parameter and compute overheads but is responsible for statistically significant gains on all investigated graph learning benchmarks, matching or surpassing leading alternatives. The evidence motivates adoption of output gating as a standard architectural element in deep graph transformer design.