Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models

Published 19 Jun 2024 in cs.LG | (2406.13137v1)

Abstract: Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient. Compared with the base optimizer (e.g., Adam), SAM doubles the time overhead due to the additional perturbation gradient. By dissecting the theory of SAM and observing the training gradient of the molecular graph transformer, we propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models. There are two key factors that contribute to this result: (i) gradient approximation: we use the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly (increases efficiency); (ii) loss landscape approximation: we theoretically prove that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM (guarantees generalization performance). The extensive experiments on six datasets with different tasks demonstrate the superiority of GraphSAM, especially in optimizing the model update process. The code is available at https://github.com/YL-wang/GraphSAM/tree/graphsam

Summary

The paper introduces GraphSAM, which approximates SAM's perturbation gradient to significantly reduce computational overhead.
It preserves SAM's generalization performance by keeping the approximated perturbation close to SAM's expected loss landscape.
Extensive tests show up to 155.4% training speed improvements across six benchmark datasets in molecular property prediction.

Introduction

The paper "Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models" presents GraphSAM, an optimization algorithm designed to address the computational inefficiency of Sharpness-Aware Minimization (SAM) when applied to molecular graph transformer models. SAM, although effective at avoiding sharp local minima and improving generalization, incurs extra time overhead because it requires two gradient computations per optimization step. The paper proposes GraphSAM to improve computational efficiency while preserving the generalization benefits of SAM.

Sharp Local Minima and SAM

Transformers for molecular property prediction often converge to sharp local minima due to their over-parameterization and absence of hand-crafted features. This leads to substantial generalization errors. SAM combats this by seeking weights whose whole neighborhood has low loss: it perturbs the model weights toward the worst-case (highest-loss) point within a defined neighborhood and then updates the original weights using the gradient computed at that perturbed point. This requires both an adversarial (perturbation) gradient and an updating gradient at every step. While effective, the approach doubles the computational cost, motivating a more efficient strategy.
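The two-gradient structure of a SAM step can be illustrated with a minimal sketch. This is not the paper's implementation: the toy loss, the function names `sam_step` and `toy_grad`, and the values of `lr` and `rho` are all assumptions chosen only to make the example self-contained.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update on a list of floats (toy sketch, not the paper's code)."""
    # 1st gradient call: perturbation gradient at the current weights.
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # guard against a zero gradient
    # Step to the approximate worst-case point on an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 2nd gradient call: updating gradient at the perturbed weights.
    g_adv = grad_fn(w_adv)
    # Apply the updating gradient to the original (unperturbed) weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

grad_calls = 0

def toy_grad(w):
    """Gradient of the hypothetical toy loss L(w) = sum(w_i ** 2)."""
    global grad_calls
    grad_calls += 1
    return [2.0 * wi for wi in w]

w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, toy_grad)

print(grad_calls)  # 100: two gradient calls for each of the 50 steps
```

The counter makes the overhead visible: a plain gradient-descent loop over the same 50 steps would call `toy_grad` 50 times, whereas this loop calls it 100 times.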

GraphSAM: Design and Functionality

GraphSAM introduces an efficient mechanism to approximate the perturbation gradient using the previously computed updating gradient. The algorithm performs the following key operations:

Gradient Approximation: GraphSAM uses the updating gradient from the prior step to approximate the perturbation gradient, reducing computational redundancy.
Loss Landscape Constraint: It ensures that the approximated perturbation does not deviate significantly from SAM's expected loss landscape, thereby maintaining model generalization.

GraphSAM re-anchors the perturbation gradient only intermittently rather than recomputing it at every step. This avoids most of SAM's extra gradient computations and keeps training cost comparable to that of conventional optimizers.
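The gradient-approximation idea can be sketched as follows. This is a hedged illustration rather than the authors' algorithm: it assumes the approximation is an exponential moving average with smoothing parameter `beta`, that re-anchoring happens once at the start of each epoch, and it uses a hypothetical toy loss. The names `graphsam_like_epoch`, `_unit`, and `toy_grad` are invented for this example.

```python
import math

def _unit(v):
    """Return v scaled to unit L2 norm (or a copy of v if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def toy_grad(w):
    """Gradient of the hypothetical toy loss L(w) = sum(w_i ** 2)."""
    return [2.0 * wi for wi in w]

def graphsam_like_epoch(w, grad_fn, steps, lr=0.1, rho=0.05, beta=0.9):
    """Run one epoch; returns (new_weights, number_of_gradient_calls)."""
    calls = 0
    # Re-anchor: one true perturbation gradient at the start of the epoch.
    omega = grad_fn(w)
    calls += 1
    for _ in range(steps):
        # Perturb along the APPROXIMATED gradient; no extra gradient call here.
        direction = _unit(omega)
        w_adv = [wi + rho * di for wi, di in zip(w, direction)]
        # The only gradient call per step: the updating gradient.
        g = grad_fn(w_adv)
        calls += 1
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        # Assumed smoothing rule: fold the updating gradient into the estimate.
        omega = [beta * oi + (1 - beta) * gi for oi, gi in zip(omega, g)]
    return w, calls

w_final, calls = graphsam_like_epoch([1.0, -2.0], toy_grad, steps=10)
print(calls)  # 11: one re-anchoring call plus one call per step
```

Under these assumptions an epoch of 10 steps costs 11 gradient evaluations, compared with 20 for a SAM loop that computes both gradients at every step.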

Figure 1: Illustration on the observation of gradient variation during training on GROVER with three datasets.

Performance and Results

Extensive experiments were conducted on six benchmark datasets, including both classification and regression tasks. The evaluation demonstrated the superiority of GraphSAM over baseline models trained with only the base optimizer, such as Adam. Notably, GraphSAM matched SAM in generalization performance while training significantly faster, with speed improvements of up to 155.4% compared to SAM.

Figure 2: Accuracy-Training Time of different models for GraphSAM-K. Average Time (s/epoch) represents the average time consumed for each epoch.

Implementation Considerations

When implementing GraphSAM, researchers should consider:

Hyperparameters: The smoothing parameter $\beta$, the initial $\rho$, the scheduler's modification scale $\gamma$, and $\rho$'s update rate $\lambda$ need careful tuning for optimal performance.
Computational Concerns: While GraphSAM avoids most of SAM's doubled gradient cost, re-anchoring still requires an additional gradient computation at the start of each epoch, trading a small amount of efficiency for accuracy.
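One way to keep the four hyperparameters together is a small configuration object. The sketch below is illustrative only: the default values are placeholders rather than values from the paper, and the schedule in `rho_at_epoch` (multiply $\rho$ by $\gamma$ every $\lambda$ epochs) is an assumed interpretation of "modification scale" and "update rate", not a confirmed description of the authors' scheduler. The names `GraphSAMConfig` and `rho_at_epoch` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GraphSAMConfig:
    """Hypothetical container for the four tunable hyperparameters."""
    beta: float = 0.9    # smoothing parameter (placeholder value)
    rho0: float = 0.05   # initial perturbation radius rho (placeholder value)
    gamma: float = 0.5   # scheduler's modification scale (placeholder value)
    lam: int = 3         # rho's update rate, assumed here to be in epochs

def rho_at_epoch(cfg, epoch):
    """Assumed schedule: scale rho by gamma once every `lam` epochs."""
    return cfg.rho0 * cfg.gamma ** (epoch // cfg.lam)

cfg = GraphSAMConfig()
schedule = [rho_at_epoch(cfg, e) for e in range(7)]
```

With these placeholder values, $\rho$ stays at its initial value for epochs 0 to 2, is halved for epochs 3 to 5, and is halved again at epoch 6. Freezing the dataclass keeps a tuning run's settings from being modified by accident mid-training.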

Conclusion

GraphSAM reduces the computational overhead of SAM without compromising its ability to steer training away from sharp local minima. By leveraging gradient approximation, it makes sharpness-aware training more practical for graph transformer models in molecular property prediction. Further work could explore its adaptability to other neural architectures and optimization frameworks, potentially extending it beyond molecular tasks.