Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions (2403.00875v1)
Abstract: Data augmentation is an effective way to make the most of the limited amount of labeled protein data. However, most existing work focuses on designing new architectures or pre-training tasks, and relatively little has studied data augmentation for proteins. This paper extends data augmentation techniques previously used for images and text to proteins, then benchmarks them on a variety of protein-related tasks, providing the first comprehensive evaluation of protein augmentation. Furthermore, we propose two novel semantic-level protein augmentation methods, Integrated Gradients Substitution and Back Translation Substitution, which enable semantics-aware protein augmentation through saliency detection and biological knowledge. Finally, we integrate the extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, Automated Protein Augmentation (APA), which adaptively selects the most suitable combination of augmentations for each task. Extensive experiments show that, compared to vanilla implementations without augmentation, APA improves performance on five protein-related tasks by an average of 10.55% across three architectures, highlighting its potential to make a significant impact on the field.
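To make the flavor of these augmentations concrete, below is a minimal Python sketch of two sequence-level protein augmentations: a random residue substitution (a direct port of text-style token replacement) and a saliency-guided substitution in the spirit of the proposed Integrated Gradients Substitution. The function names, the substitution rate, and the assumption that per-residue saliency scores are precomputed by an attribution method are illustrative choices, not the paper's actual implementation.

```python
import random

# Canonical 20 amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_substitution(seq, rate=0.05, seed=None):
    """Token-level augmentation: replace each residue with probability `rate`."""
    rng = random.Random(seed)
    out = list(seq)
    for i, residue in enumerate(out):
        if rng.random() < rate:
            # Pick a different amino acid so the edit is a true substitution.
            out[i] = rng.choice([a for a in AMINO_ACIDS if a != residue])
    return "".join(out)


def saliency_guided_substitution(seq, saliency, rate=0.05, seed=None):
    """Semantic-aware augmentation in the spirit of Integrated Gradients
    Substitution: only the LEAST salient residues are replaced, so positions
    the model deems important for the prediction are left untouched.
    `saliency` holds per-residue importance scores, assumed to come from an
    attribution method such as Integrated Gradients (not implemented here)."""
    rng = random.Random(seed)
    n_swap = max(1, int(rate * len(seq)))
    # Indices sorted by ascending saliency; the lowest are safest to perturb.
    low_saliency = sorted(range(len(seq)), key=lambda i: saliency[i])[:n_swap]
    out = list(seq)
    for i in low_saliency:
        out[i] = rng.choice([a for a in AMINO_ACIDS if a != out[i]])
    return "".join(out)


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    rng = random.Random(0)
    scores = [rng.random() for _ in seq]  # placeholder saliency scores
    print(random_substitution(seq, rate=0.1, seed=0))
    print(saliency_guided_substitution(seq, scores, rate=0.1, seed=0))
```

In the APA framework described above, augmentations like these would sit in the shared pool, with the combination applied to each task selected automatically; that selection loop is beyond the scope of this sketch.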
- Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12):1315–1322, 2019.
- Language modelling for biological sequences: curated datasets and baselines. bioRxiv, 2020.
- AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
- Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
- ProtTrans: Towards cracking the language of life’s code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225, 2020.
- A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075, 2021.
- Sequence-level mixed sample data augmentation. arXiv preprint arXiv:2011.09039, 2020.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
- Improving auto-augment via augmentation-wise weight sharing. Advances in Neural Information Processing Systems, 33:19088–19098, 2020.
- Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201, 2018.
- Learning more robust features with adversarial training. arXiv preprint arXiv:1804.07757, 2018.
- Fast autoaugment. Advances in Neural Information Processing Systems, 32, 2019.
- Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.
- AutoMix: Unveiling the power of mixup for stronger classifiers, 2022.
- Harnessing hard mixed samples with decoupled regularizer, 2023.
- Self-supervised contrastive learning of protein representations by mutual information maximization. bioRxiv, 2020.
- Nucleotide augmentation for machine learning-guided protein engineering. Bioinformatics Advances, 3(1):vbac094, 2023.
- Shuffle and learn: Unsupervised learning using temporal order verification. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, pages 527–544. Springer, 2016.
- Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems, 32, 2019.
- Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
- AptaTrans: A deep neural network for predicting aptamer-protein interaction using pretrained encoders. BMC Bioinformatics, 24(1):447, 2023.
- A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.
- Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
- UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.
- Data augmentation using random image cropping and patching for deep CNNs. IEEE Transactions on Circuits and Systems for Video Technology, 30(9):2917–2931, 2019.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196, 2019.
- Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering, 2021.
- A survey on protein representation learning: Retrospect and prospect. arXiv preprint arXiv:2301.00813, 2022.
- Knowledge distillation improves graph structure augmentation for graph neural networks. Advances in Neural Information Processing Systems, 35:11815–11827, 2022.
- GraphMixup: Improving class-imbalanced node classification by reinforcement mixup and self-supervised context prediction. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 519–535. Springer, 2022.
- PSC-CPI: Multi-scale protein sequence-structure contrasting for efficient and generalizable compound-protein interaction prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
- MAPE-PPI: Towards effective and efficient protein-protein interaction prediction via microenvironment-aware protein embedding. In The Twelfth International Conference on Learning Representations, 2024.
- Improving the physical realism and structural accuracy of protein models by a two-step atomic-level energy minimization. Biophysical Journal, 101(10):2525–2534, 2011.
- PEER: A comprehensive and multi-task benchmark for protein sequence understanding. Advances in Neural Information Processing Systems, 35:35156–35173, 2022.
- Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125, 2022.
- Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.