Studying Vulnerable Code Entities in R (2402.04421v1)
Abstract: Pre-trained Code LLMs (Code-PLMs) have advanced rapidly and achieved state-of-the-art results on many software engineering tasks in the past few years. These models mainly target popular programming languages such as Java and Python, leaving out many others, including R. Although R has a large community of developers and users, little is known about the applicability of Code-PLMs to R. In this preliminary study, we investigate the vulnerability of Code-PLMs with respect to code entities in R. To this end, we use an R dataset of code-comment pairs and apply CodeAttack, a black-box attack model that uses the structure of code to generate adversarial code samples, and we examine which R entities the attack targets. This is a first step towards understanding the importance of R token types compared to popular programming languages (e.g., Java). We limit our study to code summarization. Our results show that the most vulnerable code entity is the identifier, followed by several syntax tokens specific to R. These findings shed light on the importance of token types and can inform the development of models for code summarization and method name prediction for the R language.
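To make the setup concrete: CodeAttack, broadly, identifies vulnerable tokens in the input code and substitutes them under code-aware constraints, then measures how much the victim model's output degrades. The sketch below is a much-simplified illustration of that idea for R code summarization, not the authors' implementation or the CodeAttack code: `summarize` and `similarity` are hypothetical placeholders for the victim Code-PLM and an output-similarity metric (e.g., BLEU), and the greedy single-token substitution stands in for CodeAttack's constrained search.

```python
# Toy sketch of a black-box, CodeAttack-style perturbation on an R snippet:
# find the identifier whose substitution most degrades the model's summary.
import re
from difflib import SequenceMatcher

R_KEYWORDS = {"function", "if", "else", "for", "while", "repeat",
              "return", "TRUE", "FALSE", "NULL", "NA", "in"}


def identifier_tokens(r_code):
    """Return (position, token) pairs for R identifiers, skipping keywords."""
    return [(m.start(), m.group()) for m in
            re.finditer(r"[A-Za-z._][A-Za-z0-9._]*", r_code)
            if m.group() not in R_KEYWORDS]


def summarize(r_code):
    # Hypothetical stand-in for the victim Code-PLM; a real attack would
    # query a code-summarization model (e.g., a fine-tuned Code-PLM) here.
    # This toy stub keys on the presence of a token so the demo has
    # something to attack.
    tokens = {tok for _, tok in identifier_tokens(r_code)}
    if "mean" in tokens:
        return "computes the column means of a data frame"
    return "applies a function over the columns of a data frame"


def similarity(a, b):
    # Stand-in for an output-similarity metric such as BLEU.
    return SequenceMatcher(None, a, b).ratio()


def attack(r_code):
    """Greedy single-token substitution: return the identifier whose
    replacement causes the largest drop in summary similarity."""
    original_summary = summarize(r_code)
    best_drop, best_code, best_token = 0.0, r_code, None
    for idx, (pos, tok) in enumerate(identifier_tokens(r_code)):
        perturbed = r_code[:pos] + f"v{idx}" + r_code[pos + len(tok):]
        drop = 1.0 - similarity(original_summary, summarize(perturbed))
        if drop > best_drop:
            best_drop, best_code, best_token = drop, perturbed, tok
    return best_drop, best_code, best_token


if __name__ == "__main__":
    snippet = "col_means <- function(df) sapply(df, mean, na.rm = TRUE)"
    drop, adversarial, token = attack(snippet)
    print(f"most vulnerable token: {token!r} (similarity drop {drop:.2f})")
    print(adversarial)
```

Swapping the placeholder `summarize` for a Code-PLM fine-tuned on R code-comment pairs, and the greedy loop for CodeAttack's substitution strategy, would approximate the experimental setup the study describes; the token-type breakdown of successful substitutions is what surfaces identifiers as the most vulnerable entity.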
- Characterizing Bugs in Python and R Data Analytics Programs. arXiv preprint arXiv:2306.08632 (2023).
- Toufique Ahmed and Premkumar Devanbu. 2022. Multilingual Training for Software Engineering. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1443–1455. https://doi.org/10.1145/3510003.3510049
- Hemayet Ahmed Chowdhury. 2023. An Empirical Study of API Breaking Changes in Bioconductor. Ph.D. Dissertation. Virginia Tech.
- On the Development and Distribution of R Packages: An Empirical Analysis of the R Ecosystem. In Proceedings of the 2015 European Conference on Software Architecture Workshops (Dubrovnik, Cavtat, Croatia) (ECSAW ’15). Association for Computing Machinery, New York, NY, USA, Article 41, 6 pages. https://doi.org/10.1145/2797433.2797476
- When GitHub Meets CRAN: An Analysis of Inter-Repository Package Dependency Problems. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. 493–504. https://doi.org/10.1109/SANER.2016.12
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages. https://doi.org/10.48550/ARXIV.2002.08155
- Explaining and Harnessing Adversarial Examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 7212–7225. https://doi.org/10.18653/v1/2022.acl-long.499
- GraphCodeBERT: Pre-training Code Representations with Data Flow. https://doi.org/10.48550/ARXIV.2009.08366
- CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. (June 2020).
- Akshita Jha and Chandan K. Reddy. 2023. CodeAttack: Code-Based Adversarial Attacks for Pre-Trained Programming Language Models. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’23/IAAI’23/EAAI’23). AAAI Press, Article 1670, 9 pages. https://doi.org/10.1609/aaai.v37i12.26739
- Learning and Evaluating Contextual Embedding of Source Code. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 474, 12 pages.
- A Closer Look at the Robustness of Vision-and-Language Pre-trained Models. (December 2020). https://www.microsoft.com/en-us/research/publication/a-closer-look-at-the-robustness-of-vision-and-language-pre-trained-models/
- StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J. Vanschoren and S. Yeung (Eds.), Vol. 1. Curran. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/c16a5320fa475530d9583c34fd356ef5-Paper-round1.pdf
- DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2574–2582. https://doi.org/10.1109/CVPR.2016.282
- Evaluating the Design of the R Language. In ECOOP 2012 – Object-Oriented Programming, James Noble (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 104–131.
- An Empirical Comparison of Pre-Trained Models of Source Code. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2136–2148. https://doi.org/10.1109/ICSE48619.2023.00180
- Jeroen Ooms. 2013. Possible Directions for Improving Dependency Versioning in R. CoRR abs/1303.2140 (2013). arXiv:1303.2140 http://arxiv.org/abs/1303.2140
- Time-Efficient Code Completion Model for the R Programming Language. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021). Association for Computational Linguistics, Online, 34–39. https://doi.org/10.18653/v1/2021.nlp4prog-1.4
- Language Models are Unsupervised Multitask Learners. https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe
- BayesOpt Adversarial Attack. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=Hkem-lrtvH
- Attention Is All You Need. https://doi.org/10.48550/ARXIV.1706.03762
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8696–8708. https://doi.org/10.18653/v1/2021.emnlp-main.685
- Hadley Wickham and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). O’Reilly Media, Inc.
- Natural Attack for Pre-Trained Models of Code. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1482–1493. https://doi.org/10.1145/3510003.3510146
- Adversarial Examples for Models of Code. Proc. ACM Program. Lang. 4, OOPSLA, Article 162 (Nov. 2020), 30 pages. https://doi.org/10.1145/3428230
- Adversarial Robustness of Deep Code Comment Generation. ACM Trans. Softw. Eng. Methodol. 31, 4, Article 60 (Jul. 2022), 30 pages. https://doi.org/10.1145/3501256