
Studying Vulnerable Code Entities in R (2402.04421v1)

Published 6 Feb 2024 in cs.SE and cs.AI

Abstract: Pre-trained Code LLMs (Code-PLMs) have shown many advancements and achieved state-of-the-art results for many software engineering tasks in the past few years. These models mainly target popular programming languages such as Java and Python, leaving out many others, such as R. Though R has a wide community of developers and users, little is known about the applicability of Code-PLMs to R. In this preliminary study, we aim to investigate the vulnerability of Code-PLMs for code entities in R. For this purpose, we use an R dataset of code and comment pairs and apply CodeAttack, a black-box attack model that uses the structure of code to generate adversarial code samples. We investigate how the attack affects different code entities in R. This is a first step towards understanding the importance of R token types compared to popular programming languages (e.g., Java). We limit our study to code summarization. Our results show that the most vulnerable code entity is the identifier, followed by some syntax tokens specific to R. These results shed light on the importance of token types and can help in developing models for code summarization and method name prediction for the R language.
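As an illustrative sketch (not the authors' implementation), a black-box attack in the spirit of CodeAttack perturbs the most vulnerable token type, the identifier, while leaving the R syntax intact, so that the program's behavior is preserved but the semantic cues a code-summarization model relies on are erased. The helper below is hypothetical; it substitutes a chosen identifier in an R snippet using word-boundary matching:

```python
import re

def rename_identifier(r_code: str, old: str, new: str) -> str:
    """Replace every occurrence of the identifier `old` with `new` in a
    snippet of R source, matching on word boundaries so that substrings
    of longer names are left intact. (Note: R identifiers may also
    contain dots, which this simple word-boundary pattern treats as
    separators; a real attack would tokenize the code properly.)"""
    return re.sub(rf"\b{re.escape(old)}\b", new, r_code)

# Original R function: the identifier `total` carries semantic cues.
r_src = (
    "add_sum <- function(xs) {\n"
    "  total <- 0\n"
    "  for (x in xs) total <- total + x\n"
    "  total\n"
    "}"
)

# Adversarial variant: same behavior, meaningless identifier.
adv_src = rename_identifier(r_src, "total", "vqz1")
```

A CodeAttack-style search would choose the replacement token greedily to maximize the drop in summarization quality; this sketch only shows the perturbation step itself.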
