Scaling Laws Behind Code Understanding Model (2402.12813v1)
Abstract: The scaling law has become a fundamental principle across many areas of machine learning: test error falls off as a power law as training data, model size, and computing resources increase. However, whether this law holds for code understanding has not been well studied, and most current code understanding models have roughly 100M parameters, which is "small" compared to general-purpose LLMs. In this paper, we conduct extensive experiments to investigate the scaling law for code understanding by varying training data, model size, and computing resources. We validate that the test error of code understanding models falls off as a power law with larger models, indicating that the scaling law applies to the code understanding task. In addition, we apply models of different scales to two downstream code understanding tasks and find that performance improves as model scale increases. Finally, we train a large-scale code understanding model named CoLSBERT with 1.5B parameters on a large dataset using more computing resources, and it outperforms previous work by a large margin. We will release our code and the CoLSBERT model when our paper is published.
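As a rough illustration of the power-law relationship referenced in the abstract, the sketch below fits test loss against model size using the parameterization commonly seen in scaling-law studies, loss(N) = a·N^(−α) + c (a power-law term plus an irreducible loss floor). The size/loss values, the exact parameterization, and the use of scipy.optimize.curve_fit are illustrative assumptions, not the paper's actual data or fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params_m, a, alpha, c):
    """Common scaling-law form: loss = a * N^(-alpha) + c, with c as the irreducible loss."""
    return a * n_params_m ** (-alpha) + c

# Hypothetical (model size in millions of parameters, test loss) points,
# for illustration only -- not measurements from the paper.
sizes_m = np.array([15.0, 60.0, 125.0, 350.0, 750.0, 1500.0])
losses  = np.array([2.10, 1.85, 1.72, 1.58, 1.50, 1.43])

# Fit the scaling exponent alpha; p0 is a rough initial guess for the optimizer.
(a, alpha, c), _ = curve_fit(power_law, sizes_m, losses, p0=(2.0, 0.2, 1.0), maxfev=10000)
print(f"a = {a:.3f}, alpha = {alpha:.3f}, irreducible loss c = {c:.3f}")
```

On a log-log plot, the reducible part of such a fit appears as a straight line with slope −α, which is the usual visual check that a scaling law holds.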
Authors: Jiayi Lin, Hande Dong, Yutao Xie, Lei Zhang