Scaling Laws Behind Code Understanding Model (2402.12813v1)

Published 20 Feb 2024 in cs.SE

Abstract: The scaling law is becoming a fundamental law in many machine learning areas: test error falls off as a power law when training data, model size, and computing resources increase. However, whether this law holds for the task of code understanding has not been well studied, and most current models for code understanding have around 100M parameters, which is relatively "small" compared to LLMs. In this paper, we conduct extensive experiments to investigate the scaling law for the code understanding task by varying training data, model size, and computing resources. We validate that the test error of code understanding models falls off as a power law when using larger models, indicating that the scaling law applies to the code understanding task. In addition, we apply models of different scales to two downstream code understanding tasks and find that performance increases with model scale. Finally, we train a large-scale code understanding model named CoLSBERT with 1.5B parameters on a large dataset using more computing resources, which outperforms previous work by a large margin. We will release our code and the CoLSBERT model when our paper is published.
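
The central claim is that test error L decays as a power law in model size N, roughly L(N) ≈ a·N^(-α). The sketch below is a minimal illustration of how such an exponent is typically estimated by a linear fit in log-log space; the (model size, test error) pairs are purely hypothetical placeholders, not measurements from the paper.

```python
import numpy as np

# Hypothetical (parameter count, test error) pairs for illustration only;
# they are NOT numbers reported in the paper.
model_sizes = np.array([125e6, 350e6, 760e6, 1.5e9])  # parameters N
test_errors = np.array([0.42, 0.36, 0.32, 0.29])       # test error L(N)

# A power law L(N) = a * N^(-alpha) is a straight line in log-log space:
#   log L = log a - alpha * log N
# so a least-squares line fit recovers the scaling exponent alpha.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(test_errors), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted scaling exponent alpha = {alpha:.3f}")

# Extrapolate the fitted law to a larger model size.
n_new = 3e9
predicted = a * n_new ** (-alpha)
print(f"predicted test error at {n_new:.0e} params: {predicted:.3f}")
```

A fit like this is what lets scaling-law studies extrapolate from a handful of trained models to predict the error of a larger, not-yet-trained one.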

Authors (4)
  1. Jiayi Lin
  2. Hande Dong
  3. Yutao Xie
  4. Lei Zhang