
Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models (2401.00788v1)

Published 1 Jan 2024 in cs.CL, cs.AI, and cs.SE

Abstract: The high cost of full-parameter fine-tuning (FFT) of LLMs has led to a series of parameter-efficient fine-tuning (PEFT) methods. However, it remains unclear which methods provide the best cost-performance trade-off at different model scales. We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters. Through investigations across 5 tasks and 8 different datasets encompassing both code comprehension and code generation tasks, we find that FFT generally leads to the best downstream performance across all scales, and PEFT methods differ significantly in their efficacy based on the model scale. LoRA usually offers the most favorable trade-off between cost and performance. Further investigation into the effects of these methods on both model robustness and code security reveals that larger models tend to demonstrate reduced robustness and less security. At last, we explore the relationships among updated parameters, cross-entropy loss, and task performance. We find that the tuning effectiveness observed in small models generalizes well to larger models, and the validation loss in instruction tuning can be a reliable indicator of overall downstream performance.

Introduction to Parameter-Efficient Tuning of LLMs

The evolution of LLMs in software engineering has led to strong performance on tasks such as code comprehension and code generation. Recent work has shifted towards instruction-tuned Code LLMs that follow natural-language instructions and handle a variety of tasks without task-specific fine-tuning. However, as models grow larger, full-parameter fine-tuning (FFT) becomes prohibitively costly, pushing the field towards parameter-efficient fine-tuning (PEFT) methods. This paper evaluates PEFT methods across model scales to determine their impact on performance, robustness, and security.
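
To make the contrast concrete, the sketch below shows how a PEFT method such as LoRA wraps a causal language model so that only a small set of adapter weights is trained while the base parameters remain frozen. It is a minimal sketch assuming the Hugging Face transformers and peft libraries; the checkpoint name and target module names are placeholders rather than the paper's exact setup.

```python
# Minimal sketch of LoRA-style parameter-efficient tuning with Hugging Face peft.
# The checkpoint name and target_modules are illustrative placeholders, not the
# exact Astraios configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # which linear layers receive adapters (architecture-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Only the injected LoRA matrices require gradients; the base weights stay frozen.
model.print_trainable_parameters()
```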

Analyzing the PEFT Methods

The researchers developed Astraios, a suite of 28 instruction-tuned OctoCoder models spanning 7 tuning methods and 4 model sizes of up to 16 billion parameters. The models were evaluated on 5 tasks across 8 datasets covering both code comprehension and code generation. The findings indicate that FFT generally yields the best downstream performance at every scale, that the efficacy of PEFT methods varies with model size, and that LoRA usually offers the most favorable balance between cost and performance.
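
One way to make the cost side of this trade-off concrete is to compare how many parameters each tuning method actually updates. The helper below is a hypothetical sketch in PyTorch, not code from the Astraios suite; it simply reports the trainable fraction of a model, which is essentially 100% under FFT and typically well under 1% under LoRA-style methods.

```python
# Hypothetical helper (not from the Astraios codebase): report what fraction of a
# model's parameters are trainable. Under FFT essentially everything requires
# gradients; under PEFT methods such as LoRA the fraction is typically well under 1%.
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# Example usage with the LoRA-wrapped model from the previous sketch:
# print(f"trainable share: {trainable_fraction(model):.4%}")
```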

Model Scaling and Fine-Tuning Impact

Interestingly, larger models excel at code generation, but the same scaling pattern does not extend to code comprehension. Moreover, larger models exhibit decreased robustness and heightened security vulnerabilities, suggesting that larger instruction-tuned Code LLMs trade off code quality against security and reliability under adversarial inputs. The researchers also observed a strong correlation between instruction-tuning validation loss and downstream performance, indicating that validation loss can serve as a proxy for a model's broader capabilities.
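
The loss-as-proxy observation amounts to a rank correlation between each model's instruction-tuning validation loss and its downstream score. The snippet below is an illustrative sketch with invented numbers, assuming SciPy is available; it only demonstrates the kind of computation involved, not the paper's actual data.

```python
# Illustrative check of "validation loss predicts downstream performance".
# The numbers are invented for demonstration and are not taken from the paper.
from scipy.stats import spearmanr

val_loss = [0.62, 0.55, 0.51, 0.48]   # instruction-tuning validation loss per model
pass_at_1 = [0.21, 0.27, 0.30, 0.33]  # downstream code-generation score per model

rho, p_value = spearmanr(val_loss, pass_at_1)
# A strongly negative rho means lower validation loss tracks higher task performance.
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```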

Model Robustness and Security

Beyond raw task performance, the paper underscores the significance of model robustness and security. Evaluation with perturbed data and security-focused benchmarks revealed that methods updating fewer parameters can sometimes yield greater robustness. However, increasing model size correlates with diminishing robustness and a greater tendency to generate insecure code.
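
A lightweight way to probe this kind of robustness is to apply a small, semantics-preserving perturbation to an evaluation prompt and check whether the model's completion still passes. The sketch below is a toy illustration in that spirit; the perturbation and the generate helper are hypothetical, and the paper's actual evaluation relies on established perturbation benchmarks such as ReCode.

```python
# Toy robustness probe: apply a small, semantics-preserving perturbation to a prompt
# and compare the model's behaviour before and after. This is a simplified stand-in
# for ReCode-style perturbation benchmarks, not the paper's evaluation code.

def perturb_docstring(prompt: str) -> str:
    """Insert an extra space after each comma on the prompt's docstring line."""
    lines = prompt.splitlines()
    lines = [ln.replace(", ", ",  ") if '"""' in ln else ln for ln in lines]
    return "\n".join(lines)

prompt = 'def add(a, b):\n    """Return the sum of a and b, as an integer."""\n'
perturbed = perturb_docstring(prompt)

# With a real model one would generate completions for both prompts, e.g.
#   original_out  = generate(model, prompt)        # generate() is a hypothetical helper
#   perturbed_out = generate(model, perturbed)
# and measure how often functional correctness (pass@1) survives the perturbation.
print(prompt != perturbed)  # True: the surface form changed, the meaning did not
```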

Concluding Thoughts

The paper's analysis of fine-tuning highlights the intricate relationships among model size, tuning cost, performance, robustness, and security. With a comprehensive model suite, Astraios enables an in-depth study of these dynamics and offers practical guidance for developing more capable and reliable Code LLMs.

Acknowledgements and Contributions

The research benefited from contributions and support from numerous institutions, individuals, and the community, fostering collaborations that span academia and industry and highlighting the collective effort behind advances in AI and machine learning for software engineering.

Authors (7)
  1. Terry Yue Zhuo (32 papers)
  2. Armel Zebaze (8 papers)
  3. Nitchakarn Suppattarachai (1 paper)
  4. Leandro von Werra (19 papers)
  5. Harm de Vries (29 papers)
  6. Qian Liu (252 papers)
  7. Niklas Muennighoff (56 papers)
Citations (10)