BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models (2308.16458v5)

Published 31 Aug 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Pre-trained LLMs have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1,026 Python functions and 1,243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance performance on our testing benchmark (by >15% in Pass@K under certain prompt configurations, and always by >3%). The results highlight two key aspects of successful models: (1) they accommodate long prompts (>2,600 tokens) with full context, including functional dependencies; and (2) they contain domain-specific knowledge of bioinformatics beyond general coding capability. This is evident from the performance gain of GPT-3.5/4 over the smaller models on our benchmark (50% vs. up to 25%). Availability and implementation: Code is available at https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
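
The abstract leans on two evaluation mechanics worth unpacking: functional checking of generated code against expected behavior, and the Pass@K metric. Below is a minimal Python sketch of both. The fuzz-style check is illustrative only; candidate, reference, and input_gen are hypothetical placeholders and do not reflect BioCoder's actual harness, which uses its own test framework. The Pass@K function implements the standard unbiased estimator from Chen et al. (2021), which BioCoder's metric is based on.

    import random

    def fuzz_equivalent(candidate, reference, input_gen, trials=100):
        # Hypothetical fuzz check: sample random inputs and compare the
        # candidate's output to a trusted reference implementation.
        # BioCoder's real framework may differ (e.g., curated test cases).
        for _ in range(trials):
            args = input_gen(random)
            try:
                if candidate(*args) != reference(*args):
                    return False
            except Exception:
                return False
        return True

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased Pass@K estimator (Chen et al., 2021): given n sampled
        # completions of which c pass, estimate the probability that at
        # least one of k randomly drawn samples passes.
        if n - c < k:
            return 1.0
        prob_all_fail = 1.0
        for i in range(n - c + 1, n + 1):
            prob_all_fail *= 1.0 - k / i
        return 1.0 - prob_all_fail

    # Example: 20 samples per problem, 7 passed the functional check.
    print(pass_at_k(n=20, c=7, k=1))  # 0.35

The product form avoids the numerical overflow that a naive binomial-coefficient implementation of 1 - C(n-c, k)/C(n, k) can hit for large n.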

