UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing (2402.03396v1)

Published 4 Feb 2024 in cs.SE, cs.AI, cs.CL, cs.CR, and cs.LG

Abstract: The remarkable capability of LLMs in generating high-quality code has drawn increasing attention in the software testing community. However, existing code LLMs often demonstrate unsatisfactory capabilities in generating accurate and complete tests, since they were trained on code snippets collected without differentiating between code written for testing purposes and other code. In this paper, we present UniTSyn, a large-scale dataset capable of enhancing the prowess of LLMs for Unit Test Synthesis. Associating tests with the tested functions is crucial for LLMs to infer the expected behavior and the logic paths to be verified. By leveraging the Language Server Protocol, UniTSyn achieves the challenging goal of collecting focal-test pairs without per-project execution setups or per-language heuristics, which tend to be fragile and difficult to scale. It contains 2.7 million focal-test pairs across five mainstream programming languages, making it well suited to enhancing the test generation ability of LLMs. The details of UniTSyn can be found in Table 1. Our experiments demonstrate that, by building an autoregressive model on UniTSyn, we achieve significant benefits in learning and understanding unit test representations, resulting in improved generation accuracy and code coverage across all evaluated programming languages. Code and data will be publicly available.
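The core technical idea in the abstract is that the Language Server Protocol (LSP) offers a language-agnostic way to jump from a call site inside a unit test to the definition of the function under test, which is how focal-test pairs can be collected without per-project build setups or per-language parsing heuristics. The sketch below illustrates that idea in Python; it is not the authors' implementation, and the server command (`pylsp`), the class and function names, and the usage positions are all assumptions made for demonstration.

```python
"""Minimal sketch of LSP-based focal-test pair collection (illustrative,
not the UniTSyn implementation). Requires a language server on PATH,
e.g. `pip install python-lsp-server` for the `pylsp` command."""
import itertools
import json
import subprocess
from pathlib import Path


class LspClient:
    """Tiny JSON-RPC-over-stdio client speaking the Language Server Protocol."""

    def __init__(self, server_cmd, root_uri):
        self.proc = subprocess.Popen(
            server_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self._ids = itertools.count(1)
        # Handshake mandated by the LSP spec: initialize, then initialized.
        self.request("initialize", {
            "processId": None, "rootUri": root_uri, "capabilities": {}})
        self.notify("initialized", {})

    def _send(self, payload):
        body = json.dumps(payload).encode()
        self.proc.stdin.write(
            f"Content-Length: {len(body)}\r\n\r\n".encode() + body)
        self.proc.stdin.flush()

    def _recv(self):
        # LSP framing: header lines, a blank line, then a JSON body.
        length = 0
        while (line := self.proc.stdout.readline().strip()):
            if line.lower().startswith(b"content-length:"):
                length = int(line.split(b":")[1])
        return json.loads(self.proc.stdout.read(length))

    def notify(self, method, params):
        self._send({"jsonrpc": "2.0", "method": method, "params": params})

    def request(self, method, params):
        msg_id = next(self._ids)
        self._send({"jsonrpc": "2.0", "id": msg_id,
                    "method": method, "params": params})
        while True:
            msg = self._recv()
            if msg.get("id") == msg_id:  # skip server-initiated messages
                return msg.get("result")


def focal_location(client, test_path, line, character):
    """Resolve the definition of the symbol called at (line, character)
    inside a test file, i.e. locate the focal function for this test."""
    uri = Path(test_path).absolute().as_uri()
    client.notify("textDocument/didOpen", {"textDocument": {
        "uri": uri, "languageId": "python", "version": 1,
        "text": Path(test_path).read_text()}})
    return client.request("textDocument/definition", {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character}})  # zero-based


# Hypothetical usage (paths and positions are placeholders):
# client = LspClient(["pylsp"], Path(".").absolute().as_uri())
# print(focal_location(client, "tests/test_math.py", 11, 8))
```

Because `textDocument/definition` is a standard LSP request, the same client would work for other languages by swapping the server command (for example, clangd for C/C++ or gopls for Go), which is the scalability argument the abstract makes against per-language heuristics.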
