CAT-LM: Training Language Models on Aligned Code And Tests (2310.01602v1)
Abstract: Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. LLMs trained on code can generate code that is highly similar to that written by humans, but current models are trained to generate each file separately, as is standard practice in natural language processing, and thus fail to consider the code-under-test context when producing a test file. In this work, we propose the Aligned Code And Tests LLM (CAT-LM), a GPT-style LLM with 2.7 billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. We analyze its usefulness for realistic applications, showing that sampling with filtering (e.g., by compilability or coverage) allows it to efficiently produce tests that achieve coverage similar to those written by developers while resembling their writing style. By utilizing the code context, CAT-LM generates more valid tests than even much larger LLMs trained with more data (CodeGen 16B and StarCoder) and substantially outperforms a recent test-specific model (TeCo) at test completion. Overall, our work highlights the importance of incorporating software-specific insights when training LLMs for code and paves the way to more powerful automated test generation.
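The two ideas the abstract highlights, an aligned code-test pretraining signal and sample-then-filter test generation, can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the paper's implementation: the separator token string, the `model.generate` call, and the use of `ast.parse` as a stand-in compilability check are all hypothetical.

```python
import ast

# Hypothetical separator token; the actual string CAT-LM uses to join a
# code file with its test file may differ.
CODE_TEST_SEP = "<|codetestpair|>"


def make_aligned_example(code_file: str, test_file: str) -> str:
    """Join a code file and its matching test file into one training
    sequence, so the model sees the code under test as context when
    learning to generate the test file."""
    return f"{code_file}\n{CODE_TEST_SEP}\n{test_file}"


def sample_and_filter_tests(model, code_file: str, n_samples: int = 10) -> list[str]:
    """Sample several candidate test files conditioned on a code file and
    keep only candidates that parse. Stronger filters (compilation,
    execution, coverage) would follow the same keep-or-discard pattern.
    `model.generate` is a placeholder for whatever sampling API is used."""
    prompt = f"{code_file}\n{CODE_TEST_SEP}\n"
    candidates = [model.generate(prompt) for _ in range(n_samples)]

    valid = []
    for candidate in candidates:
        try:
            ast.parse(candidate)  # cheap syntactic proxy for compilability
        except SyntaxError:
            continue
        valid.append(candidate)
    return valid
```

The sketch only shows the cheapest filter; in practice, each surviving candidate would next be compiled and executed, and coverage-based filters would be applied last because they are the most expensive.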
- M. Beller, G. Gousios, A. Panichella, and A. Zaidman, “When, how, and why developers (do not) test in their IDEs,” in Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’15, 2015, pp. 179–190.
- M. Beller, G. Gousios, and A. Zaidman, “How (much) do developers test?” in International Conference on Software Engineering, ser. ICSE ’15, 2015, pp. 559–562.
- E. Dinella, G. Ryan, T. Mytkowicz, and S. Lahiri, “TOGA: A neural method for test oracle generation,” in International Conference on Software Engineering, ser. ICSE ’22, 2022, pp. 2130–2141.
- G. Fraser and A. Arcuri, “EvoSuite: Automatic test suite generation for object-oriented software,” in Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’11, 2011, pp. 416–419.
- C. Brandt and A. Zaidman, “Developer-centric test amplification: The interplay between automatic generation and human exploration,” Empirical Software Engineering, vol. 27, no. 4, 2022.
- R. Baldoni, E. Coppa, D. C. D’Elia, C. Demetrescu, and I. Finocchi, “A Survey of Symbolic Execution Techniques,” ACM Computing Surveys, vol. 51, no. 3, pp. 50–88, 2018.
- C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk, “On learning meaningful assert statements for unit test cases,” CoRR, vol. abs/2002.05800, 2020.
- J. Villmow, J. Depoix, and A. Ulges, “ConTest: A Unit Test Completion Benchmark featuring Context,” in Workshop on Natural Language Processing for Programming, Aug. 2021, pp. 17–25.
- A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn, “Revisiting Test Smells in Automatically Generated Tests: Limitations, Pitfalls, and Opportunities,” in International Conference on Software Maintenance and Evolution, 2020, pp. 523–533.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. W. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, I. Babuschkin, S. A. Balaji, S. Jain, A. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating Large Language Models Trained on Code,” CoRR, vol. abs/2107.03374, 2021.
- E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “A Conversational Paradigm for Program Synthesis,” CoRR, vol. abs/2203.13474, 2022.
- D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, “InCoder: A Generative Model for Code Infilling and Synthesis,” CoRR, vol. abs/2204.05999, 2022.
- M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen, “Efficient Training of Language Models to Fill in the Middle,” CoRR, vol. abs/2207.14255, 2022.
- “GitHub Copilot,” 2021. [Online]. Available: https://github.com/features/copilot
- P. Nie, R. Banerjee, J. J. Li, R. J. Mooney, and M. Gligoric, “Learning deep semantics for test completion,” in International Conference on Software Engineering, ser. ICSE ’23, 2023, pp. 2111–2123.
- R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “StarCoder: may the source be with you!” CoRR, vol. abs/2305.06161, 2023.
- L. von Werra, “CodeParrot.” [Online]. Available: https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot
- “SentencePiece.” [Online]. Available: https://github.com/google/sentencepiece
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” CoRR, vol. abs/2203.15556, 2022.
- “GPT-NeoX Toolkit.” [Online]. Available: https://github.com/EleutherAI/gpt-neox
- T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Advances in Neural Information Processing Systems, 2022.
- “GitHub REST API.” [Online]. Available: https://docs.github.com/en/rest
- M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, ser. SPLASH ’19, 2019, pp. 143–153.
- C. V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, and J. Vitek, “DéjàVu: a map of code duplicates on GitHub,” in Proceedings of the ACM on Programming Languages, ser. OOPSLA ’17, vol. 1, 2017, pp. 1–28.
- “TheFuzz: Fuzzy String Matching in Python.” [Online]. Available: https://github.com/seatgeek/thefuzz
- P. S. Kochhar, T. F. Bissyandé, D. Lo, and L. Jiang, “An empirical study of adoption of software testing in open source projects,” in International Conference on Quality Software, ser. QSIC ’13, 2013, pp. 103–112.
- H. H. F. d. Souza, I. Wiese, I. Steinmacher, and R. Ré, “A characterization study of testing contributors and their contributions in open source projects,” in Brazilian Symposium on Software Engineering, ser. SBES ’22, 2022, pp. 95–105.
- T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in Annual Meeting of the Association for Computational Linguistics, ser. ACL ’18, 2018, pp. 66–75.
- F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A Systematic Evaluation of Large Language Models of Code,” CoRR, vol. abs/2202.13169, 2022.
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and Efficient Foundation Language Models,” CoRR, vol. abs/2302.13971, 2023.
- C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out (ACL Workshop), 2004, pp. 74–81.
- S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “CodeBLEU: a Method for Automatic Evaluation of Code Synthesis,” CoRR, vol. abs/2009.10297, 2020.
- X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment generation with hybrid lexical and syntactical information,” Empirical Software Engineering, vol. 25, no. 3, pp. 2179–2217, 2020.
- A. LeClair, S. Jiang, and C. McMillan, “A neural model for generating natural language summaries of program subroutines,” in International Conference on Software Engineering, ser. ICSE ’19, 2019, pp. 795–806.
- Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation,” in Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8696–8708.
- S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” CoRR, vol. abs/2102.04664, 2021.
- OpenAI, “GPT-4 Technical Report,” CoRR, vol. abs/2303.08774, 2023.
- C. Pacheco and M. D. Ernst, “Randoop: Feedback-directed random testing for Java,” in Conference on Object-Oriented Programming Systems and Applications Companion, ser. OOPSLA ’07, 2007, pp. 815–816.
- A. Fioraldi, D. Maier, H. Eißfeldt, and M. Heuse, “AFL++: Combining incremental steps of fuzzing research,” in Conference on Offensive Technologies, ser. WOOT ’20, 2020, pp. 10–10.
- C. Boyapati, S. Khurshid, and D. Marinov, “Korat: Automated testing based on Java predicates,” SIGSOFT Software Engineering Notes, vol. 27, no. 4, pp. 123–133, 2002.
- K. Claessen and J. Hughes, “QuickCheck: A lightweight tool for random testing of Haskell programs,” in International Conference on Functional Programming, ser. ICFP ’00, 2000, pp. 268–279.
- D. MacIver, Z. Hatfield-Dodds, and M. Contributors, “Hypothesis: A new approach to property-based testing,” Journal of Open Source Software, vol. 4, no. 43, p. 1891, 2019.
- N. Tillmann and P. de Halleux, “Pex: White box test generation for .NET,” in Tests and Proofs, ser. TAP ’08, vol. 4966, April 2008, pp. 134–153.
- J. Choi, J. Jang, C. Han, and S. K. Cha, “Grey-box concolic testing on binary code,” in International Conference on Software Engineering, ser. ICSE ’19, 2019, pp. 736–747.
- E. Daka, J. M. Rojas, and G. Fraser, “Generating unit tests with descriptive names or: Would you name your children Thing1 and Thing2?” in International Symposium on Software Testing and Analysis, ser. ISSTA ’17, 2017, pp. 57–67.
- B. Robinson, M. D. Ernst, J. H. Perkins, V. Augustine, and N. Li, “Scaling up automated test generation: Automatically generating maintainable regression unit tests for programs,” in International Conference on Automated Software Engineering, ser. ASE ’11, 2011, pp. 23–32.
- R. White and J. Krinke, “ReAssert: Deep learning for assert generation,” CoRR, vol. abs/2011.09784, 2020.
- M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers,” CoRR, vol. abs/2009.05617, 2020.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, 2020, pp. 1877–1901.
- M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “Adaptive Test Generation Using a Large Language Model,” CoRR, vol. abs/2302.06527, 2023.
- Nikitha Rao
- Kush Jain
- Uri Alon
- Claire Le Goues
- Vincent J. Hellendoorn