
Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) (2304.06815v3)

Published 13 Apr 2023 in cs.SE and cs.LG

Abstract: Large language models (LLMs) are a new class of computation engines, "programmed" via prompt engineering. We are still learning how best to "program" these LLMs to help developers. We start with the intuition that developers tend to have, consciously and unconsciously, a collection of semantic facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, such facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow. One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of this simple level of "code analysis", extracting such information implicitly while processing code: but are they, really? If they aren't, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task, and to evaluate whether automatically and explicitly augmenting an LLM's prompt with semantic facts actually helps. Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same project or from examples found via information-retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization. We find that adding semantic facts actually does help! The approach improves performance in several different settings suggested by prior work, including for two different LLMs. In most cases, the improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation yields performance surpassing 30 BLEU.
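The augmentation idea is concrete enough to sketch. Below is a minimal, illustrative Python sketch (not the authors' implementation) of extracting the kinds of shallow semantic facts the abstract lists, namely parameter names, local variable names, and return expressions, using Python's ast module, and prepending them to a code-summarization prompt. The function names and prompt template here are assumptions for illustration only; in the paper's settings such facts would sit alongside few-shot examples, e.g. ones retrieved via BM25.

```python
# Illustrative sketch (not the authors' pipeline): gather shallow semantic
# facts from a function and prepend them to a summarization prompt.
import ast
import textwrap

def extract_facts(source: str) -> dict:
    """Collect 'quick-read' facts: parameters, locals, return expressions."""
    tree = ast.parse(textwrap.dedent(source))
    func = tree.body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("expected a single function definition")
    params = [a.arg for a in func.args.args]
    local_vars = sorted({
        node.id for node in ast.walk(func)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    })
    returns = [
        ast.unparse(node.value) for node in ast.walk(func)
        if isinstance(node, ast.Return) and node.value is not None
    ]
    return {"parameters": params, "locals": local_vars, "returns": returns}

def build_prompt(source: str) -> str:
    """Prepend the extracted facts to the code, then ask for a summary."""
    facts = extract_facts(source)
    fact_lines = "\n".join(
        f"# {name}: {', '.join(values)}" for name, values in facts.items() if values
    )
    code = textwrap.dedent(source).strip()
    return f"{fact_lines}\n{code}\n# Summarize the function above in one sentence."

if __name__ == "__main__":
    example = '''
    def mean(xs):
        total = sum(xs)
        return total / len(xs)
    '''
    print(build_prompt(example))
```

Run on the example, this produces a prompt whose first lines read "# parameters: xs", "# locals: total", and "# returns: total / len(xs)" ahead of the code. Rendering the facts as comments is just one design choice; the point is only that the facts are made explicit in the prompt rather than left for the model to infer.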

Authors (4)
  1. Toufique Ahmed
  2. Kunal Suresh Pai
  3. Premkumar Devanbu
  4. Earl T. Barr