Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model (2311.16267v2)
Abstract: We present four main contributions to enhance the performance of LLMs in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the LLM-driven Chain of Density for Renovation Credibility (CoDRC) and the Adaptive Text Renovation (ATR) algorithm for assessing the reliability of renovated data; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) prompting technique; and (iv) effectively refactoring existing scripts to generate new, high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data preprocessing method for expanding and categorizing scripts. Combined with IKEC, these techniques improve the Retrieval-Augmented Generation (RAG) method's ability to retrieve relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" on code-generation problems in MapReduce applications.
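To make the abstract's pipeline concrete, here is a minimal, self-contained sketch of how the described pieces fit together: scripts are split into chunks, embedded, retrieved by similarity for RAG, and generated code is scored by the "Percentage of Correct Lines" metric. All names here (`split_scripts`, `embed`, `retrieve`, `percentage_of_correct_lines`) and the trigram-hash embedding are illustrative stand-ins, not the paper's implementation; in particular, the LLM-based splitting and renovation steps are approximated by simple heuristics.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashes character
    trigrams into a fixed-size vector. A production pipeline would
    call an actual embedding model here."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def split_scripts(scripts):
    """Approximates LLM-based data splitting by breaking each script
    into function-sized chunks on blank lines; the paper delegates
    this step to an LLM."""
    chunks = []
    for s in scripts:
        chunks.extend(c.strip() for c in s.split("\n\n") if c.strip())
    return chunks

def retrieve(query: str, chunks, k: int = 3):
    """RAG retrieval step: rank chunks by cosine similarity
    (dot product of unit vectors) to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: -float(q @ embed(c)))
    return ranked[:k]

def percentage_of_correct_lines(generated: str, reference: str) -> float:
    """One plausible reading of 'Percentage of Correct Lines': the
    share of non-empty generated lines that also appear in the
    reference solution, as a percentage."""
    gen = [l.strip() for l in generated.splitlines() if l.strip()]
    ref = {l.strip() for l in reference.splitlines() if l.strip()}
    if not gen:
        return 0.0
    return 100.0 * sum(l in ref for l in gen) / len(gen)

# Example: retrieve context for a MapReduce-style query, then score output.
corpus = ["def mapper(row):\n    return row.power\n\ndef reducer(a, b):\n    return a + b"]
chunks = split_scripts(corpus)
print(retrieve("sum power values with map and reduce", chunks, k=2))
print(percentage_of_correct_lines("return a + b",
                                  "def reducer(a, b):\n    return a + b"))
```

In the paper's setting, the retrieved chunks would be renovated and credibility-checked (CoDRC/ATR) before being injected into an IKEC-style prompt; this sketch only shows the retrieval and scoring skeleton around those steps.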