Mastering the Craft of Data Synthesis for CodeLLMs

Published 16 Oct 2024 in cs.SE and cs.AI (arXiv:2411.00005v3)

Abstract: LLMs have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
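
To make the survey's two pillars concrete, below is a minimal illustrative sketch (ours, not the paper's) of a common synthesize-then-filter loop: an LLM turns seed code into an instruction-solution pair, and an execution-based filter keeps only samples whose solutions pass tests. The `call_llm` stub is a hypothetical placeholder for any model API; its canned response lets the sketch run offline.

```python
"""Minimal sketch of a synthesize-then-filter data pipeline for code LLMs.

Assumptions (not from the paper): `call_llm` is a hypothetical stand-in
for a real LLM API client; the canned response below makes the script
runnable end to end without network access.
"""


def call_llm(prompt: str) -> str:
    # Hypothetical LLM call; replace with a real API client.
    return (
        "def add(a, b):\n"
        "    return a + b\n"
    )


def synthesize_sample(seed_snippet: str) -> dict:
    """Turn a seed code snippet into an instruction-solution pair."""
    prompt = f"Write a self-contained function inspired by:\n{seed_snippet}"
    return {"instruction": prompt, "solution": call_llm(prompt)}


def passes_execution_filter(sample: dict, tests: str) -> bool:
    """Execution-based filtering: keep a sample only if its solution passes tests."""
    namespace: dict = {}
    try:
        exec(sample["solution"], namespace)  # compile and run the candidate
        exec(tests, namespace)               # run the checks against it
        return True
    except Exception:
        return False


if __name__ == "__main__":
    seed = "x = 1 + 2"
    sample = synthesize_sample(seed)
    tests = "assert add(2, 3) == 5"
    kept = [sample] if passes_execution_filter(sample, tests) else []
    print(f"kept {len(kept)} of 1 synthesized samples")
```

In practice the filter stage would run synthesized code in a sandboxed subprocess with a timeout rather than a bare `exec`, since model-generated code is untrusted.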

