Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data (2312.02418v1)

Published 5 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of LLMs optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.

The research presented in this paper tackles the challenge of improving the quality of training data for LLMs, specifically those optimized for code generation. The capacity of LLMs to generate code has garnered significant attention for its potential to transform software development, but model performance and training efficiency depend heavily on the quality of the training data. Datasets compiled from public sources such as GitHub often contain inconsistencies, errors, and low-quality code snippets that can degrade training.

To address this, the paper introduces Synthetic Corruption Informed Pruning (SCIP), an approach that improves dataset quality by removing low-quality code. The key idea is to deliberately corrupt known-good code and observe where the corrupted examples land in embedding space, revealing characteristics of lower-quality data. Synthetic corruption introduces either syntax errors, such as removing a closing bracket, or content errors, such as altering an array index, creating a controlled contrast between high- and low-quality code snippets. In the embedding space produced by the pre-trained StarEncoder model, the corrupted code tends to fall into smaller clusters or to lie farther from cluster centroids.
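The corruption operators are described here only at a high level; a minimal sketch of what such operators could look like in Python is shown below. The function names and the choice of a single random edit per snippet are illustrative assumptions, not the paper's implementation.

```python
import random
import re

def corrupt_syntax(code: str) -> str:
    """Illustrative syntax corruption: delete one closing bracket/paren/brace."""
    positions = [i for i, ch in enumerate(code) if ch in ")]}"]
    if not positions:
        return code  # nothing to corrupt
    i = random.choice(positions)
    return code[:i] + code[i + 1:]

def corrupt_content(code: str) -> str:
    """Illustrative content corruption: shift one literal array index by +1."""
    matches = list(re.finditer(r"\[(\d+)\]", code))
    if not matches:
        return code
    m = random.choice(matches)
    shifted = str(int(m.group(1)) + 1)
    return code[:m.start(1)] + shifted + code[m.end(1):]

if __name__ == "__main__":
    snippet = "def first_two(xs):\n    return (xs[0], xs[1])\n"
    print(corrupt_syntax(snippet))   # drops a closing bracket -> syntax error
    print(corrupt_content(snippet))  # e.g. xs[0] -> xs[1] -> wrong semantics
```

Corrupted snippets like these can then be embedded alongside the original data to see where "known-bad" code concentrates in embedding space.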

The SCIP method operates by examining cluster sizes and the distances of data points to their cluster centroids within the embedding space, targeting data whose spatial properties resemble those of the synthetically corrupted code. The paper shows that pruning code snippets that fall into small clusters or lie far from their centroids yields cleaned datasets that improve LLM performance on the widely used HumanEval and MBPP code-generation benchmarks.
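A minimal sketch of this kind of embedding-space pruning is given below, assuming precomputed embeddings (e.g., from StarEncoder) and k-means clustering via scikit-learn. The function name, cluster count, and pruning fractions are illustrative assumptions; the paper's exact metrics and thresholds may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_corruption_signal(embeddings: np.ndarray,
                               n_clusters: int = 100,
                               small_cluster_frac: float = 0.1,
                               far_point_frac: float = 0.1) -> np.ndarray:
    """Return indices of examples to keep, dropping points in the smallest
    clusters and points farthest from their assigned centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    labels, centroids = km.labels_, km.cluster_centers_

    # Distance of every point to the centroid of its own cluster.
    dists = np.linalg.norm(embeddings - centroids[labels], axis=1)

    # 1) Drop every point that belongs to one of the smallest clusters.
    sizes = np.bincount(labels, minlength=n_clusters)
    n_small = max(1, int(small_cluster_frac * n_clusters))
    small_clusters = set(np.argsort(sizes)[:n_small])
    keep = np.array([lab not in small_clusters for lab in labels])

    # 2) Among the remaining points, drop those farthest from their centroid.
    cutoff = np.quantile(dists[keep], 1.0 - far_point_frac)
    keep &= dists <= cutoff
    return np.where(keep)[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(1000, 32))  # stand-in for real code embeddings
    kept = prune_by_corruption_signal(fake_embeddings, n_clusters=20)
    print(f"kept {len(kept)} of 1000 examples")
```

Both signals, small clusters and large centroid distances, are motivated by where the synthetically corrupted examples were observed to land.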

Results from this pruning strategy show that it not only improves benchmark performance but also improves training efficiency, with models requiring fewer training steps to reach baseline performance. The method surpasses existing embedding-based pruning approaches on both counts.

The implications of this research extend beyond code datasets. It illustrates the importance of rigorously examining and curating training data for AI models. The idea of using synthetically corrupted data as a signal for pruning could be applicable to a broader range of datasets, including those used for natural language processing tasks. This work opens the door for future studies to develop improved data pruning techniques that utilize synthetic corruption insights for various types of AI models, potentially leading to more accurate, reliable, and effective AI applications across different domains.

Authors (10)
  1. Yu Yang
  2. Aaditya K. Singh
  3. Mostafa Elhoushi
  4. Anas Mahmoud
  5. Kushal Tirumala
  6. Fabian Gloeckle
  7. Baptiste Rozière
  8. Carole-Jean Wu
  9. Ari S. Morcos
  10. Newsha Ardalani