The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Published 9 May 2023 in cs.CL, cs.AI, cs.PL, and cs.SE | (2305.06156v2)

Abstract: We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training LLMs to understand and generate code. We describe an extraction pipeline that combines rule-based and deep learning-based methods to ensure that samples contain high-quality pairs of code and text, resulting in a dataset of 43 million code-text pairs. Our extensive evaluations on common coding tasks, including code generation, code search, and code summarization, show that Code LLMs fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our dataset to assess the effects of various programming languages and docstrings on the performance of such models.
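
As a rough illustration of what rule-based extraction of code-text pairs can look like, the sketch below uses Python's standard `ast` module to collect function/docstring pairs from source code and applies one simple length filter. It is not The Vault's actual pipeline: the function name `extract_pairs`, the parameter `min_doc_words`, and the threshold of three words are all hypothetical choices made for this example, and the abstract does not specify which rules the authors use.

```python
import ast


def extract_pairs(source: str, min_doc_words: int = 3):
    """Return (function_name, docstring, code) tuples for documented functions.

    Hypothetical helper for illustration only; not the authors' method.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            # Assumed rule: skip functions with no docstring or a very short one.
            if doc is None or len(doc.split()) < min_doc_words:
                continue
            # Recover the exact source text of the function definition.
            code = ast.get_source_segment(source, node)
            pairs.append((node.name, doc, code))
    return pairs


SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def noop():
    """TODO"""
    pass

def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
```

In this sample only `add` passes the filter: `noop` has a one-word docstring and `undocumented` has none. A real pipeline would need language-specific parsers and many more rules, plus the learned filtering step the abstract mentions.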
