StarCoder 2 and The Stack v2: The Next Generation (2402.19173v1)

Published 29 Feb 2024 in cs.SE and cs.AI

Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of LLMs for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

StarCoder 2 and The Stack v2: Advancing the Frontiers of Code Generation LLMs

The BigCode project, an open scientific collaboration focused on the responsible development of LLMs for Code (Code LLMs), recently introduced StarCoder2. This initiative marks a significant advancement in code generation LLMs, extending the foundational work on the original StarCoder model and The Stack dataset. In partnership with Software Heritage, the project has developed The Stack v2, a vastly expanded corpus for training code generation models. This blog post presents a comprehensive overview of StarCoder2, the development of The Stack v2, and the evaluations performed to gauge the models' capabilities.

Introduction to StarCoder 2

StarCoder2 encompasses a family of models with 3B, 7B, and 15B parameters, each trained on 3.3 to 4.3 trillion tokens. The training set, rooted in the Software Heritage archive and supplemented with other high-quality datasets, is roughly four times larger than the original StarCoder dataset and spans 619 programming languages, yielding significant performance improvements over the first generation.
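Because the model weights are openly released, the models can be tried with standard open-source tooling. The following is a minimal sketch using the Hugging Face `transformers` library with the published `bigcode/starcoder2-3b` checkpoint; the precision and device settings are illustrative choices for the example, not recommendations from the paper.

```python
# Minimal sketch: load a StarCoder2 checkpoint and complete a code prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # 7B and 15B variants are also published

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # half precision to fit comfortably on one GPU
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```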

The Development of The Stack v2

The Stack v2 builds upon the digital commons of Software Heritage's source code archive, enhanced with additional data sources such as GitHub pull requests, Kaggle notebooks, and code documentation. This carefully curated and cleaned dataset is four times larger than the first version of The Stack, facilitating the training of more capable models.
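The dataset is published on the Hugging Face Hub as `bigcode/the-stack-v2`. A hedged sketch of streaming its records follows; note that access is gated behind the dataset's terms of use, the records carry source-code identifiers and metadata rather than raw file contents, and no particular schema is assumed here, so the code only lists whatever keys each record carries.

```python
# Sketch: stream a few metadata records from The Stack v2 without
# downloading the full index (requires accepting the dataset's terms
# and authenticating with the Hugging Face Hub).
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

for record in ds.take(3):
    # Exact keys depend on the published schema (e.g. repository, path,
    # license, SWHID-related identifiers); inspect rather than assume.
    print(sorted(record.keys()))
```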

Evaluation and Benchmarks

StarCoder2 models were evaluated against a suite of benchmarks covering code completion, code fixing and editing, mathematical reasoning, and more. These evaluations show that the smaller StarCoder2-3B outperforms other models of similar size on most benchmarks and even surpasses StarCoderBase-15B, a model five times its size. The largest in the family, StarCoder2-15B, matches or outperforms models more than twice its size, such as CodeLlama-34B, and outperforms DeepSeekCoder-33B on math and code reasoning benchmarks as well as several low-resource languages.
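Many code-completion benchmarks in such suites report pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a sketch of the standard unbiased estimator introduced with HumanEval (Chen et al., 2021); the sample counts in the usage example are illustrative only.

```python
# Unbiased pass@k estimator: given n samples per problem of which c pass,
# pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a running product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples per problem, 37 of them passing.
print(pass_at_k(200, 37, 1))   # 0.185 (equals c/n for k=1)
print(pass_at_k(200, 37, 10))  # ~0.88
```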

Repository-Level Code Completion

Focusing on practical applications, the models were assessed on their capability to perform code completion at the repository level, demonstrating significant improvements over earlier models. These improvements are credited to the methodology employed in creating The Stack v2 and the robust training approach that leveraged this expansive dataset.
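As a rough illustration of what repository-level completion involves, the sketch below concatenates several files from one repository into a single context using sentinel tokens. The literal token strings `<repo_name>` and `<file_sep>` are assumptions about the training format and should be checked against the released tokenizer's special-token map.

```python
# Hedged sketch: build a repository-level prompt by joining files from the
# same repository with assumed sentinel tokens.
REPO_TOKEN = "<repo_name>"   # assumed repository-name sentinel
FILE_SEP = "<file_sep>"      # assumed file-separator sentinel

def build_repo_prompt(repo: str, files: dict[str, str]) -> str:
    """Concatenate repository files into one training-style context."""
    parts = [f"{REPO_TOKEN}{repo}"]
    for path, code in files.items():
        parts.append(f"{FILE_SEP}{path}\n{code}")
    return "".join(parts)

prompt = build_repo_prompt(
    "example/calculator",  # hypothetical repository
    {
        "calculator/add.py": "def add(a, b):\n    return a + b\n",
        "calculator/cli.py": "from calculator.add import add\n\ndef main():",
    },
)
# The model is then asked to continue the final, incomplete file, with the
# other files available as cross-file context.
```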

Advancements and Social Impact

The development of StarCoder2 and The Stack v2 reflects the BigCode project's commitment to open science, ethical data sourcing, and accelerating research on Code LLMs. By ensuring transparency in the training data and providing open access to model weights, the project helps democratize AI advancements and fosters responsible AI development. It also addresses challenges around privacy, security, and societal and representational bias, underscoring the importance of balanced and mindful technological progress.
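That training-data transparency is concrete: the released SoftWare Heritage persistent IDentifiers (SWHIDs) let each source file be traced back to the archive. As a small illustration, the SWHID of a content object is defined in the public SWHID specification as the Git-style SHA-1 of the blob, prefixed with `swh:1:cnt:`, so it can be recomputed locally:

```python
# Compute the SWHID of a content object: SHA-1 over the Git blob header
# ("blob <length>\0") followed by the raw bytes, per the SWHID spec.
import hashlib

def content_swhid(data: bytes) -> str:
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

print(content_swhid(b"print('hello world')\n"))
# -> swh:1:cnt:<40 hex digits>, identical to the hash Git assigns the blob
```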

Conclusion

StarCoder2 represents a leap forward in the domain of code generation with LLMs, supported by the extensive dataset provided by The Stack v2. These advancements showcase the potential of collaborative, open scientific projects in pushing the boundaries of AI and providing the groundwork for future innovations. As the BigCode project continues to evolve, it remains centered on the pillars of responsible development, open access, and community engagement, paving the way for more inclusive and ethically considered advancements in AI.

Acknowledgements

This work is a testament to the collaborative spirit of the BigCode community, Software Heritage, and all contributors across the globe. It is a powerful example of what can be achieved when the scientific community comes together in pursuit of open, responsible technological advancement.

Authors (66)
  1. Anton Lozhkov
  2. Raymond Li
  3. Loubna Ben Allal
  4. Federico Cassano
  5. Joel Lamy-Poirier
  6. Nouamane Tazi
  7. Ao Tang
  8. Dmytro Pykhtar
  9. Jiawei Liu
  10. Yuxiang Wei
  11. Tianyang Liu
  12. Max Tian
  13. Denis Kocetkov
  14. Arthur Zucker
  15. Younes Belkada
  16. Zijian Wang
  17. Qian Liu
  18. Dmitry Abulkhanov
  19. Indraneil Paul
  20. Zhuang Li
Citations (174)