Language models scale reliably with over-training and on downstream tasks (2403.08540v2)

Published 13 Mar 2024 in cs.CL and cs.LG

Abstract: Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how LLMs are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20× less compute. Our experiments are available at https://github.com/mlfoundations/scaling.

Scaling Laws for Over-trained LLMs and Downstream Task Prediction

Introduction to Scaling in the Over-trained Regime and Downstream Performance Prediction

In machine learning, and particularly in the study of LLMs, understanding how models behave as they scale is crucial for both theoretical insight and practical application. Recent research has made significant progress in characterizing how LLMs scale, especially in the compute-optimal ("Chinchilla optimal") training regime. However, gaps remain in our understanding: models are often over-trained to reduce inference costs, and scaling laws are typically fit to next-token loss rather than to the downstream task performance on which models are ultimately compared.

This work bridges these gaps with an extensive set of experiments characterizing the behavior of over-trained LLMs and their performance on downstream tasks. Using a testbed of 104 models, ranging from 0.011B to 6.9B parameters and trained with varying numbers of tokens on three data distributions, we derive and validate scaling laws that accurately predict both over-trained model loss and downstream task performance.
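
To make the fitting procedure concrete, the following is a minimal sketch of fitting a Chinchilla-style parametric loss law to small-scale runs and extrapolating to a larger model. The functional form L(N, D) = E + A·N^(-alpha) + B·D^(-beta), the synthetic grid of model sizes and token counts, and all numeric values are illustrative assumptions; the paper's exact parameterization and fitting procedure may differ.

    # A minimal sketch (not the paper's code): fit a Chinchilla-style loss law
    # to small-scale runs, then extrapolate to a larger, over-trained run.
    import numpy as np
    from scipy.optimize import curve_fit

    def parametric_loss(x, E, A, B, alpha, beta):
        """Predicted validation loss for N parameters and D training tokens."""
        N, D = x
        return E + A * N ** (-alpha) + B * D ** (-beta)

    # Hypothetical small-scale grid (0.011B-0.41B params); losses are
    # synthesized from known parameters purely so the example is runnable.
    N_small = np.array([1.1e7, 7.9e7, 1.54e8, 4.11e8, 1.1e7, 7.9e7, 1.54e8, 4.11e8])
    D_small = np.array([2.2e9, 1.6e10, 3.1e10, 8.2e10, 8.8e9, 6.4e10, 1.2e11, 3.3e11])
    loss_small = parametric_loss((N_small, D_small), E=2.0, A=406.0, B=411.0,
                                 alpha=0.34, beta=0.28)

    # Fit the five free parameters on the small-scale data only.
    popt, _ = curve_fit(parametric_loss, (N_small, D_small), loss_small,
                        p0=[1.5, 300.0, 300.0, 0.3, 0.3], maxfev=50000)

    # Extrapolate to a 1.4B-parameter model trained on 900B tokens.
    pred = parametric_loss((1.4e9, 9.0e11), *popt)
    print(f"predicted validation loss: {pred:.3f}")

In practice the fit would use measured losses from real small-scale runs rather than synthetic values, and the extrapolated prediction would be compared against the large run once trained.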

Over-training and Its Predictable Nature

Investigating the over-trained regime, we find that models display consistent scaling trends even when the volume of training data significantly exceeds the compute-optimal amount. Both the validation loss and the downstream task performance of these models can be accurately predicted by fitting scaling laws to small-scale experimental data. Notably, we predict the validation loss of a 1.4B-parameter model trained on 900B tokens (roughly 32× over-trained) and of a compute-optimal 6.9B-parameter model trained on 138B tokens, in each case using experiments that take about 300× less compute than training the target model directly.
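
As a quick sanity check on the "32× over-trained" figure quoted above, the token multiplier can be worked out with back-of-the-envelope arithmetic, assuming the common Chinchilla heuristic of roughly 20 training tokens per parameter as the compute-optimal baseline (the paper defines its own token multiplier, which may not match this heuristic exactly):

    # Rough check of the ~32x over-training factor for the 1.4B / 900B-token run.
    params = 1.4e9                       # model size of the over-trained run
    tokens = 900e9                       # tokens it is trained on
    chinchilla_tokens = 20 * params      # ~28B tokens would be compute-optimal
    print(tokens / chinchilla_tokens)    # ~32.1, i.e. roughly 32x over-trained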

Implications for Downstream Task Performance

Furthermore, we propose a power-law relationship between the perplexity of a language model and its performance on downstream tasks. This finding is pivotal because it allows downstream performance, specifically top-1 error averaged over a suite of tasks, to be predicted from a model's perplexity alone, offering a computationally efficient way to estimate the practical utility of LLMs in real-world applications.
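
The following sketch shows how such a relationship can be fit in practice: average top-1 error is modeled as a power law in perplexity, err = eps - k·ppl^(-gamma), which is equivalent to eps - k·exp(-gamma·loss) when written in terms of validation loss. The functional form and every number below are illustrative assumptions, not the paper's exact parameterization or data.

    # Illustrative sketch: relate validation loss to averaged downstream top-1
    # error via a saturating power law in perplexity (all values made up).
    import numpy as np
    from scipy.optimize import curve_fit

    def top1_error(loss, eps, k, gamma):
        """Average top-1 error as a function of validation loss (ppl = e^loss)."""
        return eps - k * np.exp(-gamma * loss)

    # Hypothetical (validation loss, averaged top-1 error) pairs from small models.
    loss = np.array([3.9, 3.5, 3.2, 3.0, 2.8, 2.7])
    err = np.array([0.77, 0.72, 0.67, 0.63, 0.58, 0.55])

    popt, _ = curve_fit(top1_error, loss, err, p0=[0.9, 3.0, 0.8], maxfev=20000)

    # Predict the averaged top-1 error of a larger run from its (predicted) loss.
    print(f"predicted top-1 error at loss 2.4: {top1_error(2.4, *popt):.3f}")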

Theoretical and Practical Contributions

Theoretically, this work enhances our understanding of LLM behavior in the over-trained regime, offering insight into how these models scale. Practically, it provides a tool for predicting downstream task performance from small-scale experiments, enabling more efficient resource allocation when planning large training runs.

Future Directions

This research opens several avenues for future exploration, including refining the scaling laws to incorporate the effects of hyperparameter choices, validating the current findings with even larger models, and extending these laws to predict model performance on individual downstream tasks. Moreover, investigating the application of these scaling laws in the context of models fine-tuned with supervised or reinforcement learning methods could further augment their utility in applied settings.

Conclusion

Our experiments detail the scaling behavior of over-trained models and quantify the relationship between model perplexity and downstream task performance. These contributions advance our theoretical understanding of LLM scaling laws and offer practical tools for predicting the performance of these models in real-world applications.

Authors (25)
  1. Samir Yitzhak Gadre
  2. Georgios Smyrnis
  3. Vaishaal Shankar
  4. Suchin Gururangan
  5. Mitchell Wortsman
  6. Rulin Shao
  7. Jean Mercat
  8. Alex Fang
  9. Jeffrey Li
  10. Sedrick Keh
  11. Rui Xin
  12. Marianna Nezhurina
  13. Igor Vasiljevic
  14. Jenia Jitsev
  15. Alexandros G. Dimakis
  16. Gabriel Ilharco
  17. Shuran Song
  18. Thomas Kollar
  19. Yair Carmon
  20. Achal Dave