A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs (2410.18779v1)

Published 24 Oct 2024 in cs.LG and cs.CL

Abstract: A primary challenge in LLM development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.

An Overview of "A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs"

The paper addresses the computational challenges faced in pre-training LLMs by introducing an approach that leverages small language models (SLMs) to enhance the efficiency and quality of the training process. This paper proposes a methodology wherein SLMs are utilized to provide soft labels and select informative training examples, ultimately facilitating a more effective transfer of information to LLMs.

The researchers present both empirical and theoretical frameworks to validate this paradigm. Empirically, the approach demonstrates a reduction in training time coupled with improvements in model quality. The SLM-assisted training method is tested with a 2.8B parameter LLM utilizing a 1.5B parameter SLM on the Pile dataset, yielding superior results compared to conventional training methods.

Methodology

  1. Soft Labels and Data Selection: SLMs are employed to generate soft labels, offering supplementary supervision during training. Additionally, these models assist in identifying valuable subsets of training data that are both informative and challenging (a minimal sketch of this setup follows this list).
  2. Statistical Framework: The theoretical model characterizes how SLM-provided supervision, though seemingly low in quality, can still benefit a more capable LLM when the bias it introduces is balanced against the variance reduction it offers.
  3. Adaptive Utilization: The importance of adaptively utilizing SLM-derived supervision is emphasized, suggesting that the focus should be on scenarios where SLM predictions align closely with the true data distribution.
  4. Knowledge Distillation (KD): Through KD, the paper extends the classic teacher-student model, showing improved training effectiveness even when using a weaker teacher model (SLM) to guide a stronger student (LLM).
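
Below is a minimal sketch, in plain NumPy, of how the SLM's soft labels and SLM-based data selection could fit together. The function names, the mixing weight lambda_kd, the temperature, and the use of the SLM's own loss as a hardness proxy are illustrative assumptions for exposition, not the paper's exact recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slm_assisted_loss(llm_logits, slm_logits, targets, lambda_kd=0.5, temperature=1.0):
    """Per-token loss mixing hard next-token targets with SLM soft labels.

    lambda_kd (an illustrative knob, not a value from the paper) controls how much
    weight the SLM's predictive distribution receives relative to the one-hot target.
    """
    log_p = np.log(softmax(llm_logits, temperature) + 1e-12)  # student (LLM) log-probs
    q = softmax(slm_logits, temperature)                       # SLM soft labels

    # Standard self-supervised cross-entropy against the ground-truth next tokens.
    ce = -np.take_along_axis(log_p, targets[..., None], axis=-1).squeeze(-1)
    # Distillation term: cross-entropy against the SLM's soft labels.
    kd = -(q * log_p).sum(axis=-1)
    return (1.0 - lambda_kd) * ce + lambda_kd * kd

def select_hard_examples(slm_sequence_losses, keep_fraction=0.4):
    """Keep the sequences on which the SLM itself does worst: a simple proxy for
    'informative and hard' examples (one plausible selection rule, not necessarily
    the paper's exact criterion)."""
    k = max(1, int(keep_fraction * len(slm_sequence_losses)))
    return np.argsort(slm_sequence_losses)[-k:]
```

In a training loop, one would score candidate sequences with the SLM, keep the indices returned by select_hard_examples, and optimize the LLM on that subset with slm_assisted_loss.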

Key Findings

  • The empirical results show that the proposed method reduces LLM training time while enhancing performance metrics, such as few-shot learning accuracy.
  • Adaptive use of SLM supervision lets the LLM rely on the SLM's soft labels for easier regions of the data early on, then shift toward standard self-supervised training in later phases (a schedule sketch follows this list).
  • Leveraging the SLM during the early stages of LLM pre-training helps the LLM acquire simpler patterns quickly, reducing overall computational demands.
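
The adaptive-utilization finding above suggests phasing out the SLM's supervision as training progresses. The following schedule is a hedged sketch of one way to do this; the linear decay shape and the warm_fraction and max_weight values are hypothetical choices, not settings reported in the paper.

```python
def kd_weight_schedule(step, total_steps, warm_fraction=0.2, max_weight=0.5):
    """Hold the SLM soft-label weight at max_weight during an early phase, then
    decay it linearly to zero so the LLM finishes training on the standard
    next-token objective alone. All knobs here are illustrative assumptions."""
    warm_steps = int(warm_fraction * total_steps)
    if step < warm_steps:
        return max_weight
    remaining = max(total_steps - warm_steps, 1)
    return max_weight * max(1.0 - (step - warm_steps) / remaining, 0.0)
```

At each step, the returned weight could be passed as lambda_kd to a mixed loss like the one sketched under Methodology.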

Implications

The research provides a pragmatic approach to LLM training, suggesting a pathway to achieving efficient computation without sacrificing model quality. The implications are particularly meaningful given the substantial resources typically required for LLM development.

Future Directions: Using SLMs in this capacity sets a precedent for further work along these lines; in particular, seeking training recipes and architectures that combine the efficiency of SLMs with the capabilities of LLMs is a promising avenue.

In summary, this paper demonstrates how small models, traditionally overshadowed by their larger counterparts, can play a pivotal role in making LLM training more efficient. By leveraging targeted supervision from an SLM, it presents a technique that holds significant promise and paves the way for further exploration into scalable LLM development practices.

Authors (15)
  1. Ankit Singh Rawat
  2. Veeranjaneyulu Sadhanala
  3. Afshin Rostamizadeh
  4. Ayan Chakrabarti
  5. Wittawat Jitkrittum
  6. Vladimir Feinberg
  7. Seungyeon Kim
  8. Hrayr Harutyunyan
  9. Nikunj Saunshi
  10. Zachary Nado
  11. Rakesh Shivanna
  12. Sashank J. Reddi
  13. Aditya Krishna Menon
  14. Rohan Anil
  15. Sanjiv Kumar