On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (2306.13649v3)

Published 23 Jun 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (e.g., RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive LLMs on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.
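
Since the abstract describes the training scheme only at a high level, the following is a minimal PyTorch sketch of the idea, assuming `student` and `teacher` are autoregressive models mapping token ids to per-token logits and `sample_fn` is a hypothetical helper that decodes continuations from the student; the divergence shown is a beta-weighted Jensen-Shannon-style mixture, one example of the alternative losses the abstract mentions, not necessarily the exact objective used in the paper.

```python
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits, teacher_logits, beta=0.5):
    """Beta-weighted Jensen-Shannon-style divergence between per-token
    teacher and student distributions (one of several divergences GKD allows)."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    log_beta = torch.log(torch.tensor(beta))
    log_1mb = torch.log(torch.tensor(1.0 - beta))
    # Mixture M = beta * teacher + (1 - beta) * student, computed in log space.
    m_logp = torch.logsumexp(torch.stack([t_logp + log_beta, s_logp + log_1mb]), dim=0)
    kl_t_m = (t_logp.exp() * (t_logp - m_logp)).sum(-1)   # KL(teacher || M)
    kl_s_m = (s_logp.exp() * (s_logp - m_logp)).sum(-1)   # KL(student || M)
    return (beta * kl_t_m + (1.0 - beta) * kl_s_m).mean()

def gkd_step(student, teacher, prompts, targets, optimizer,
             lam=0.5, beta=0.5, sample_fn=None):
    """One distillation step: with probability `lam`, distill on sequences the
    student generates itself (on-policy); otherwise use fixed target sequences."""
    if sample_fn is not None and torch.rand(()) < lam:
        with torch.no_grad():
            continuations = sample_fn(student, prompts)    # self-generated outputs
    else:
        continuations = targets                            # fixed dataset outputs
    tokens = torch.cat([prompts, continuations], dim=1)    # [batch, prompt+output]
    student_logits = student(tokens)                       # [batch, seq, vocab]
    with torch.no_grad():
        teacher_logits = teacher(tokens)                   # teacher feedback on same tokens
    p_len = prompts.shape[1]
    # Match distributions only at continuation positions (logits at t predict token t+1).
    loss = generalized_jsd(student_logits[:, p_len - 1:-1],
                           teacher_logits[:, p_len - 1:-1], beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Setting `lam=1` makes every step fully on-policy; up to scaling, small `beta` behaves like a forward KL (mass-covering) while `beta` near 1 behaves like a reverse KL (mode-seeking), which is the kind of flexibility the abstract alludes to when the student cannot represent the full teacher distribution.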

Analysis of Formatting Instructions for ICLR 2024 Conference Submissions

The paper, titled "Formatting Instructions for ICLR 2024 Conference Submissions," serves as a comprehensive guideline for researchers preparing manuscripts for the International Conference on Learning Representations (ICLR) in 2024. The document meticulously details the structural and stylistic requirements that ensure uniformity and clarity across submissions.

Key Elements of the Paper

The paper outlines several critical elements that authors must adhere to:

  1. Submission Process: It emphasizes the necessity of electronic submissions through the OpenReview platform. This digital submission process is designed to streamline the review workflow.
  2. Formatting Specifications: Authors are required to follow a modified NeurIPS format. Specific guidelines on paper dimensions, typeface usage, and page limits (9-page main text) are provided. This ensures that submissions are consistent and easily navigable for reviewers.
  3. Style File Usage: The document stresses the importance of using the ICLR-specific LaTeX style files, which are accessible online. Deviating from these files may result in rejection, underscoring the conference's commitment to consistent formatting standards; a minimal source skeleton is sketched after this list.
  4. Headings and Structure: A hierarchical structure is laid out for headings, comprising three levels, each with distinct formatting specifications. This facilitates a logical and coherent presentation of the content.
  5. Figures and Tables: Detailed instructions are given on the inclusion and formatting of figures and tables, which must be neat, centered, and appropriately captioned.
  6. References and Citations: Guidelines are given for citing works with the natbib package, standardizing referencing and supporting clear scholarly communication.
  7. Standardized Notations: The inclusion of standardized mathematical notations from the Deep Learning textbook promotes consistency in mathematical expressions, enhancing interpretability across works.
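
As a rough illustration of items 3, 4, 6, and 7, a submission source under these guidelines looks something like the sketch below; the style-file names follow the publicly distributed ICLR 2024 template bundle, while the citation keys and figure file are placeholders, so the official bundle should be consulted rather than this sketch.

```latex
% Minimal submission skeleton; style-file names follow the public ICLR 2024 bundle,
% citation keys and the figure file are placeholders.
\documentclass{article}
\usepackage{iclr2024_conference}  % ICLR style file; deviating from it risks rejection
\input{math_commands.tex}         % standardized notation from the Deep Learning textbook
\usepackage{natbib}               % \citet / \citep citation commands
\usepackage{graphicx}

\title{Formatting Instructions for ICLR 2024 Conference Submissions}
\author{Anonymous authors}        % anonymized for double-blind review

\begin{document}
\maketitle

\section{First-level heading}
\subsection{Second-level heading}
\subsubsection{Third-level heading}
Cite in text with \citet{placeholder2024} or parenthetically with \citep{placeholder2024}.

\begin{figure}[t]
  \centering
  \includegraphics[width=0.6\linewidth]{figure.pdf}
  \caption{Figures should be neat, centered, and captioned.}
\end{figure}

\bibliography{references}
\bibliographystyle{iclr2024_conference}
\end{document}
```

Compiling against the official bundle is the authoritative check; this sketch only shows where each guideline from the list above lands in the source.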

Implications and Future Directions

The instructions presented in the paper hold substantial significance for the academic community. By enforcing a standardized format, ICLR ensures that the focus remains on the scientific content rather than presentation discrepancies. This alignment facilitates efficient peer review and uniform accessibility for published papers.

Additionally, the emphasis on precise formatting could encourage other conferences to adopt similar approaches, fostering a broader culture of rigor in research presentations across the field.

Looking ahead, as conferences grow and adapt, we can expect further automation of the submission and formatting process. Tools that automatically check compliance with formatting guidelines would further reduce the administrative burden on researchers.

The ongoing evolution of digital submission platforms could also see enhancements in collaboration tools, allowing for more seamless interactions between authors, reviewers, and conference organizers. As AI research continues to expand, standardizing these processes will remain crucial for maintaining quality and integrity in scholarly communications.

Overall, these formatting instructions not only guide researchers in structuring their submissions but also reflect ICLR's dedication to promoting high-quality, accessible research.

Authors (7)
  1. Rishabh Agarwal (47 papers)
  2. Nino Vieillard (22 papers)
  3. Yongchao Zhou (7 papers)
  4. Piotr Stanczyk (12 papers)
  5. Sabela Ramos (10 papers)
  6. Matthieu Geist (93 papers)
  7. Olivier Bachem (52 papers)
Citations (46)