Towards a theory of model distillation (2403.09053v2)

Published 14 Mar 2024 in cs.LG, cs.AI, and cs.NE

Abstract: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open. To study these questions, we initiate a general theory of distillation, defining PAC-distillation in an analogous way to PAC-learning [Val84]. As applications of this theory: (1) we propose new algorithms to extract the knowledge stored in the trained weights of neural networks -- we show how to efficiently distill neural networks into succinct, explicit decision tree representations when possible by using the "linear representation hypothesis"; and (2) we prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
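To make the distillation task concrete, the Python sketch below fits a simple "student" (a shallow decision tree) to the predictions of a trained "teacher" network on fresh inputs. This is a generic sketch in the spirit of the tree-extraction works cited below, not the paper's PAC-distillation algorithms; the synthetic dataset, model sizes, and scikit-learn estimators are assumptions chosen purely for illustration.

# Generic teacher-to-tree distillation sketch (illustrative only; not the
# paper's algorithm). Assumes scikit-learn is available.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the underlying input distribution.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_distill, y_train, _ = train_test_split(X, y, test_size=0.5, random_state=0)

# Teacher: the "complicated" source model.
teacher = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
teacher.fit(X_train, y_train)

# Distillation: the student is trained only on the teacher's labels for
# unlabeled inputs, not on ground-truth labels, which is part of why
# distillation can be cheaper than learning from scratch.
teacher_labels = teacher.predict(X_distill)
student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X_distill, teacher_labels)

# Proxy for distillation error: disagreement between student and teacher.
agreement = np.mean(student.predict(X_distill) == teacher_labels)
print(f"student agrees with teacher on {agreement:.1%} of distillation inputs")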

References (108)
  1. Neural network learning: Theoretical foundations, volume 9. Cambridge University Press, 1999.
  2. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
  3. On the non-universality of deep learning: quantifying the cost of symmetry. Advances in Neural Information Processing Systems, 35:17188–17201, 2022.
  4. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989–27002, 2021.
  5. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
  6. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. In The Thirty Sixth Annual Conference on Learning Theory, pages 2552–2623. PMLR, 2023.
  7. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
  8. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-based systems, 8(6):373–389, 1995.
  9. Physics of language models: Part 1, context-free grammar. arXiv preprint arXiv:2305.13673, 2023.
  10. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023.
  11. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399, 2016.
  12. Inspecting the concept knowledge graph encoded by modern language models. arXiv preprint arXiv:2105.13471, 2021.
  13. D. Angluin. Remarks on the difficulty of finding a minimal disjunctive normal form for Boolean functions. Unpublished manuscript.
  14. SGD with large step sizes learns sparse features. In International Conference on Machine Learning, pages 903–925. PMLR, 2023.
  15. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.
  16. Francis Bach. Learning theory from first principles. Draft of a book, version of September 6, 2021.
  17. On learning Gaussian multi-index models with gradient flow. arXiv preprint arXiv:2310.19793, 2023.
  18. Exact learning of juntas from membership queries. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19-21, 2016, Proceedings 27, pages 115–129. Springer, 2016.
  19. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
  20. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29, 2016.
  21. Hidden progress in deep learning: SGD learns parities near the computational limit. Advances in Neural Information Processing Systems, 35:21750–21764, 2022.
  22. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  23. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems, 35:37932–37946, 2022.
  24. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 253–262, 1994.
  25. Classification and regression trees. 1984.
  26. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146, 2020.
  27. Interpretability via model extraction. arXiv preprint arXiv:1706.09773, 2017.
  28. Transformers learn through gradual rank increase. arXiv preprint arXiv:2306.07042, 2023.
  29. Properly learning decision trees in almost polynomial time. Journal of the ACM, 69(6):1–19, 2022.
  30. GULP: a prediction-based metric between representations. Advances in Neural Information Processing Systems, 35:7115–7127, 2022.
  31. Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report, 1(2):4, 1996.
  32. LEACE: Perfect linear concept erasure in closed form. arXiv preprint arXiv:2306.03819, 2023.
  33. Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  34. Word association norms, mutual information, and lexicography. Computational linguistics, 16(1):22–29, 1990.
  35. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070, 2018.
  36. Using sampling and queries to extract rules from trained neural networks. In Machine learning proceedings 1994, pages 37–45. Elsevier, 1994.
  37. Extracting tree-structured representations of trained networks. Advances in neural information processing systems, 8, 1995.
  38. Interpretable by design: Learning predictors by composing interpretable queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7430–7443, 2022.
  39. Peter Damaschke. Adaptive versus nonadaptive attribute-efficient learning. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 590–596, 1998.
  40. Peter Damaschke. Computational aspects of parallel attribute-efficient learning. In International Conference on Algorithmic Learning Theory, pages 103–111. Springer, 1998.
  41. Optimal two-stage algorithms for group testing problems. SIAM Journal on Computing, 34(5):1253–1270, 2005.
  42. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.
  43. Learning two-layer neural networks, one (giant) step at a time. arXiv preprint arXiv:2305.18270, 2023.
  44. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pages 5413–5452. PMLR, 2022.
  45. Pareto frontiers in neural feature learning: Data, compute, width, and luck. arXiv preprint arXiv:2309.03800, 2023.
  46. Learning decision trees from random examples. Information and Computation, 82(3):231–246, 1989.
  47. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
  48. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  49. Christiane Fellbaum. WordNet: An electronic lexical database. MIT press, 1998.
  50. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
  51. Deep learning with limited numerical precision. In International conference on machine learning, pages 1737–1746. PMLR, 2015.
  52. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
  53. Exact learning when irrelevant variables abound. Information Processing Letters, 70(5):233–239, 1999.
  54. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819, 2021.
  55. Inspecting and editing knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023.
  56. Adaptive wavelet distillation from neural networks through interpretations. Advances in Neural Information Processing Systems, 34:20669–20682, 2021.
  57. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  58. One tree to explain them all. In 2011 IEEE Congress of Evolutionary Computation (CEC), pages 1444–1451. IEEE, 2011.
  59. Learning decision trees using the Fourier spectrum. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 455–464, 1991.
  60. Properly learning decision trees with queries is NP-hard. arXiv preprint arXiv:2307.04093, 2023.
  61. Superpolynomial lower bounds for decision tree learning and testing. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1962–1994. SIAM, 2023.
  62. An introduction to computational learning theory. MIT press, 1994.
  63. Structure compilation: trading structure for features. In Proceedings of the 25th international conference on Machine learning, pages 592–599, 2008.
  64. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.
  65. Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737, 2021.
  66. Purifying interaction effects with the functional ANOVA: An efficient algorithm for recovering identifiable additive models. In International Conference on Artificial Intelligence and Statistics, pages 2402–2412. PMLR, 2020.
  67. Clayton McMillan. Rule induction in a neural network through integrated symbolic and subsymbolic processing. University of Colorado at Boulder, 1992.
  68. Neural networks efficiently learn low-dimensional representations with SGD. arXiv preprint arXiv:2209.14863, 2022.
  69. Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019.
  70. Rule induction through integrated symbolic and subsymbolic processing. Advances in neural information processing systems, 4, 1991.
  71. Laughing hyena distillery: Extracting compact recurrences from convolutions. arXiv preprint arXiv:2310.18780, 2023.
  72. Decision tree approximations of Boolean functions. Theoretical Computer Science, 270(1-2):609–623, 2002.
  73. Foundations of machine learning. MIT press, 2018.
  74. Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592, 2019.
  75. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
  76. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
  77. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023.
  78. Revisiting self-distillation. arXiv preprint arXiv:2206.08491, 2022.
  79. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.
  80. Towards understanding knowledge distillation. In International conference on machine learning, pages 5142–5151. PMLR, 2019.
  81. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
  82. Computational limitations on learning from examples. Journal of the ACM (JACM), 35(4):965–984, 1988.
  83. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys, 16:1–85, 2022.
  84. Log-linear guardedness and its implications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9413–9431, 2023.
  85. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
  86. Linear adversarial concept erasure. In International Conference on Machine Learning, pages 18400–18421. PMLR, 2022.
  87. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. arXiv preprint arXiv:2312.13558, 2023.
  88. Does string-based neural MT learn source syntax? In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1526–1534, 2016.
  89. Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 303–310, 2018.
  90. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.
  91. Sebastian Thrun. Extracting provably correct rules from artificial neural networks. Citeseer, 1993.
  92. Access to unlabeled data can speed up prediction time. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 641–648, 2011.
  93. Seeing the forest through the trees: Learning a comprehensible model from an ensemble. In Machine Learning: ECML 2007: 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007. Proceedings 18, pages 418–429. Springer, 2007.
  94. Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  95. V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.
  96. The method of ordered risk minimization, I. Avtomatika i Telemekhanika, 8:21–30, 1974.
  97. Exploring the linear subspace hypothesis in gender bias mitigation. arXiv preprint arXiv:2009.09435, 2020.
  98. Jesse Vig. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714, 2019.
  99. A genetic algorithm for interpretable model extraction from decision tree ensembles. In Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2017 Workshops, MLSDA, BDM, DM-BPM Jeju, South Korea, May 23, 2017, Revised Selected Papers 21, pages 104–115. Springer, 2017.
  100. Concept algebra for score-based conditional models. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
  101. Thinking like transformers. In International Conference on Machine Learning, pages 11080–11090. PMLR, 2021.
  102. Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars. arXiv preprint arXiv:2312.01429, 2023.
  103. Structured pruning of large language models. arXiv preprint arXiv:1910.04732, 2019.
  104. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024.
  105. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.
  106. Interpreting models via single tree approximation. arXiv preprint arXiv:1610.09036, 2016.
  107. A generic approach for reproducible model distillation. arXiv preprint arXiv:2211.12631, 2022.
  108. Approximation trees: Statistical stability in model distillation. arXiv preprint arXiv:1808.07573, 2018.