
Understanding Probe Behaviors through Variational Bounds of Mutual Information (2312.10019v1)

Published 15 Dec 2023 in cs.IT, cs.LG, eess.AS, and math.IT

Abstract: With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding of, and provide guidelines for, linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze why the layer-wise performance curve can be convex, which seemingly violates the data processing inequality. However, we show that intermediate representations can have the largest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can serve as a criterion for measuring the "goodness" of a representation, and we compare accuracy with MI as measurement criteria. Finally, we empirically validate our claims by observing how self-supervised speech models retain word and phoneme information.
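
The probing-MI connection the abstract describes rests on a standard variational bound: for any probe q(y|z), I(Z; Y) >= H(Y) - E[-log q(y|z)], so the cross-entropy loss of a trained probe directly yields a lower bound on the mutual information between a representation and its labels. Below is a minimal sketch of this idea in PyTorch; it is our illustration, not the paper's code, and the function name `mi_lower_bound` and its arguments are hypothetical.

```python
# Minimal sketch: a linear probe's cross-entropy gives a variational
# lower bound on mutual information,
#   I(Z; Y) >= H(Y) - E[cross-entropy of the probe],
# so a lower probe loss certifies a higher MI lower bound (all in nats).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mi_lower_bound(representations, labels, num_classes, epochs=200, lr=1e-2):
    """Fit a linear probe on frozen features and return H(Y) - CE in nats."""
    probe = nn.Linear(representations.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(representations), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        # In practice, evaluate CE on held-out data to avoid an
        # overly optimistic bound from probe overfitting.
        ce = F.cross_entropy(probe(representations), labels).item()
    # Empirical label entropy H(Y), using natural log to match F.cross_entropy.
    probs = torch.bincount(labels, minlength=num_classes).float()
    probs = probs / probs.sum()
    h_y = -(probs[probs > 0] * probs[probs > 0].log()).sum().item()
    return h_y - ce  # variational lower bound on I(Z; Y)

# Usage idea: call mi_lower_bound on each layer's [N, D] feature tensor of a
# frozen encoder to trace the layer-wise MI curve the paper analyzes.
```

Comparing this bound layer by layer is what makes the abstract's observation interesting: the data processing inequality says true MI cannot increase with depth, yet the estimated bound can peak at an intermediate layer because later layers trade raw information for separability.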
