Pareto Probing: Trading Off Accuracy for Complexity (2010.02180v3)

Published 5 Oct 2020 in cs.CL and cs.LG

Abstract: The question of how to probe contextual word representations for linguistic structure in a way that is both principled and useful has seen significant attention recently in the NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations -- e.g., why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simplistic probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate to evaluate the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
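
A rough sense of the proposed metric can be conveyed with a small sketch. The snippet below is not the paper's implementation; it is a minimal illustration under some assumptions: each probe is summarized by a single (complexity, accuracy) pair, complexity is normalized to [0, 1] with lower values meaning a simpler probe, and the hypervolume reference point is (complexity = 1, accuracy = 0). Under those assumptions, the two-dimensional Pareto hypervolume is simply the area dominated by the probe family's accuracy-complexity frontier.

```python
# Minimal sketch (not the paper's code) of a 2D Pareto-hypervolume
# computation for probes. Assumptions: complexity and accuracy both lie
# in [0, 1]; lower complexity and higher accuracy are better; the
# reference point is (complexity=1.0, accuracy=0.0).

from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (complexity, accuracy) pairs,
    sorted by increasing complexity."""
    front = []
    for c, a in sorted(points):              # sweep in order of increasing complexity
        if not front or a > front[-1][1]:    # keep only strict accuracy improvements
            front.append((c, a))
    return front


def hypervolume(points: List[Tuple[float, float]],
                ref: Tuple[float, float] = (1.0, 0.0)) -> float:
    """Area dominated by the Pareto frontier, measured against `ref`
    (worst complexity, worst accuracy)."""
    hv = 0.0
    prev_c = ref[0]
    # Sweep from the most complex frontier point to the simplest,
    # accumulating rectangles of width (prev_c - c) and height (a - ref_accuracy).
    for c, a in reversed(pareto_front(points)):
        hv += (prev_c - c) * (a - ref[1])
        prev_c = c
    return hv


if __name__ == "__main__":
    # Hypothetical probes of increasing capacity for one representation.
    probes = [(0.1, 0.70), (0.3, 0.85), (0.6, 0.88), (0.9, 0.89)]
    print(f"Pareto hypervolume: {hypervolume(probes):.3f}")
```

A representation whose probes reach high accuracy at low complexity dominates a larger area, so a larger hypervolume indicates linguistic structure that is more readily extractable under this trade-off view.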

Authors (4)
  1. Tiago Pimentel (55 papers)
  2. Naomi Saphra (34 papers)
  3. Adina Williams (72 papers)
  4. Ryan Cotterell (226 papers)
Citations (59)