
Pareto Probing: Trading Off Accuracy for Complexity

Published 5 Oct 2020 in cs.CL and cs.LG | (2010.02180v3)

Abstract: The question of how to probe contextual word representations for linguistic structure in a way that is both principled and useful has seen significant attention recently in the NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations -- e.g., why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simplistic probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate to evaluate the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.


Summary

  • The paper introduces a Pareto-based framework that balances model complexity and accuracy in probing neural network embeddings.
  • It employs both parametric and non-parametric complexity metrics, demonstrating that challenging tasks like dependency parsing better capture syntactic information.
  • Empirical results show that contextual embeddings outperform non-contextual ones on complex linguistic tasks, underscoring the need for refined probing methodologies.

"Pareto Probing: Trading Off Accuracy for Complexity" (2010.02180)

Introduction

The paper "Pareto Probing: Trading Off Accuracy for Complexity" addresses the ongoing challenge in NLP of effectively probing neural network representations to evaluate their encoded linguistic knowledge. It introduces a novel methodology that considers the trade-off between probe complexity and performance using the Pareto hypervolume as a metric. The authors advocate for utilizing dependency parsing over simpler probing tasks, like part-of-speech labeling, to reveal the syntactic richness of contextual embeddings like BERT.

Performance and Complexity Trade-Off

Probing is treated as a bi-objective optimization problem: minimize probe complexity while maximizing accuracy on a linguistic task. The authors highlight that simplistic probing tasks may not accurately reflect the linguistic structure a representation encodes, as shown by results in which non-contextual embeddings such as fastText outperform BERT on easier tasks. This trade-off necessitates an evaluation that considers both dimensions jointly, via a Pareto frontier.
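To make the bi-objective view concrete, the sketch below extracts the set of undominated probes from a list of (complexity, accuracy) pairs. This is an illustrative implementation, not the authors' code, and the numbers are invented for the example.

```python
def pareto_frontier(probes):
    """Return the undominated (complexity, accuracy) points.

    One probe dominates another if it is no more complex and at
    least as accurate, with a strict improvement in one dimension.
    """
    frontier = []
    # Sort by complexity ascending, breaking ties by accuracy descending,
    # so a point joins the frontier only if it beats the best accuracy so far.
    for c, a in sorted(probes, key=lambda p: (p[0], -p[1])):
        if not frontier or a > frontier[-1][1]:
            frontier.append((c, a))
    return frontier

probes = [(1.0, 0.70), (2.0, 0.85), (3.0, 0.80), (4.0, 0.90)]
print(pareto_frontier(probes))  # [(1.0, 0.7), (2.0, 0.85), (4.0, 0.9)]
```

The probe at complexity 3.0 is dropped because the cheaper probe at 2.0 is already more accurate.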

Probing with Pareto Frontier

The concept of Pareto optimality serves as a framework for comparing probes that balance simplicity and performance. The Pareto-optimal probes, those not dominated by any other probe in both complexity and accuracy, form a frontier that enables comprehensive comparison across representations. The Pareto hypervolume, the area dominated by this frontier relative to a reference point, summarizes the trade-off in a single number and helps identify which representations encode linguistic structure most effectively.
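As an illustration of the metric (not the paper's implementation), the two-dimensional hypervolume can be computed as the area dominated by the frontier with respect to a reference point, assuming complexity is minimized, accuracy is maximized, and the reference point is (ref_complexity, ref_accuracy):

```python
def pareto_hypervolume(frontier, ref_complexity, ref_accuracy=0.0):
    """Area dominated by a 2-D frontier w.r.t. a worst-case reference point.

    frontier: undominated (complexity, accuracy) points with all
    complexities <= ref_complexity and accuracies >= ref_accuracy.
    """
    pts = sorted(frontier)  # increasing complexity, hence increasing accuracy
    volume = 0.0
    # Each point contributes a rectangle spanning from its complexity to the
    # next point's complexity (or the reference), at its accuracy level.
    for (c, a), nxt in zip(pts, pts[1:] + [(ref_complexity, None)]):
        volume += (nxt[0] - c) * (a - ref_accuracy)
    return volume

frontier = [(1.0, 0.70), (2.0, 0.85), (4.0, 0.90)]
print(pareto_hypervolume(frontier, ref_complexity=5.0))
```

A larger hypervolume means the representation admits probes that are simultaneously simpler and more accurate.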

Parametric and Non-Parametric Complexity Metrics

The study introduces two families of complexity metrics: parametric metrics, which constrain the probe itself (e.g., the rank or nuclear norm of a linear probe's weight matrix), and non-parametric metrics, which measure a probe's capacity to memorize its training data. These measures quantify how probe complexity influences performance and offer insight into how well simple versus complex probes capture linguistic information.
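For the parametric side, a small sketch (assuming NumPy; not the paper's code) of the nuclear norm, the sum of a weight matrix's singular values, which acts as a convex proxy for rank:

```python
import numpy as np

def nuclear_norm(W):
    """Sum of singular values -- a convex proxy for the rank of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

# A rank-1 "probe weight matrix" has a small nuclear norm relative to a
# full-rank random matrix of the same shape.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(50, 1)), rng.normal(size=(1, 20))
low_rank = u @ v              # rank 1 by construction
full_rank = rng.normal(size=(50, 20))
print(nuclear_norm(low_rank) < nuclear_norm(full_rank))  # True
```

Penalizing or capping this quantity yields a family of linear probes of graded complexity, which is what a Pareto-style evaluation needs.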

Empirical Results and Insights

Experiments across diverse languages using representations from BERT, ALBERT, and RoBERTa show that contextual embeddings encode substantial syntactic information, particularly on harder tasks such as dependency parsing. The results expose the limitations of toy tasks like part-of-speech labeling (POSL) and dependency arc labeling (DAL) for distinguishing the nuanced syntactic capabilities of different embeddings. The authors substantiate this by showing that contextual embeddings yield markedly better performance on the more challenging parsing task, supporting the argument for harder probing tasks.
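Dependency-parsing probes are typically scored by unlabeled and labeled attachment scores (UAS/LAS). A minimal sketch of that evaluation, with invented heads and labels for a three-word sentence (head index 0 denotes the root):

```python
def attachment_scores(gold_heads, pred_heads, gold_labels, pred_labels):
    """Unlabeled and labeled attachment scores for one sentence.

    Heads are given per token (0 = root); labels are relation names.
    """
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(g == p and gl == pl
              for g, p, gl, pl in zip(gold_heads, pred_heads,
                                      gold_labels, pred_labels)) / n
    return uas, las

# "dogs chase cats": gold tree roots at "chase"; the parser misattaches "cats".
gold_heads, gold_labels = [2, 0, 2], ["nsubj", "root", "obj"]
pred_heads, pred_labels = [2, 0, 1], ["nsubj", "root", "obj"]
print(attachment_scores(gold_heads, pred_heads, gold_labels, pred_labels))
```

Because every head and label must be predicted jointly, this task leaves far more room for contextual representations to separate themselves than per-token labeling does.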

Proposal for Harder Probing Tasks

The paper argues for the necessity of moving probing benchmarks to tasks that require modeling complete sentence structures, such as dependency parsing. The findings show that contextual models unveil their syntactic superiority through these challenging tasks, aligning with their known capacity to boost performance in comprehensive NLP applications.

Conclusion

"Pareto Probing: Trading Off Accuracy for Complexity" posits that analyzing neural network representations requires probing tasks that reflect real-world complexity. The Pareto-based approach coupled with dependency parsing tasks provides a clearer picture of the linguistic knowledge embedded in modern NLP models. This work calls for an advancement in probing methodologies tailored to harness the full potential of contextual embeddings, fostering deeper insights and more reliable evaluations.

The implementation insights and results from this research help frame probing as a nuanced evaluation problem, prompting further methodological refinement and a deeper understanding of neural network representations.
