The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models (2311.05928v2)

Published 10 Nov 2023 in cs.CL, cs.AI, cs.IT, cs.LG, math.GN, and math.IT

Abstract: In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders follows a distinct bell-shaped curve, with the highest anisotropy concentrated in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we find that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space; this is followed by a compression phase towards the end of training in which dimensionality decreases, suggesting a refinement into more compact representations. Our results provide fresh insights into the embedding properties of encoders and decoders.
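The two quantities the abstract tracks, layer-wise anisotropy and intrinsic dimension, can be probed with standard estimators. Below is a minimal Python sketch assuming the token embeddings of one layer are available as a NumPy array of shape (n_tokens, hidden_dim); it uses the common average cosine similarity between random embedding pairs as the anisotropy measure and the TwoNN maximum-likelihood estimator for intrinsic dimension. These are illustrative choices and hypothetical function names, not necessarily the exact metrics or code used in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree


def anisotropy(embeddings: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Anisotropy proxy: mean cosine similarity between randomly sampled
    pairs of token embeddings (near 0 for an isotropic space, near 1 when
    embeddings crowd into a narrow cone)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(embeddings), size=n_pairs)
    j = rng.integers(0, len(embeddings), size=n_pairs)
    a, b = embeddings[i], embeddings[j]
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    return float(cos.mean())


def intrinsic_dimension_twonn(embeddings: np.ndarray) -> float:
    """TwoNN estimate: the ratio of second- to first-nearest-neighbour
    distances follows a Pareto law whose shape parameter is the intrinsic
    dimension; return its maximum-likelihood fit."""
    tree = cKDTree(embeddings)
    dists, _ = tree.query(embeddings, k=3)   # columns: self, 1st NN, 2nd NN
    mu = dists[:, 2] / np.maximum(dists[:, 1], 1e-12)
    mu = mu[mu > 1.0]                        # drop duplicate/degenerate points
    return float(len(mu) / np.log(mu).sum())


# Hypothetical usage: hidden states collected per layer from a forward pass,
# stacked as an array of shape (n_layers, n_tokens, hidden_dim).
# profile = [anisotropy(h) for h in hidden_states]            # layer-wise anisotropy
# dims = [intrinsic_dimension_twonn(h) for h in hidden_states]  # layer-wise ID
```

Plotting the anisotropy profile over layer index is what would reveal the bell-shaped curve reported for decoders, while repeating the intrinsic-dimension estimate at successive training checkpoints would expose the expansion-then-compression dynamic described in the abstract.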
