Grammatical information in BERT sentence embeddings as two-dimensional arrays (2312.09890v1)

Published 15 Dec 2023 in cs.CL

Abstract: Sentence embeddings induced with various transformer architectures encode much semantic and syntactic information in a distributed manner in a one-dimensional array. We investigate whether specific grammatical information can be accessed in these distributed representations. Using data from a task developed to test rule-like generalizations, our experiments on detecting subject-verb agreement yield several promising results. First, we show that while the usual sentence representations encoded as one-dimensional arrays do not easily support extraction of rule-like regularities, a two-dimensional reshaping of these vectors allows various learning architectures to access such information. Next, we show that various architectures can detect patterns in these two-dimensional reshaped sentence embeddings and successfully learn a model based on smaller amounts of simpler training data, which performs well on more complex test data. This indicates that current sentence embeddings contain information that is regularly distributed, and which can be captured when the embeddings are reshaped into higher dimensional arrays. Our results cast light on representations produced by LLMs and help move towards developing few-shot learning approaches.
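The paper's central manipulation, reshaping a sentence embedding from a one-dimensional vector into a two-dimensional array so that 2D pattern detectors can exploit regularly distributed grammatical signals, can be illustrated with a minimal sketch. The specific choices below (the `bert-base-uncased` checkpoint, mean pooling over tokens, and a 32×24 factorization of the 768-dimensional embedding) are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: obtain a 1D BERT sentence embedding and reshape it
# into a 2D array, the manipulation the abstract describes for exposing
# grammatical regularities to 2D learners.
# Assumptions (not from the paper): bert-base-uncased, mean pooling,
# and a 32x24 reshape of the 768-dimensional embedding.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The keys to the cabinet are on the table."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Mean-pool token vectors into a single 1D sentence embedding.
embedding_1d = hidden.mean(dim=1).squeeze(0)   # shape: (768,)

# Reshape the 1D vector into a 2D array; 768 = 32 * 24 here.
embedding_2d = embedding_1d.reshape(32, 24)    # shape: (32, 24)

# A 2D learner (e.g., a small CNN) can now look for spatial patterns
# in the reshaped embedding instead of treating it as a flat vector.
cnn = torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
features = cnn(embedding_2d.unsqueeze(0).unsqueeze(0))  # (1, 4, 30, 22)
print(features.shape)
```

Any factorization of 768 (e.g., 24×32 or 16×48) yields a valid 2D input; the abstract's claim is that some such reshaping exposes regularities that the usual 1D representation does not easily support extracting.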
