Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics (2403.01509v2)

Published 3 Mar 2024 in cs.CL

Abstract: Large language models (LLMs) have achieved remarkable success in general language understanding tasks. However, for this family of generative models trained with a next-token prediction objective, how semantics evolve with depth is not yet fully explored, unlike for their predecessors such as BERT-like architectures. In this paper, we investigate the bottom-up evolution of lexical semantics in a popular LLM, Llama2, by probing its hidden states at the end of each layer with a contextualized word identification task. Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction. This contrasts with models trained on discriminative objectives, such as masked language modeling, in which the higher layers obtain better lexical semantics. The conclusion is further supported by the monotonic increase in performance when probing the hidden states of trailing meaningless symbols, such as punctuation, under the prompting strategy. Our code is available at https://github.com/RyanLiut/LLM_LexSem.
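
As a rough illustration of this kind of layer-wise probing, the sketch below extracts the hidden state of a target word after every layer of a causal LM and checks which candidate word that representation is closest to. This is a minimal sketch under assumed details: the checkpoint name, example sentence, candidate list, and cosine-similarity scoring against static input embeddings are illustrative choices, not the paper's exact protocol.

```python
# Minimal layer-wise probing sketch (not the authors' exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"            # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

sentence = "The bank raised its interest rates again this week."
target_word = "bank"                               # word whose layer-wise representation we probe
candidates = ["bank", "river", "money", "shore"]   # illustrative candidate set

inputs = tokenizer(sentence, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq_len, hidden_dim];
# index 0 is the embedding output, index i is the output of transformer layer i.
hidden_states = outputs.hidden_states

# Locate the first token of the target word in the tokenized sentence.
target_ids = tokenizer(target_word, add_special_tokens=False)["input_ids"]
seq = inputs["input_ids"][0].tolist()
pos = next(i for i in range(len(seq)) if seq[i:i + len(target_ids)] == target_ids)

# Use the static input embedding of each candidate's first token as a probe anchor.
emb = model.get_input_embeddings().weight          # [vocab_size, hidden_dim]
cand_ids = [tokenizer(c, add_special_tokens=False)["input_ids"][0] for c in candidates]
cand_vecs = emb[cand_ids].float()                  # [num_candidates, hidden_dim]

for layer, hs in enumerate(hidden_states):
    vec = hs[0, pos].float()                       # target word's representation at this layer
    sims = torch.nn.functional.cosine_similarity(vec.unsqueeze(0), cand_vecs)
    print(f"layer {layer:2d}: best candidate = {candidates[int(sims.argmax())]!r}")
```

Running such a probe over a labeled dataset and tracking accuracy per layer is what allows the comparison the abstract describes: where lexical semantics peaks in a generative model versus a masked-language-model architecture.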
