Language Models as Hierarchy Encoders (2401.11374v4)

Published 21 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Interpreting hierarchical structures latent in language is a key limitation of current LMs. While previous research has implicitly leveraged these hierarchies to enhance LMs, approaches for their explicit encoding are yet to be explored. To address this, we introduce a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs), harnessing the expansive nature of hyperbolic space. Our method situates the output embedding space of pre-trained LMs within a Poincaré ball with a curvature that adapts to the embedding dimension, followed by training on hyperbolic clustering and centripetal losses. These losses are designed to effectively cluster related entities (input as texts) and organise them hierarchically. We evaluate HiTs against pre-trained LMs, standard fine-tuned LMs, and several hyperbolic embedding baselines, focusing on their capabilities in simulating transitive inference, predicting subsumptions, and transferring knowledge across hierarchies. The results demonstrate that HiTs consistently outperform all baselines in these tasks, underscoring the effectiveness and transferability of our re-trained hierarchy encoders.

References (36)
  1. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26, 2013.
  2. Probing BERT in hyperbolic spaces. In International Conference on Learning Representations, 2020.
  3. OWL2Vec*: Embedding of OWL ontologies. Machine Learning, 110(7):1813–1845, 2021.
  4. Contextual semantic embeddings for ontology subsumption prediction. World Wide Web, pp. 1–23, 2023.
  5. FoodOn: a harmonized food ontology to increase global food traceability, quality control and data integration. npj Science of Food, 2(1):23, 2018.
  6. Hyperbolic neural networks. Advances in Neural Information Processing Systems, 31, 2018a.
  7. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, pp. 1646–1655. PMLR, 2018b.
  8. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910, 2021.
  9. SORBET: A Siamese network for ontology embeddings using a distance-based regression loss and BERT. In International Semantic Web Conference, pp. 561–578. Springer, 2023.
  10. Gromov, M. Hyperbolic groups. In Essays in Group Theory, pp. 75–263. Springer, 1987.
  11. Schema.org: evolution of structured data on the web. Communications of the ACM, 59(2):44–51, 2016.
  12. Analyzing BERT’s knowledge of hypernymy via prompting. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 275–282, 2021.
  13. DeepOnto: A Python package for ontology engineering with deep learning. arXiv preprint arXiv:2307.03067, 2023a.
  14. Language model analysis for ontology subsumption inference. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 3439–3453, Toronto, Canada, July 2023b. Association for Computational Linguistics.
  15. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, pp. 2, 2019.
  16. Geoopt: Riemannian optimization in PyTorch. arXiv preprint arXiv:2005.02819, 2020.
  17. Does BERT know that the is-a relation is transitive? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 94–99, 2022.
  18. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4228–4238, 2021.
  19. Concept placement using BERT trained by transforming and summarizing biomedical ontology structure. Journal of Biomedical Informatics, 112:103607, 2020.
  20. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  21. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  22. Miller, G. A. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  23. Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems, 30, 2017.
  24. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  25. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2463–2473, 2019.
  26. Improving language understanding by generative pre-training. 2018.
  27. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, 2019.
  28. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2699–2712, 2020.
  29. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Research, 40(D1):D940–D946, 2012.
  30. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
  31. SNOMED Clinical Terms: overview of the development process and project status. In Proceedings of the AMIA Symposium, pp. 662. American Medical Informatics Association, 2001.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  33. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
  34. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.
  35. Xiao, H. bert-as-service. https://github.com/hanxiao/bert-as-service, 2018.
  36. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.

Summary

  • The paper presents a novel approach by re-training language models as Hierarchy Transformer encoders (HiTs) to better capture hierarchical structures.
  • It employs hyperbolic clustering and centripetal losses, achieving superior performance in transitive inference and subsumption prediction compared to traditional models.
  • Evaluations on datasets like WordNet demonstrate that HiTs more accurately position entities within semantic hierarchies, indicating strong potential for hierarchy-based applications.

Introduction

LMs have driven significant progress in NLP, with transformer-based models such as BERT and GPT, and more recent LLMs such as GPT-4 and Llama 2, achieving remarkable success. Despite these advances, encoding and interpreting the hierarchical structures latent in language remains a challenge for current LMs. Hanna & Mareček (2021) and He et al. (2023b) demonstrated the limited hierarchical knowledge in pre-trained LMs, while Lin & Ng (2022) showed that these models struggle with the transitivity of hierarchical relationships. Various methods for incorporating hierarchical information into LMs have been explored, yet the explicit encoding of hierarchies warrants further attention.

Novel Approach: Hierarchy Transformer encoders (HiTs)

This paper bridges that gap with a novel approach for re-training transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs), exploiting hyperbolic geometry's effectiveness at representing hierarchical structures. The method re-situates the output embedding space of the LM within a Poincaré ball whose curvature adapts to the embedding dimension, and introduces hyperbolic clustering and centripetal losses to cluster related entities and organise them hierarchically. HiTs are evaluated against pre-trained LMs, fine-tuned LMs, and hyperbolic embedding baselines, demonstrating superior performance in simulating transitive inference, predicting subsumptions, and transferring knowledge across hierarchies.
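The two losses can be pictured with a short sketch. Below is a minimal, illustrative formulation, assuming margin-based objectives over hyperbolic distances and norms computed with the geoopt library the paper cites; the dimension, margins, and function names are placeholders rather than the authors' exact setup:

```python
import torch
import geoopt  # Riemannian optimization library referenced by the paper

d = 768                                # assumed encoder output dimension
ball = geoopt.PoincareBall(c=1.0 / d)  # curvature adapted to the dimension

def clustering_loss(child, parent, negative, margin=1.0):
    """Triplet-style loss: pull child-parent pairs together and push
    unrelated (negative) pairs apart in hyperbolic distance.
    child/parent/negative: (batch, d) points inside the Poincaré ball,
    e.g. projected LM embeddings of entity names."""
    pos = ball.dist(child, parent)
    neg = ball.dist(child, negative)
    return torch.relu(pos - neg + margin).mean()

def centripetal_loss(child, parent, margin=0.1):
    """Encourage parents to lie closer to the ball's origin than their
    children, so hierarchy depth maps to hyperbolic norm."""
    return torch.relu(ball.dist0(parent) - ball.dist0(child) + margin).mean()
```

During re-training, these two objectives would presumably be combined (e.g. summed, possibly with a weighting factor) over batches of child, parent, and negative entity texts encoded by the LM.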

Implementation Insights and Evaluation

The paper provides a comprehensive overview of key concepts, including transformer encoder-based LMs and hyperbolic geometry, along with a formal definition of hierarchy. The method takes the output embedding space of a transformer encoder-based LM, which typically lies within a d-dimensional hyper-cube due to the tanh activation function, and constructs a Poincaré ball whose boundary circumscribes that hyper-cube, so that re-training takes place in hyperbolic space (a construction sketched below). HiTs’ capabilities are showcased on the Multi-hop Inference and Mixed-hop Prediction tasks. The evaluation uses datasets derived from WordNet’s noun hierarchy and other ontologies, showing that HiTs significantly outperform both pre-trained and fine-tuned LMs.
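A minimal sketch of this construction, assuming a Hugging Face encoder with simple mean pooling (the model name, pooling choice, and probe entities are illustrative placeholders, not the authors' exact setup):

```python
import torch
import geoopt
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder encoder
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

d = encoder.config.hidden_size
# With curvature -1/d the ball has radius sqrt(d), which just encloses the
# [-1, 1]^d hyper-cube whose farthest corner sits at Euclidean norm sqrt(d).
ball = geoopt.PoincareBall(c=1.0 / d)

batch = tokenizer(["beagle", "dog", "animal"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, d)
embeddings = hidden.mean(dim=1)      # simple mean pooling (padding ignored for brevity)
embeddings = ball.projx(embeddings)  # keep the points strictly inside the ball
print(ball.dist0(embeddings))        # hyperbolic norms of the three entities
```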

Analysis and Future Work

The analysis of HiT embeddings shows a clear distribution of WordNet entity embeddings with respect to their hyperbolic norms, indicating that the depth-wise expansion of the hierarchy is effectively captured. Selected cases provide further evidence of HiTs’ effectiveness: more specific entities lie further from the manifold’s origin, reflecting their lower position in the hierarchy (a simple norm- and distance-based check is sketched below). This underlines HiTs’ potential for hierarchy-oriented semantic search, earmarked as a direction for future exploration.
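As a concrete illustration of that kind of check (not the paper's actual evaluation protocol), one could score a candidate (child, parent) pair by combining hyperbolic distance with the difference in hyperbolic norms; the curvature, weighting `lam`, and threshold below are hypothetical:

```python
import geoopt

d = 768                                # assumed embedding dimension
ball = geoopt.PoincareBall(c=1.0 / d)  # same curvature convention as above

def subsumption_score(child_emb, parent_emb, lam=1.0):
    """Higher when the pair is close in hyperbolic distance and the parent
    sits nearer the origin (i.e. higher in the hierarchy) than the child."""
    closeness = -ball.dist(child_emb, parent_emb)
    depth_gap = ball.dist0(child_emb) - ball.dist0(parent_emb)
    return closeness + lam * depth_gap

def is_subsumption(child_emb, parent_emb, threshold=0.0, lam=1.0):
    # Hypothetical decision rule; in practice the threshold would be tuned
    # on a validation split of the hierarchy.
    return subsumption_score(child_emb, parent_emb, lam) > threshold
```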

In conclusion, the paper presents an innovative approach to extend the capabilities of LMs for better encoding hierarchical structures. The introduction of HiTs represents a promising direction in the deployment of LMs for tasks demanding an understanding of complex semantic hierarchies, demonstrating the potential for significant advancements in hierarchy-oriented applications within NLP.
