Learning High-Quality and General-Purpose Phrase Representations (2401.10407v2)
Abstract: Phrase representations play an important role in data science and natural language processing, benefiting various tasks such as Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method fine-tunes pre-trained language models for phrasal embeddings using contrastive learning. However, we identify two areas for improvement. First, these pre-trained models tend to be unnecessarily complex and require pre-training on a corpus with context sentences. Second, leveraging phrase type and morphology yields phrase representations that are both more precise and more flexible. We propose an improved framework that learns phrase representations in a context-free fashion. The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design data augmentation at three granularities to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size. [PEARL-small]: https://huggingface.co/Lihuchen/pearl_small; [PEARL-base]: https://huggingface.co/Lihuchen/pearl_base; [Code and Dataset]: https://github.com/tigerchen52/PEARL
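The training objective sketched in the abstract (contrastive learning over phrase pairs, with phrase type classification as an auxiliary task) can be illustrated with a minimal numpy sketch. This assumes an InfoNCE-style in-batch contrastive loss and a hypothetical linear phrase-type head with weight `W` and mixing weight `lam`; the actual PEARL objective, architecture, and hyperparameters are given in the paper and its released code.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """In-batch contrastive loss: row i of `positives` is the positive
    for row i of `anchors`; every other row serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) cosine similarities
    # Log-softmax over each row; the positive sits on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def type_classification_loss(embeddings, type_labels, W):
    """Auxiliary cross-entropy over phrase-type labels (e.g. person,
    location, ...) via a hypothetical linear head W of shape (dim, n_types)."""
    logits = embeddings @ W
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(type_labels)), type_labels])

def joint_loss(anchors, positives, type_labels, W, lam=0.1):
    """Contrastive objective plus weighted auxiliary type classification."""
    return info_nce(anchors, positives) + lam * type_classification_loss(
        anchors, type_labels, W)
```

As a sanity check, perfectly aligned anchor/positive pairs should score a lower contrastive loss than misaligned ones, since the diagonal then carries the highest similarity in each row.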