
Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation (2109.01048v3)

Published 2 Sep 2021 in cs.CL

Abstract: Existing technologies expand BERT from different perspectives, e.g., designing different pre-training tasks, different semantic granularities, and different model architectures. Few models consider expanding BERT to different text formats. In this paper, we propose a heterogeneous knowledge language model (HKLM), a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text. To capture the relations among this multi-format knowledge, our approach uses a masked language model objective to learn word knowledge, and uses a triple classification objective and a title matching objective to learn entity knowledge and topic knowledge, respectively. To obtain the aforementioned multi-format text, we construct a corpus in the tourism domain and conduct experiments on 5 tourism NLP datasets. The results show that our approach outperforms plain-text pre-training while using only 1/4 of the data. We further pre-train a domain-agnostic HKLM and achieve performance gains on the XNLI dataset.
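The joint objective described in the abstract (masked language modeling for word knowledge, triple classification for entity knowledge, title matching for topic knowledge) can be pictured as a multi-task loss over a shared encoder. The sketch below is illustrative only, assuming a BERT-style encoder from Hugging Face Transformers; the class name HKLMPretrainingHeads, the binary classification heads, and the equal weighting of the three losses are assumptions made here for clarity, not the authors' released implementation.

    # Minimal sketch of HKLM-style multi-objective pre-training (assumed design,
    # not the paper's official code).
    import torch.nn as nn
    from transformers import BertModel

    class HKLMPretrainingHeads(nn.Module):
        def __init__(self, model_name="bert-base-chinese"):
            super().__init__()
            self.encoder = BertModel.from_pretrained(model_name)
            hidden = self.encoder.config.hidden_size
            vocab = self.encoder.config.vocab_size
            self.mlm_head = nn.Linear(hidden, vocab)   # word knowledge: masked LM
            self.triple_head = nn.Linear(hidden, 2)    # entity knowledge: is the triple valid?
            self.title_head = nn.Linear(hidden, 2)     # topic knowledge: does the title match?

        def forward(self, input_ids, attention_mask,
                    mlm_labels, triple_labels, title_labels):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            seq, cls = out.last_hidden_state, out.pooler_output
            mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-100)  # skip unmasked positions
            cls_loss_fct = nn.CrossEntropyLoss()
            mlm_loss = mlm_loss_fct(
                self.mlm_head(seq).view(-1, self.mlm_head.out_features),
                mlm_labels.view(-1))
            triple_loss = cls_loss_fct(self.triple_head(cls), triple_labels)
            title_loss = cls_loss_fct(self.title_head(cls), title_labels)
            # Equal weighting of the three objectives is an assumption of this sketch.
            return mlm_loss + triple_loss + title_loss

In use, each training batch would pair masked token labels with a triple (true or corrupted) and a candidate section title, so that one forward pass updates all three heads; how the paper actually samples and mixes these formats is not specified in the abstract.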
