INDUS: Effective and Efficient Language Models for Scientific Applications (2405.10725v3)
Abstract: Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research has demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored to the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics, trained on curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained with a domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained on a diverse set of datasets to address information retrieval tasks, and (3) smaller versions of these models created through knowledge distillation for applications with latency or resource constraints. We also created three new scientific benchmark datasets, CLIMATE-CHANGE NER (entity recognition), NASA-QA (extractive QA), and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SciBERT) encoders on these new tasks as well as on existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings: as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
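To make the second component concrete, the sketch below shows a minimal in-batch-negatives contrastive (InfoNCE) loss of the kind commonly used to train bi-encoder text embedding models such as the one described here. This is an illustration under stated assumptions, not the paper's exact recipe: the temperature value, cosine-similarity scoring, and the paired query/passage batch layout are assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives for a bi-encoder retriever.

    query_emb, passage_emb: (batch, dim) tensors, where row i of passage_emb
    is the positive passage for row i of query_emb; all other rows in the
    batch serve as negatives. The temperature (0.05) is an illustrative choice.
    """
    q = F.normalize(query_emb, dim=-1)            # unit-normalize embeddings
    p = F.normalize(passage_emb, dim=-1)
    logits = (q @ p.T) / temperature              # pairwise cosine similarities, scaled
    labels = torch.arange(q.size(0), device=q.device)  # positive is the diagonal entry
    return F.cross_entropy(logits, labels)
```

In practice the two inputs would come from encoding a query batch and its matched passages with the embedding model; the cross-entropy over the similarity matrix pushes each query toward its paired passage and away from the other passages in the batch.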
- Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models.
- SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- On the use of arXiv as a dataset.
- SPECTER: Document-level Representation Learning using Citation-informed Transformers. In ACL.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- SearchQA: A new Q&A dataset augmented with context from a search engine.
- Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, page 1156–1165, New York, NY, USA. Association for Computing Machinery.
- SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthcare, 3(1).
- Distilling the Knowledge in a Neural Network. In NeurIPS Deep Learning Workshop.
- The diminishing returns of masked language models to science.
- Shu Huang and Jacqueline M Cole. 2022. BatteryBERT: A pretrained language model for battery database enhancement. J. Chem. Inf. Model. DOI: 10.1021/acs.jcim.2c00035.
- Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- Supervised contrastive learning. In Advances in Neural Information Processing Systems, volume 33, pages 18661–18673. Curran Associates, Inc.
- The semantic scholar open data platform.
- Natural Questions: A Benchmark for Question Answering Research. Transactions of the ACL.
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. Transactions of the Association for Computational Linguistics, 9:1098–1115.
- Towards general text embeddings with multi-stage contrastive learning.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.
- Language models are unsupervised multitask learners.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1).
- Know what you don’t know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.
- SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Novel tau biomarkers phosphorylated at T181, T217 or T231 rise in the initial stages of the preclinical Alzheimer’s continuum when only subtle changes in Aβ pathology are detected. EMBO Molecular Medicine, 12(12):e12921.
- BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models.
- FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
- Fine-tuning large neural language models for biomedical natural language processing. Patterns, 4(4).
- LLaMA: Open and efficient foundation language models.
- Neural architecture search for effective teacher-student knowledge transfer in language models. arXiv preprint arXiv:2303.09639.
- A comparative analysis of task-agnostic distillation methods for compressing transformer language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 20–31, Singapore. Association for Computational Linguistics.
- Representation learning with contrastive predictive coding.
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- The impact of domain-specific pre-training on named entity recognition tasks in materials science. Available at SSRN 3950755.
- Text embeddings by weakly-supervised contrastive pre-training.
- MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, Online. Association for Computational Linguistics.
- A theoretical analysis of NDCG type ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 25–54, Princeton, NJ, USA. PMLR.
- BloombergGPT: A large language model for finance.
- RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 538–548, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- C-Pack: Packaged resources to advance general Chinese embedding.
- DistillCSE: Distilled contrastive learning for sentence embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8153–8165, Singapore. Association for Computational Linguistics.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.