
INDUS: Effective and Efficient Language Models for Scientific Applications (2405.10725v3)

Published 17 May 2024 in cs.CL and cs.IR

Abstract: LLMs trained on general domain corpora showed remarkable results on NLP tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely-related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models include: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications which have latency or resource constraints. We also created three new scientific benchmark datasets, CLIMATE-CHANGE NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SCIBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings -- as a retrieval model for large-scale vector search applications and in automatic content tagging systems.


Summary

  • The paper introduces Indus, a suite of language models optimized for multi-disciplinary scientific research through domain-tailored tokenization and contrastive learning.
  • The encoder models, built on curated corpora from sources like NASA and PubMed, consistently outperform general-purpose models on specialized benchmarks.
  • The paper demonstrates effective knowledge distillation techniques that produce smaller, faster models without significant accuracy loss for resource-limited applications.

Tailoring LLMs for Multi-Disciplinary Scientific Research

Introduction to LLMs in Specialized Domains

LLMs have stormed the NLP scene, showcasing impressive feats in understanding and generating human language. Most of these models, RoBERTa, BERT, and GPT-3 among them, are built on general-purpose corpora. However, researchers have found that such general-purpose models do not always shine on domain-specific tasks, especially when the vocabulary and context differ significantly from everyday language.

This is where Indus steps in. Unlike its predecessors, Indus focuses on tailoring LLMs for specific scientific domains, including Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics. The primary goal? To create models trained from the ground up to excel in tasks relevant to these fields.

The Indus Suite: Specialized Models and Benchmarks

Customized Tokenizer: IndusBPE

Tokenization is the bedrock of NLP models. The Indus team created IndusBPE, a byte-pair-encoding tokenizer trained specifically on scientific literature. Because its vocabulary is learned from scientific text, complex technical terms are split into far fewer fragments, and many are kept as single tokens, which makes processing of specialized content more efficient and contextually accurate.
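To make the idea concrete, here is a minimal sketch of how a domain BPE tokenizer in the spirit of IndusBPE could be trained with the Hugging Face `tokenizers` library. The corpus file name, vocabulary size, and special tokens are illustrative assumptions, not the paper's exact configuration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer, trained from scratch on scientific text.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50_265,  # illustrative; should match the encoder's embedding size
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# "scientific_corpus.txt" is a hypothetical plain-text dump of the training corpus.
tokenizer.train(files=["scientific_corpus.txt"], trainer=trainer)
tokenizer.save("indusbpe.json")

# A domain-trained vocabulary splits technical terms into far fewer pieces
# than a general-purpose one.
print(tokenizer.encode("magnetohydrodynamics").tokens)
```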

Encoder Models

The Indus suite includes multiple encoder models designed to excel in natural language understanding for the scientific domains mentioned earlier. These models are trained using a meticulously curated scientific corpus that draws from sources like NASA's Common Metadata Repository, PubMed Central, and the American Meteorological Society. The focus here is on capturing the intricacies of domain-specific language, aiming to outperform general-purpose models like RoBERTa and even domain-specific ones like SciBERT.
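The sketch below shows how such an encoder could be pretrained from scratch with masked language modeling using the Hugging Face `transformers` and `datasets` libraries. The tokenizer file, corpus path, and hyperparameters are assumptions for illustration, not the paper's actual training setup.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Load the domain tokenizer trained earlier (hypothetical file name).
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="indusbpe.json",
    bos_token="<s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>",
)

# RoBERTa-style encoder initialized from scratch with the domain vocabulary.
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

# "scientific_corpus.txt" stands in for the curated scientific corpus.
corpus = load_dataset("text", data_files={"train": "scientific_corpus.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are masked each time a batch is drawn.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indus-encoder", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```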

Contrastive Learning for Text Embeddings

A standout component of Indus is its use of contrastive learning to develop text embedding models. These models create "universal" sentence embeddings that can be used for information retrieval tasks. The technique pushes similar sentences closer together in the embedding space and differentiates them from dissimilar ones, making it highly effective for identifying relevant passages in scientific literature.
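The core training signal is easy to state. Below is a minimal PyTorch sketch of the in-batch-negative contrastive (InfoNCE-style) objective commonly used to train bi-encoder retrieval models; the temperature value and the assumption that the i-th query is paired with the i-th passage are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative (InfoNCE-style) loss.

    The i-th query's positive is the i-th passage; every other passage
    in the batch acts as a negative. Both inputs are (batch, dim).
    """
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    # Cosine-similarity matrix between every query and every passage.
    logits = query_emb @ passage_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=logits.device)
    # Cross-entropy pulls matched pairs together and pushes mismatches apart.
    return F.cross_entropy(logits, labels)
```

In practice the query and passage embeddings would come from the encoder (for example, mean pooling over its final hidden states), and harder negatives can be appended as extra columns of the similarity matrix.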

Smaller, Faster Models via Knowledge Distillation

For use cases where computational resources are limited, smaller versions of the Indus models are available. Built using knowledge distillation techniques, these lightweight models provide faster, yet still robust, performance.
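The paper relies on knowledge-distillation techniques to build these smaller models. As one concrete illustration (not necessarily the paper's exact procedure), here is a sketch of classic soft-target distillation in PyTorch, where a student matches the teacher's softened output distribution while still learning from the hard labels; the temperature and mixing weight are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a soft-target KL term against the teacher with the usual hard-label loss."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale so the soft term matches the hard-label gradient scale
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```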

New Benchmark Datasets

To gauge the effectiveness of these models, the researchers also introduced three new benchmark datasets:

  • Climate-Change NER: Focused on named entity recognition in climate-related literature.
  • NASA-QA: An extractive question-answering dataset centered around Earth science.
  • NASA-IR: A retrieval task dataset spanning multiple scientific domains.

These benchmarks are aimed at accelerating research and ensuring models are tested on relevant, real-world scientific queries.

Results and Performance

Indus models demonstrated strong performance across the board, consistently outperforming both general-purpose models like RoBERTa and domain-specific ones like SciBERT on a range of tasks, including the new benchmark datasets:

  • Climate-Change NER: Indus achieved an F1 score of 64.0, significantly outperforming RoBERTa (60.8) and SciBERT (61.8).
  • NASA-QA: Indus clocked in at an F1 score of 68.2, leading the pack against RoBERTa (66.8) and SciBERT (63.5).
  • NASA-IR: Indus achieved a Recall@10 of 0.73, better than RoBERTa (0.66) and the baseline embedding models (Recall@10 is sketched just below).
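For reference, Recall@10 is commonly computed as the fraction of queries for which a relevant document appears among the top ten retrieved results (equivalent to the standard definition when each query has a single relevant passage). A minimal sketch, with illustrative names:

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Fraction of queries with at least one relevant document in the top k results."""
    hits = 0
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)
```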

Moreover, the smaller, distilled versions of Indus held up well: the mini models provided faster retrieval while retaining strong accuracy, proving effective in resource-constrained environments.

Implications and Future Work

The tailored approach of the Indus suite has broad implications for scientific research. By delivering models that better understand and process specialized vocabularies and contexts, researchers and organizations can expect more relevant, precise, and actionable insights from their NLP applications.

Moving forward, integrating these models into scientific workflows could enhance literature reviews, data retrieval, and even grant applications by making it easier and faster to find relevant information.

The development of benchmarks also sets a new standard for evaluating NLP models in scientific domains, encouraging continuous improvement and innovation in this space.

While the current suite covers several scientific fields, there is always room for expansion. Future work could focus on additional domains or languages, making Indus a go-to resource for a wider array of scientific disciplines and global research communities. The interplay of these models with emerging AI technologies and interdisciplinary applications could also be an exciting area of exploration.

In sum, Indus represents a step forward in tailoring NLP models to specialized domains, offering the scientific community promising tools to advance its research more effectively.