Distilling Named Entity Recognition Models for Endangered Species from Large Language Models (2403.15430v1)
Abstract: Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured sources such as patents, papers, and theses, without requiring domain-specific expertise. At the same time, ecological experts are seeking a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. Specifically, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 covering four classes of endangered species, and 2) humans verified the factual accuracy of the synthetic data, yielding gold data. The resulting dataset contains 3.6K sentences in total, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, the constructed dataset was then used to fine-tune both general and domain-specific BERT variants, completing the knowledge distillation from GPT-4 to BERT. Experiments show that this knowledge-transfer approach is effective at producing an NER model that detects endangered species in text.
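The first stage the abstract describes is in-context generation of synthetic sentences from GPT-4. Below is a minimal sketch of that step using the OpenAI chat API; the prompt wording, the output format, the SPECIES label, and the four category names are illustrative assumptions, not the authors' actual prompt or classes.

```python
# Stage 1 (sketch): generate synthetic NER sentences from GPT-4 via
# in-context learning. Prompt, label set, and categories are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_PROMPT = """Generate one sentence about an endangered species and tag it.

Example:
Sentence: The Amur leopard is critically endangered due to poaching.
Entities: [("Amur leopard", "SPECIES")]

Now write a new sentence about an endangered {category} and list its
entities in the same format."""

def generate_synthetic_example(category: str) -> str:
    """Ask GPT-4 for one synthetic training sentence (pre-verification)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": FEW_SHOT_PROMPT.format(category=category)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# The paper covers four classes of endangered species; these names are
# placeholders for whichever categories the authors actually used.
for category in ["mammal", "bird", "reptile", "amphibian"]:
    print(generate_synthetic_example(category))
```

In the paper's pipeline, output produced this way is not used directly: humans first verify its factual accuracy, and only the verified gold sentences feed the next stage.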
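The second stage, distilling into BERT, amounts to fine-tuning a token-classification head on the verified gold sentences. Here is a hedged sketch with Hugging Face transformers, assuming a BIO tag set with a single SPECIES type; the checkpoint name, hyperparameters, and toy example are placeholders, and a domain-specific BERT variant could be swapped in for the general checkpoint exactly the same way.

```python
# Stage 2 (sketch): fine-tune a BERT variant for NER on the gold data.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

labels = ["O", "B-SPECIES", "I-SPECIES"]  # assumed BIO tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

def tokenize_and_align(example):
    """Tokenize into word pieces and align word-level BIO tags to them."""
    tokenized = tokenizer(example["tokens"], is_split_into_words=True,
                          truncation=True)
    aligned = []
    for word_id in tokenized.word_ids():
        # Special tokens get -100 so the loss ignores them; word pieces
        # inherit the label of the word they came from.
        aligned.append(-100 if word_id is None else example["tags"][word_id])
    tokenized["labels"] = aligned
    return tokenized

# Toy gold example; the real dataset has 1.8K verified NER sentences.
gold = Dataset.from_list([
    {"tokens": ["The", "Amur", "leopard", "is", "endangered", "."],
     "tags":   [0, 1, 2, 0, 0, 0]},
])
train = gold.map(tokenize_and_align, remove_columns=["tokens", "tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-endangered", num_train_epochs=3),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The design rationale the abstract gives for this step is cost: once the knowledge is in a compact BERT model, inference no longer requires the resource-intensive GPT-4.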