
NanoNER: Named Entity Recognition for nanobiology using experts' knowledge and distant supervision (2402.03362v1)

Published 30 Jan 2024 in cs.IR, cs.AI, and cs.CL

Abstract: Here we present the training and evaluation of NanoNER, a Named Entity Recognition (NER) model for nanobiology. NER consists of identifying specific entities in spans of unstructured text and is often a primary task in NLP and Information Extraction. The aim of our model is to recognise entities previously identified by domain experts as constituting the essential knowledge of the domain. Relying on ontologies, which provide us with a domain vocabulary and taxonomy, we implemented an iterative process enabling experts to determine the entities relevant to the domain at hand. We then explore the potential of distant supervision for NER, showing how this method can increase the quantity of annotated data with minimal additional manpower. On our full corpus of 728 full-text nanobiology articles, containing more than 120k entity occurrences, NanoNER obtained an F1-score of 0.98 on the recognition of previously known entities. Our model also demonstrated its ability to discover new entities in the text, with precision scores ranging from 0.77 to 0.81. Ablation experiments further confirmed this ability and allowed us to assess the dependency of our approach on the external resources: they highlighted this dependency while also confirming the model's ability to rediscover up to 30% of the ablated terms. This paper details the methodology employed, the experimental design, and the key findings, providing valuable insights and directions for future research on NER in specialized domains. Furthermore, since our approach requires minimal manpower, we believe that it can be generalized to other specialized fields.
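
To make the idea of distant supervision concrete, the following is a minimal sketch, not the authors' pipeline: it assumes the expert-validated ontology vocabulary is available as a plain list of terms and projects it onto tokenized text by longest-match lookup to produce BIO tags. The function name distant_label, the single ENT label, and the sample vocabulary are hypothetical and chosen only for illustration.

```python
def distant_label(tokens, vocabulary):
    """Assign BIO tags by longest-match lookup of vocabulary terms.

    Hypothetical helper: the vocabulary stands in for terms drawn
    from an ontology; matching is case-insensitive and token-based.
    """
    # Index each vocabulary term as a tuple of lowercase tokens.
    terms = {tuple(term.lower().split()) for term in vocabulary}
    max_len = max((len(term) for term in terms), default=0)

    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)

    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so that multi-word
        # terms win over their single-word prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in terms:
                matched = n
                break
        if matched:
            tags[i] = "B-ENT"
            for j in range(i + 1, i + matched):
                tags[j] = "I-ENT"
            i += matched
        else:
            i += 1
    return tags


# Illustrative input; the sentence and terms are made up for this sketch.
tokens = "Gold nanoparticles were coated with polyethylene glycol".split()
vocab = ["gold nanoparticles", "polyethylene glycol", "gold"]
labels = distant_label(tokens, vocab)
print(list(zip(tokens, labels)))
```

Labels produced this way are noisy (they miss unknown terms and ignore context), which is why they serve as training data for a model that can then generalize to entities absent from the vocabulary.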

Authors (3)
  1. Martin Lentschat (3 papers)
  2. Cyril Labbé (36 papers)
  3. Ran Cheng (130 papers)