NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data (2402.15343v1)

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have shown impressive abilities in data annotation, opening the way for new approaches to solve classic NLP problems. In this paper, we show how to use LLMs to create NuNER, a compact language representation model specialized in the Named Entity Recognition (NER) task. NuNER can be fine-tuned to solve downstream NER problems in a data-efficient way, outperforming similar-sized foundation models in the few-shot regime and competing with much larger LLMs. We find that the size and entity-type diversity of the pre-training dataset are key to achieving good performance. We view NuNER as a member of the broader family of task-specific foundation models, recently unlocked by LLMs.
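To make the abstract's pipeline concrete, below is a minimal sketch (not the authors' code) of the downstream step it describes: taking a compact NuNER-style encoder and fine-tuning it on a handful of labeled examples for a new NER label set. The checkpoint name (`roberta-base` as a stand-in for the released NuNER encoder), the toy two-sentence dataset, the BIO label set, and the hyperparameters are all illustrative assumptions, not details taken from the paper.

```python
# Minimal few-shot NER fine-tuning sketch with Hugging Face transformers.
# Assumptions: "roberta-base" stands in for the released NuNER encoder;
# the label set, toy examples, and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]
label2id = {l: i for i, l in enumerate(labels)}

# Tiny "few-shot" training set: pre-split words with BIO tags.
few_shot = [
    (["Marie", "Curie", "joined", "the", "Sorbonne"],
     ["B-PER", "I-PER", "O", "O", "B-ORG"]),
    (["Apple", "hired", "John", "Smith"],
     ["B-ORG", "O", "B-PER", "I-PER"]),
]

model_name = "roberta-base"  # placeholder; swap in the actual NuNER checkpoint
# add_prefix_space is required by BPE tokenizers when feeding pre-split words.
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
)

def encode(words, tags):
    """Tokenize pre-split words and align BIO tags to sub-word tokens."""
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    # Special tokens get label -100 so they are ignored by the loss.
    aligned = [label2id[tags[w]] if w is not None else -100
               for w in enc.word_ids()]
    enc["labels"] = torch.tensor([aligned])
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(10):                 # a few passes over the tiny set
    for words, tags in few_shot:
        batch = encode(words, tags)
        loss = model(**batch).loss      # token-level cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The paper's claim is that an encoder pre-trained on LLM-annotated, entity-type-diverse data makes exactly this kind of small-data fine-tuning work well; with a generic encoder in place of NuNER, the same script would typically need far more labeled examples.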

Authors (5)
  1. Sergei Bogdanov (4 papers)
  2. Alexandre Constantin (1 paper)
  3. Timothée Bernard (2 papers)
  4. Etienne Bernard (8 papers)
  5. Benoit Crabbé (2 papers)