
DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software Ecosystem (2402.16159v5)

Published 25 Feb 2024 in cs.CL

Abstract: With the AI revolution in place, the trend of building automated systems to support professionals in domains such as open source software systems, healthcare systems, banking systems, transportation systems, and many others has become increasingly prominent. A crucial requirement in automating support tools for such systems is the early identification of named entities, which serves as a foundation for developing specialized functionalities. However, because each domain has its own technical terminology and specialized language, expert annotation of the available data is expensive and challenging. In light of these challenges, this paper proposes a novel named entity recognition (NER) technique specifically tailored to open source software systems. Our approach addresses the scarcity of annotated software data by employing a comprehensive two-step distantly supervised annotation process. This process leverages language heuristics, unique lookup tables, external knowledge sources, and an active learning approach. With these techniques, we not only enhance model performance but also mitigate the limitations associated with cost and the scarcity of expert annotators. Notably, our model outperforms state-of-the-art LLMs by a substantial margin. We also show the effectiveness of NER in the downstream task of relation extraction.
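
To make the idea of lookup-table-based distant supervision concrete, the following is a minimal sketch of how a lookup table can assign noisy BIO tags to tokenized text without expert annotation. It is not the paper's pipeline: the function name `distant_label`, the entity types `OS` and `TOOL`, the lowercase matching, and the longest-match-first rule are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table: lowercase token tuples mapped to entity types.
# The entries and the type names (OS, TOOL) are illustrative assumptions.
Lookup = Dict[Tuple[str, ...], str]


def distant_label(tokens: List[str], lookup: Lookup, max_span: int = 3) -> List[str]:
    """Assign BIO tags by greedy longest-match against a lookup table."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_span, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            label = lookup.get(key)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    table: Lookup = {("ubuntu", "server"): "OS", ("apt",): "TOOL"}
    sentence = ["Install", "Ubuntu", "Server", "with", "apt"]
    print(list(zip(sentence, distant_label(sentence, table))))
```

Labels produced this way are incomplete and noisy, since any entity missing from the table is tagged `O`. That limitation is the kind of gap that additional heuristics, external knowledge sources, and active learning are meant to narrow.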
