MFE-NER: Multi-feature Fusion Embedding for Chinese Named Entity Recognition (2109.07877v2)

Published 16 Sep 2021 in cs.CL

Abstract: In Chinese Named Entity Recognition (NER), character substitution is a complicated linguistic phenomenon. Some Chinese characters are quite similar because they share the same components or have similar pronunciations. People replace characters in a named entity with similar characters, generating a new collocation that still refers to the same object. As a result, character substitution often leads to unrecognized entities or mislabeling errors in the NER task. In this paper, we propose a lightweight method, MFE-NER, which fuses glyph and phonetic features to help pre-trained language models handle the character substitution problem in NER at limited extra cost. In the glyph domain, we disassemble Chinese characters into Five-Stroke components to represent structural features. In the phonetic domain, we propose an improved phonetic system that makes it reasonable to describe phonetic similarity among Chinese characters. Experiments demonstrate that our method performs especially well in detecting character substitutions while slightly improving the overall performance of Chinese NER.
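The fusion idea is simple to sketch in code. The following is a minimal PyTorch illustration, not the authors' implementation: the module name (MFEEmbedding), the vocabulary sizes, the embedding dimensions, and fusion by concatenation are all assumptions made for demonstration, and the paper's pre-trained language model and improved phonetic encoding are abstracted into plain lookup tables.

```python
# Minimal sketch of a multi-feature fusion embedding for Chinese NER.
# Illustrative only: ids, dimensions, and the concatenation-based fusion
# are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class MFEEmbedding(nn.Module):
    def __init__(self, n_chars, n_glyph, n_phon,
                 d_char=128, d_glyph=32, d_phon=32):
        super().__init__()
        # Semantic embedding (in the paper, this role is played by a
        # pre-trained language model's character representations).
        self.char_emb = nn.Embedding(n_chars, d_char)
        # Glyph embedding over Five-Stroke (Wubi) component ids.
        self.glyph_emb = nn.Embedding(n_glyph, d_glyph)
        # Phonetic embedding over pinyin-style ids (the paper uses an
        # improved phonetic system; plain ids stand in for it here).
        self.phon_emb = nn.Embedding(n_phon, d_phon)

    def forward(self, char_ids, glyph_ids, phon_ids):
        # Fuse the three feature spaces by concatenation; the fused
        # vectors would feed a standard NER encoder (e.g. BiLSTM-CRF).
        return torch.cat([self.char_emb(char_ids),
                          self.glyph_emb(glyph_ids),
                          self.phon_emb(phon_ids)], dim=-1)

# Toy usage: one sentence of three characters.
emb = MFEEmbedding(n_chars=5000, n_glyph=300, n_phon=500)
chars = torch.tensor([[10, 42, 7]])
glyphs = torch.tensor([[3, 18, 25]])
phons = torch.tensor([[12, 90, 4]])
print(emb(chars, glyphs, phons).shape)  # torch.Size([1, 3, 192])
```

Because glyph and phonetic ids map visually or phonetically similar characters to nearby embeddings, a substituted character lands close to the original in the fused space, which is what lets the downstream tagger tolerate the substitution.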

Authors (2)
  1. Jiatong Li (47 papers)
  2. Kui Meng (3 papers)
Citations (10)