Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches (2307.00130v1)

Published 30 Jun 2023 in cs.CL

Abstract: Information extraction (IE) plays a very important role in NLP and is fundamental to many NLP applications that extract structured information from unstructured text data. Heuristic-based searching and data-driven learning are the two mainstream implementation approaches. However, not much attention has been paid to the influence of document genre and length on IE tasks. To fill the gap, in this study, we investigated the accuracy and generalization abilities of heuristic-based searching and data-driven learning in performing two IE tasks: named entity recognition (NER) and semantic role labeling (SRL) on domain-specific and generic documents of different lengths. We posited two hypotheses: first, short documents may yield better accuracy than long documents; second, generic documents may exhibit superior extraction outcomes relative to domain-dependent documents due to training document genre limitations. Our findings reveal that no single method demonstrated overwhelming performance in both tasks. For named entity extraction, data-driven approaches outperformed symbolic methods in terms of accuracy, particularly in short texts. In the case of semantic role extraction, we observed that the heuristic-based searching method and the data-driven model with syntax representation surpassed the performance of the pure data-driven approach, which only considers semantic information. Additionally, we discovered that different semantic roles exhibited varying accuracy levels with the same method. This study offers valuable insights for downstream text mining tasks, such as NER and SRL, when addressing various document features and genres.

Authors (2)
  1. Shiyu Yuan (4 papers)
  2. Carlo Lipizzi (10 papers)
Citations (2)