Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches (2307.00130v1)
Abstract: Information extraction (IE) plays a very important role in NLP and is fundamental to many NLP applications that extract structured information from unstructured text data. Heuristic-based searching and data-driven learning are the two mainstream implementation approaches. However, little attention has been paid to the influence of document genre and length on IE tasks. To fill this gap, we investigated the accuracy and generalization abilities of heuristic-based searching and data-driven learning on two IE tasks, named entity recognition (NER) and semantic role labeling (SRL), using domain-specific and generic documents of different lengths. We posited two hypotheses: first, short documents may yield better accuracy than long documents; second, generic documents may exhibit superior extraction outcomes relative to domain-dependent documents because of the limited genres of the training documents. Our findings reveal that no single method demonstrated overwhelming performance in both tasks. For named entity extraction, data-driven approaches outperformed symbolic methods in terms of accuracy, particularly on short texts. For semantic role extraction, we observed that the heuristic-based searching method and the data-driven model with syntax representation surpassed the pure data-driven approach, which considers only semantic information. Additionally, we discovered that different semantic roles exhibited varying accuracy levels under the same method. This study offers valuable insights for downstream text mining tasks, such as NER and SRL, when addressing various document features and genres.
Authors: Shiyu Yuan, Carlo Lipizzi