All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction (2311.08189v3)

Published 14 Nov 2023 in cs.CL

Abstract: Extracting key information from scientific papers can help researchers work more efficiently and accelerate the pace of scientific progress. In recent years, research on Scientific Information Extraction (SciIE) has produced several new systems and benchmarks. However, existing paper-focused datasets typically cover only specific parts of a manuscript (e.g., abstracts) and a single modality (i.e., text only or tables only), owing to complex processing and expensive annotation. Yet core information can appear in text, in tables, or across both. To close this gap in data availability and enable cross-modality IE while reducing labeling costs, we propose a semi-supervised pipeline that iteratively annotates entities in text, as well as entities and relations in tables. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and the semi-supervised annotation pipeline itself. We further report the performance of state-of-the-art IE models on the proposed benchmark as baselines. Lastly, we explore the capability of LLMs such as ChatGPT on this task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.
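The semi-supervised, iterative annotation procedure the abstract describes can be sketched as a self-training loop: a model trained on a small seed-labeled set pseudo-labels unlabeled sentences, and only high-confidence predictions are promoted into the training set for the next round. This is a minimal illustrative sketch, not the authors' actual pipeline; the trivial per-token "model", the confidence measure, and the threshold are all hypothetical stand-ins.

```python
from collections import Counter

def train(labeled):
    # "Train" a trivial model: per-token majority-vote label counts.
    # (A stand-in for a real tagger such as a fine-tuned SciBERT.)
    model = {}
    for tokens, labels in labeled:
        for tok, lab in zip(tokens, labels):
            model.setdefault(tok, Counter())[lab] += 1
    return model

def predict(model, tokens):
    # Label each token; confidence = fraction of tokens seen in training.
    labels, seen = [], 0
    for tok in tokens:
        if tok in model:
            labels.append(model[tok].most_common(1)[0][0])
            seen += 1
        else:
            labels.append("O")  # outside any entity
    return labels, seen / max(len(tokens), 1)

def self_train(seed_labeled, unlabeled, rounds=2, threshold=0.8):
    # Iterate: train, pseudo-label, keep only high-confidence sentences.
    # A real pipeline would insert human verification at this point.
    labeled, pool = list(seed_labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        still_unlabeled = []
        for tokens in pool:
            labels, conf = predict(model, tokens)
            if conf >= threshold:
                labeled.append((tokens, labels))  # accept pseudo-label
            else:
                still_unlabeled.append(tokens)    # defer to next round
        pool = still_unlabeled
    return train(labeled), pool
```

Each round grows the labeled set with confidently tagged sentences, so coverage expands without annotating every example by hand; sentences that never clear the threshold remain in the pool for manual labeling.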
