Large Language Models for Scientific Information Extraction: An Empirical Study for Virology (2401.10040v1)

Published 18 Jan 2024 in cs.CL, cs.AI, cs.DL, cs.IT, and math.IT

Abstract: In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel automated approach leverages the robust text generation capabilities of LLMs to produce structured scholarly contribution summaries, offering both a practical solution and insights into LLMs' emergent abilities. For LLMs, the prime focus is on improving their general intelligence as conversational agents. We argue that these models can also be applied effectively in information extraction (IE), specifically in complex IE tasks within terse domains like Science. This paradigm shift replaces the traditional modular, pipelined machine learning approach with a simpler objective expressed through instructions. Our results show that finetuned FLAN-T5 with 1000x fewer parameters than the state-of-the-art GPT-davinci is competitive for the task.
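The abstract's key technical claim is that structured contribution extraction can be posed as instruction-following text generation rather than as a modular, pipelined NER and relation-extraction system. Below is a minimal sketch of that formulation, assuming the Hugging Face transformers library and the public google/flan-t5-large checkpoint; the prompt wording, the example abstract, and the property names (disease, method, basic reproduction number) are illustrative placeholders rather than the paper's actual instruction template or contribution schema.

    # Minimal sketch: contribution extraction as instruction-following generation
    # with FLAN-T5. Uses the off-the-shelf checkpoint; the paper fine-tunes FLAN-T5.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

    # Hypothetical virology abstract, invented for illustration only.
    abstract = (
        "We estimate the basic reproduction number of the novel coronavirus "
        "from early case counts and obtain values between 4.7 and 6.6."
    )

    # A single instruction replaces the traditional pipeline of mention detection,
    # entity typing, and relation extraction. Property names are illustrative.
    prompt = (
        "Extract the research contribution from the abstract as "
        "'property: value' pairs. Properties of interest: disease, method, "
        "basic reproduction number.\n\n"
        f"Abstract: {abstract}"
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

In the paper's setting the instruction is not applied zero-shot as above: FLAN-T5 is fine-tuned on the extraction task, which, per the abstract, is what makes the roughly 1000x smaller model competitive with GPT-davinci.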

