
Combining Language and Graph Models for Semi-structured Information Extraction on the Web (2402.14129v1)

Published 21 Feb 2024 in cs.IR and cs.CL

Abstract: Relation extraction is an efficient way of mining the extraordinary wealth of human knowledge on the Web. Existing methods rely on domain-specific training data or produce noisy outputs. We focus here on extracting targeted relations from semi-structured web pages given only a short description of the relation. We present GraphScholarBERT, an open-domain information extraction method based on a joint graph and LLM structure. GraphScholarBERT can generalize to previously unseen domains without additional data or training and produces only clean extraction results matched to the search keyword. Experiments show that GraphScholarBERT can improve extraction F1 scores by as much as 34.8% compared to previous work in a zero-shot domain and zero-shot website setting.

Authors (3)
  1. Zhi Hong
  2. Kyle Chard
  3. Ian Foster