
PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network (2305.05378v1)

Published 9 May 2023 in cs.CL

Abstract: The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. Webpage classification is one of the key processes in web information mining. Some classical methods manually build features of web pages and train classifiers on them using machine learning or deep learning. However, building features manually requires specific domain knowledge, and validating those features usually takes a long time. Considering that webpages are generated by combining text with HTML Document Object Model (DOM) trees, we propose a representation and classification method based on a pre-trained language model and a graph neural network, named PLM-GNN. It is based on the joint encoding of the text and the HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset from a scholar homepage crawling project.
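
The abstract does not say how the DOM tree is turned into a graph, so the following is only a minimal sketch of one plausible preprocessing step, assuming each HTML element becomes a node and each parent-child relation becomes an edge. The names `DomGraphBuilder` and `html_to_graph` are hypothetical and do not come from the paper. The sketch uses only the Python standard library and leaves out both the language-model text encoding and the graph neural network.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Collects DOM elements as nodes and parent-child links as edges.

    Hypothetical helper for illustration; not the authors' implementation.
    """

    def __init__(self):
        super().__init__()
        self.tags = []    # node id -> tag name
        self.texts = []   # node id -> text found directly inside the node
        self.edges = []   # (parent id, child id) pairs
        self._stack = []  # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.tags)
        self.tags.append(tag)
        self.texts.append("")
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            node_id = self._stack[-1]
            self.texts[node_id] = (self.texts[node_id] + " " + text).strip()


def html_to_graph(html):
    """Return (tags, texts, edges) describing the DOM tree of `html`."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.tags, builder.texts, builder.edges
```

In a pipeline like the one the abstract describes, the per-node text would presumably be embedded by the pre-trained language model and the edge list passed to the graph neural network; both steps are assumptions here and are left to the reader's framework of choice.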

Authors (5)
  1. Qiwei Lang (2 papers)
  2. Jingbo Zhou (51 papers)
  3. Haoyi Wang (9 papers)
  4. Shiqi Lyu (1 paper)
  5. Rui Zhang (1138 papers)
Citations (2)