PLAtE: A Large-scale Dataset for List Page Web Extraction (2205.12386v2)
Abstract: Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier to continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items, encompassing the tasks of (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks, comparing their strengths and weaknesses both quantitatively and qualitatively.
Authors:
- Aidan San
- Yuan Zhuang
- Jan Bakus
- Colin Lockard
- David Ciemiewicz
- Sandeep Atluri
- Yangfeng Ji
- Kevin Small
- Heba Elfardy