Cleaner Pretraining Corpus Curation with Neural Web Scraping (2402.14652v3)
Abstract: The web contains large-scale, diverse, and abundant information that satisfies the information-seeking needs of humans. With careful collection, preprocessing, and curation, webpages serve as a fundamental data resource for LLM pretraining. However, as webpages grow ever more complex and varied in structure, rule-based and feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) that extracts the primary, clean text content from webpages. Experimental results show that NeuScraper surpasses baseline scrapers by more than 20%, demonstrating its potential for extracting higher-quality data to facilitate LLM pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.
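For context, the rule-based scrapers that serve as baselines typically strip markup and boilerplate tags heuristically before emitting text. The sketch below illustrates that rule-based paradigm in Python using BeautifulSoup; it is only an illustration of the kind of baseline the paper compares against, not NeuScraper's neural pipeline, and the particular tag list is an assumption.

```python
# Minimal rule-based extraction baseline (illustrative only; not NeuScraper's
# neural approach). Assumes the `beautifulsoup4` package is installed.
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    """Strip boilerplate tags and return the remaining visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop tags that rarely carry primary content (heuristic choice).
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Collapse the remaining text into non-empty, newline-separated lines.
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    sample = "<html><body><nav>Menu</nav><p>Primary article text.</p></body></html>"
    print(extract_text(sample))  # -> Primary article text.
```

Heuristics like this break down as page structure grows more intricate, which is the gap a learned (neural) scraper is meant to close.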