
Cleaner Pretraining Corpus Curation with Neural Web Scraping (2402.14652v3)

Published 22 Feb 2024 in cs.CL

Abstract: The web contains large-scale, diverse, and abundant information to satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can be used as a fundamental data resource for LLM pretraining. However, when confronted with the progressively revolutionized and intricate nature of webpages, rule-based/feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement, demonstrating its potential in extracting higher-quality data to facilitate the LLM pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.


Summary

  • The paper introduces NeuScraper, a neural web scraping method that improves text extraction quality by more than 20% over traditional rule-based and feature-based approaches.
  • It employs a shallow neural architecture that combines DOM layout information with XLM-RoBERTa text representations for precise node-level content extraction.
  • Empirical validation shows that NeuScraper improves downstream LLM performance and lowers perplexity on benchmarks such as WikiText and LAMBADA.

Enhancing Pretraining Corpus Quality with Neural Web Scraping: Insights from NeuScraper

The Imperative for High-Quality Text Extraction

With the ascendancy of LLMs across a wide range of NLP tasks, the need for high-quality, large-scale datasets has never been more critical. Because model size and training data must scale together, sourcing and preparing suitable pretraining datasets has become a challenge in its own right. Common web-crawled datasets, despite their ubiquity and scale, often fall short in quality due to irrelevant content such as advertisements and hyperlinks, underscoring the need for more sophisticated web scraping techniques.

Current Web Scraping Paradigms

Current methodologies predominantly leverage rule-based or feature-based web scrapers to sift through the chaotic heterogeneity of web pages. These approaches, while foundational, exhibit limitations in adaptability and require significant manual intervention to maintain, making them less viable with the increasing complexity of web pages. This backdrop sets the stage for exploring more dynamic, less labor-intensive solutions that can extract primary content with higher precision and efficiency.
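
For context, a typical rule-based/feature-based baseline is invoked roughly as follows. This is a hedged illustration using the open-source trafilatura library, one off-the-shelf scraper of this kind; it is not the paper's own pipeline, and the URL is a placeholder.

```python
# Illustrative baseline: extracting main text with an off-the-shelf
# rule/feature-based scraper (trafilatura). The URL is a placeholder.
import trafilatura

url = "https://example.com/some-article"      # hypothetical page
downloaded = trafilatura.fetch_url(url)       # raw HTML, or None on failure
if downloaded is not None:
    text = trafilatura.extract(downloaded)    # heuristic main-content extraction
    print(text)
```

Such heuristics work well on conventionally structured pages but must be re-tuned by hand as page layouts evolve, which is the maintenance burden the neural approach aims to remove.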

Introducing NeuScraper

This context brings us to NeuScraper, a neural approach to web scraping for pretraining corpora. NeuScraper replaces traditional rule-based and feature-based methods with a shallow neural architecture that incorporates webpage layout information, yielding a more than 20% improvement in text extraction quality over conventional baseline scrapers.

Leveraging Neural Architectures for Web Scraping

NeuScraper combines sequence modeling of webpages, informed by their DOM tree structure, with a hierarchical architecture for node-level prediction, using the XLM-RoBERTa model for text representation. This design allows NeuScraper to identify and extract primary content with notable precision, reflected in substantial gains in accuracy, precision, recall, and F1, along with notable reductions in latency when deployed on GPU infrastructure.
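
The sketch below shows what such a node-level architecture can look like: each DOM text node is embedded with XLM-RoBERTa, the node sequence passes through a shallow Transformer encoder, and a binary head labels each node as content or boilerplate. The pooling choice, layer counts, and module names are illustrative assumptions, not the authors' exact implementation (which is available in the linked repository).

```python
# Minimal sketch of node-level content classification in the spirit of
# NeuScraper. Hyperparameters and pooling are illustrative, not the
# authors' exact configuration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class NodeContentClassifier(nn.Module):
    def __init__(self, encoder_name: str = "xlm-roberta-base",
                 num_layers: int = 3, hidden: int = 768):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(encoder_name)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        # Shallow encoder over the sequence of node embeddings.
        self.node_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, 2)  # content vs. boilerplate

    def forward(self, node_input_ids, node_attention_mask):
        # node_input_ids: (num_nodes, seq_len) token ids, one row per DOM node.
        out = self.text_encoder(input_ids=node_input_ids,
                                attention_mask=node_attention_mask)
        node_emb = out.last_hidden_state[:, 0]       # first-token pooling
        node_seq = node_emb.unsqueeze(0)             # (1, num_nodes, hidden)
        node_seq = self.node_encoder(node_seq)       # model the node sequence
        return self.classifier(node_seq).squeeze(0)  # (num_nodes, 2)


tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
nodes = ["Main article paragraph about the topic.",
         "Subscribe to our newsletter!"]
batch = tokenizer(nodes, padding=True, truncation=True, return_tensors="pt")
model = NodeContentClassifier()
logits = model(batch["input_ids"], batch["attention_mask"])
keep = logits.argmax(dim=-1)  # after training, 1 would flag primary content
```

Keeping the encoder over node embeddings shallow is what makes the approach cheap enough to run over web-scale crawls, especially with GPU batching.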

Empirical Validation and Implications

The empirical validation of NeuScraper is compelling. Benchmarked against a suite of conventional web scrapers on the ClueWeb22 dataset, NeuScraper comes out ahead not only in immediate text extraction quality but also in the downstream quality of the pretraining corpora it produces: LLMs pretrained on data curated by NeuScraper showed improved performance on standard NLP tasks, supporting the hypothesis that the quality of pretraining data is a pivotal determinant of model efficacy.
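
As a toy illustration of the node-level scoring implied here, a scraper's decisions can be compared against human content annotations with standard classification metrics; the labels below are made-up values for illustration only.

```python
# Toy node-level evaluation: gold content labels vs. a scraper's predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = primary-content node (human annotation)
pred = [1, 0, 0, 0, 1, 1, 1, 0]   # a scraper's node-level decisions

acc = accuracy_score(gold, pred)
prec, rec, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print(f"acc={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```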

Moreover, the lower perplexity on held-out corpora such as WikiText and LAMBADA further underscores the qualitative advantage conferred by NeuScraper's neural approach to web scraping. This result is particularly indicative of the potential of neural web scrapers to raise the baseline for dataset preparation in NLP.
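
For reference, perplexity on a held-out corpus is the exponential of the mean token-level cross-entropy under the model. The sketch below shows a typical way to measure it on WikiText; the GPT-2 checkpoint and the simple non-overlapping windows are stand-ins for convenience, not the models or exact protocol evaluated in the paper.

```python
# Minimal sketch: perplexity of a causal LM on the WikiText-2 test split.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1",
                                split="test")["text"])
enc = tokenizer(text, return_tensors="pt")

window = 512
nlls, n_tokens = [], 0
for start in range(0, enc.input_ids.size(1) - 1, window):
    ids = enc.input_ids[:, start:start + window]
    if ids.size(1) < 2:               # skip degenerate trailing window
        continue
    with torch.no_grad():
        out = model(ids, labels=ids)  # mean cross-entropy over this window
    nlls.append(out.loss * (ids.size(1) - 1))
    n_tokens += ids.size(1) - 1

ppl = math.exp(torch.stack(nlls).sum().item() / n_tokens)
print(f"perplexity: {ppl:.2f}")       # lower is better
```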

Future Trajectories and Considerations

The exploration of neural web scraping, as exemplified by NeuScraper, opens new possibilities for curating pretraining corpora. The demonstrated effectiveness and efficiency of neural approaches in handling the complexities of web content extraction point to a promising avenue for advancing the capabilities of LLMs. The approach does, however, come with infrastructure requirements: GPU parallelism for high-speed scraping and high-throughput storage media for large-scale corpus processing.

Conclusion

In summation, NeuScraper's contribution to the corpus curation process for LLM pretraining is illustrative of the broader potential of neural methods in web scraping. By addressing the limitations of existing rule-based and feature-based approaches, NeuScraper not only showcases the immediate benefits in terms of data quality and scraping efficiency but also sets a new benchmark for future explorations in the field. As we venture further into the development of more sophisticated LLMs, the role of advanced web scraping methods like NeuScraper will undoubtedly become increasingly central.