Wrapper Maintenance: A Machine Learning Approach (1106.4872v1)

Published 24 Jun 2011 in cs.AI

Abstract: The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.

Summary

The paper proposes a machine learning framework for maintaining data extraction wrappers. It comprises the DATAPROG algorithm for learning data structure patterns, a verification system that uses these patterns together with numeric features, and a reinduction process that combines supervised and unsupervised learning.
In the empirical evaluation, the verification system detected 35 of 37 wrapper changes (recall 0.95, precision 0.73), outperforming the earlier RAPTURE verification approach.
The reinduction process recovered broken wrappers for single-tuple sources with 0.90 precision and 0.80 recall on the extraction task, making web data extraction more robust to frequent site changes.

Overview of "Wrapper Maintenance: A Machine Learning Approach"

The paper "Wrapper Maintenance: A Machine Learning Approach" by Lerman, Minton, and Knoblock addresses the problem of keeping data extraction wrappers for semi-structured web sources working as those sources change their formats. Earlier research focused primarily on generating wrappers quickly and paid less attention to maintenance. This paper proposes a machine learning approach to the two main parts of wrapper maintenance: verification and reinduction.

Key Contributions

Data Prototype Learning: The cornerstone of the paper is the development of the DATAPROG algorithm, designed to learn the structural patterns of data fields from positive examples without requiring negative examples. This task is framed as a conservation task rather than a classification task to ensure that all structural regularities, including redundant features, are captured. DATAPROG leverages a token-level representation and hypothesis testing to evaluate the statistical significance of learned data patterns.
Wrapper Verification: The authors introduce a system that detects incorrect data extraction caused by changes in web source structure. It uses the learned data prototypes alongside numeric features such as token densities. Their empirical validation involved monitoring 27 wrappers over 10 months; the system detected 35 of 37 wrapper changes, for a recall of 0.95 and a precision of 0.73.
Wrapper Reinduction: The paper presents a reinduction process that allows the automatic recovery of broken wrappers by relabeling data on modified web pages. This process combines supervised and unsupervised learning, facilitating the extraction and proper labeling of data segments, which the STALKER wrapper induction system subsequently uses to regenerate extraction rules. Notably, the reinduced wrappers demonstrated precision and recall values of 0.90 and 0.80, respectively, in their extraction tasks.
Page Template Learning: The paper outlines a page template induction algorithm that discerns static templates used across web pages, which can be pivotal in identifying variable data slots, thus enhancing the reinduction process.
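
The page-template idea in the last item can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes, for illustration only, that template tokens are those occurring exactly once on every sample page and in the same relative order, and that the text between them fills the data slots. The function names (`tokenize`, `induce_template`, `extract_slots`) and the sample pages are hypothetical.

```python
import re


def tokenize(page: str) -> list[str]:
    """Split a page into HTML tags and whitespace-delimited text tokens."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def induce_template(pages: list[str]) -> list[str]:
    """Return tokens that occur exactly once on every page, in a shared order.

    Assumption for this sketch: a greedy left-to-right pass over the first
    page is enough; a token is kept only if it appears after the previously
    kept token on every page.
    """
    token_lists = [tokenize(p) for p in pages]
    candidates = [
        tok for tok in token_lists[0]
        if all(toks.count(tok) == 1 for toks in token_lists)
    ]
    template = []
    last_pos = [-1] * len(token_lists)
    for tok in candidates:
        pos = [toks.index(tok) for toks in token_lists]
        if all(p > lp for p, lp in zip(pos, last_pos)):
            template.append(tok)
            last_pos = pos
    return template


def extract_slots(page: str, template: list[str]) -> list[str]:
    """Collect the non-tag text found between consecutive template tokens."""
    slots, current = [], []
    remaining = iter(template)
    expected = next(remaining, None)
    for tok in tokenize(page):
        if tok == expected:
            if current:
                slots.append(" ".join(current))
            current = []
            expected = next(remaining, None)
        elif not tok.startswith("<"):
            # Markup that is not part of the template is skipped, not stored.
            current.append(tok)
    if current:
        slots.append(" ".join(current))
    return slots


# Hypothetical detail pages sharing one layout.
pages = [
    "<html><b>Name:</b> Joe's Pizza <br><b>Phone:</b> 310-555-0101 </html>",
    "<html><b>Name:</b> Thai Garden <br><b>Phone:</b> 213-555-0199 </html>",
]
template = induce_template(pages)
slots = [extract_slots(p, template) for p in pages]
```

In this sketch the repeated `<b>` and `</b>` tags are excluded from the template because they occur more than once per page. A data value that happens to be identical on every sample page would be mistaken for template text, so the sketch needs sample pages whose data actually differ.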

Experimental Evaluation

The experiments indicate that DATAPROG-based verification is more robust than the earlier RAPTURE approach, achieving higher precision and recall because of its pattern-based representation. For reinduction, the method worked well on sources that return a single tuple per page ("detail pages") but was less effective on sources that present lists, which marks where further algorithmic development is needed.
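
Pattern-based verification of the kind discussed above can be sketched roughly as follows. This is a simplified illustration, not DATAPROG itself. It assumes a small, made-up set of token types. It replaces the paper's hypothesis testing with a plain support threshold and omits numeric features such as token densities. It flags a wrapper when the share of extracted values matching the learned patterns drops by more than an arbitrary tolerance. All function names, thresholds, and sample values are hypothetical.

```python
import re
from collections import Counter


def tokenize(value: str) -> list[str]:
    """Split a field value into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", value)


def token_type(tok: str) -> str:
    """Map a token to a coarse syntactic type (illustrative type set)."""
    if tok.isdigit():
        return "NUMBER"
    if tok.isalpha():
        if tok.isupper():
            return "ALLCAPS"
        if tok[0].isupper():
            return "CAPITALIZED"
        return "LOWER"
    if tok.isalnum():
        return "ALPHANUM"
    return tok  # punctuation is kept as a literal


def signature(value: str, max_len: int) -> tuple[str, ...]:
    """Type sequence of the first max_len tokens of a value."""
    return tuple(token_type(t) for t in tokenize(value)[:max_len])


def learn_patterns(examples, max_len=3, min_support=0.5):
    """Keep starting type sequences shared by at least min_support of the examples.

    Only positive examples are used; the support threshold stands in for the
    significance test described in the paper.
    """
    counts = Counter(signature(ex, max_len) for ex in examples)
    threshold = min_support * len(examples)
    return {pattern for pattern, n in counts.items() if n >= threshold}


def coverage(values, patterns, max_len=3):
    """Fraction of values whose starting type sequence is a learned pattern."""
    if not values:
        return 0.0
    hits = sum(signature(v, max_len) in patterns for v in values)
    return hits / len(values)


def looks_broken(train_values, new_values, patterns, tolerance=0.25):
    """Flag the wrapper if coverage on new extractions drops by more than tolerance."""
    return coverage(new_values, patterns) < coverage(train_values, patterns) - tolerance


# Hypothetical street-address field extracted while the wrapper was working.
train = ["4676 Admiralty Way", "10 Main Street", "220 Lincoln Blvd", "PO Box 55"]
patterns = learn_patterns(train)

# After a layout change the wrapper starts returning markup fragments.
changed = ["<b>4676 Admiralty Way</b>", "<b>10 Main Street</b>"]
unchanged = ["12 Ocean Avenue", "7 Park Road"]
```

Here `looks_broken(train, changed, patterns)` flags the wrapper because none of the new values start with the learned number-then-capitalized-words pattern, while `looks_broken(train, unchanged, patterns)` does not. A fixed tolerance is a crude stand-in for a proper statistical comparison of the two distributions.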

Implications and Future Directions

This paper advances the practical deployment of intelligent web agents by addressing wrapper maintenance, a traditionally overlooked aspect essential for sustainable web data extraction. The implications extend to a broader range of semi-structured data sources where changes are frequent, potentially increasing the reliability and applicability of data-intensive applications.

For future work, improving the reinduction algorithm to handle list-type web sources effectively will be crucial. Expanding the capabilities of automatic learning systems, like DATAPROG, to generate wrappers for new domains without manual intervention remains a critical avenue of research. The automation of such processes could lead towards the autonomous upkeep of large-scale web scraping frameworks, which is vital in coping with the constantly evolving landscape of web content.

In summary, this paper provides significant insights into overcoming the challenges of wrapper maintenance through machine learning, underscoring its fundamental role in efficient web data extraction practices.