Efficient Execution of Interleaved Navigation-and-Gathering Plans

Develop efficient execution techniques for information-gathering plans that interleave navigation with data retrieval across multiple hyperlinked web pages (e.g., by following successive “Next page” links). Such plans should process distributed result sets with high performance, traversing page sequences of indeterminate length and collecting their results without blocking.

Background

Web-based information integration often requires wrappers that extract structured data from semi-structured HTML and that handle result lists spread across multiple pages connected by “Next” links. Prior work has explored incorporating navigation into query answering during the query processing phase, but how to efficiently execute plans that interleave navigation with data gathering remains an open issue. This problem is central for agents that must loop over an unknown number of pages while streaming and processing results promptly.

The paper motivates this need with practical tasks (e.g., classifieds, real estate listings) in which a single logical relation is distributed over many pages. The authors note a gap: existing solutions mostly address how navigation is handled at the query processing phase, not the runtime execution architecture needed to perform such interleaved retrieval efficiently, robustly, and in parallel.
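
The looping-and-streaming pattern described above can be illustrated with a minimal Python sketch. It is not the execution system from the cited paper; it only shows one way to follow “Next” links in a worker thread while handing rows to the consumer as soon as each page is parsed. The in-memory `PAGES` table, the `fetch` function, and the `stream_results` name are all hypothetical stand-ins for a real HTTP fetch and wrapper.

```python
import queue
import threading

# Hypothetical in-memory "site": page key -> (rows on that page, next-page key or None).
PAGES = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c"], "p3"),
    "p3": (["d", "e"], None),
}


def fetch(url):
    """Stand-in for an HTTP fetch plus wrapper extraction (an assumption)."""
    return PAGES[url]


_DONE = object()  # sentinel marking the end of the page sequence


def stream_results(start_url, buffer_size=4):
    """Yield rows as they are extracted while a worker follows Next links."""
    out = queue.Queue(maxsize=buffer_size)

    def navigate():
        url = start_url
        try:
            # The number of pages is unknown in advance: loop until no Next link.
            while url is not None:
                rows, url = fetch(url)
                for row in rows:
                    out.put(row)  # blocks only when the consumer falls behind
        finally:
            out.put(_DONE)  # always signal completion, even if fetch fails

    threading.Thread(target=navigate, daemon=True).start()

    while True:
        item = out.get()
        if item is _DONE:
            return
        yield item


if __name__ == "__main__":
    for row in stream_results("p1"):
        print(row)
```

The bounded queue is a deliberate choice in this sketch: it lets navigation run ahead of the consumer by a few rows without buffering the whole result set, which matters when the page sequence has no known length.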

References

However, such solutions mostly address the query processing phase and it remains an open issue regarding how to execute these types of information gathering plans efficiently.

An Expressive Language and Efficient Execution System for Software Agents (1109.2048 - Barish et al., 2011) in Section 2.2 (Web-based Information Gathering and Integration)