Ontology-Driven Hidden Web Crawler

Updated 27 April 2026
  • An ontology-driven hidden web crawler is a semantic system that leverages domain ontologies to interpret HTML forms and extract deep web content.
  • It integrates modules such as an ontology builder, hidden web miner, and result processor to map form fields and generate prioritized queries.
  • Empirical results demonstrate improved precision and retrieval coverage over keyword-based crawlers through advanced semantic mapping and automated schema integration.

An ontology-driven hidden web crawler is a semantic web-oriented information retrieval system that automatically accesses, parses, and extracts content from the "deep web" (resources accessible only through HTML forms) by leveraging formal domain ontologies. This approach enables precise interpretation of form structures, accurate mapping of form fields to knowledge representations, and automated, domain-specific query generation and integration. The architecture is rooted in semantic modeling, multi-layered label analysis, and rule-based query rewriting, and is reported to achieve higher coverage, precision, and adaptability than traditional keyword-based crawlers (Manvi et al., 2015; Furche et al., 2012).

1. System Architecture and Workflow

Ontology-driven hidden web crawlers comprise a modular pipeline integrating ontology construction, deep web mining, semantic label mapping, query orchestration, and adaptive result processing. In the design of Manvi et al. (2015), the system decomposes into the following key modules:

  • Central Coordinator: Orchestrates and synchronizes all modules, maintains shared buffers, and interfaces with the end user.
  • Ontology Builder: Ingests a set of seed URLs per domain, downloads accessible pages, extracts RDF data, and constructs the domain ontology as a directed labeled graph, storing (subject, predicate, object) ontology tuples in a Domain-Specific Database (DSDB).
  • Hidden Web Miner:
    • Form Downloader: Identifies and retrieves HTML form pages.
    • Form Analyzer & Ontology Creator: Parses form fields, constructs form-level ontologies.
    • Mapping Manager: Conducts semantic matching between form fields and DSDB properties.
    • Query Generator: Instantiates and orders domain-relevant query tuples for submission.
  • Result Processor: Filters erroneous or irrelevant pages, ranks and presents results to the user, and updates DSDB with newly discovered ontology fragments.
  • Domain-Specific Database (DSDB): Persistent storage for domain ontology structures, supporting both candidate value retrieval and incremental enrichment.
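
The module list above can be sketched as a small coordinator that runs the stages in order and shares one DSDB between them. This is a minimal illustration, not the authors' implementation: the class names `DSDB` and `Coordinator`, the callable signatures, and the toy stub modules are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class DSDB:
    """In-memory stand-in for the Domain-Specific Database (hypothetical)."""
    triples: Set[Triple] = field(default_factory=set)

    def add(self, new_triples: List[Triple]) -> int:
        """Store triples and return how many were actually new."""
        before = len(self.triples)
        self.triples.update(new_triples)
        return len(self.triples) - before


@dataclass
class Coordinator:
    """Runs the modules in order and shares the DSDB between them."""
    build_ontology: Callable[[List[str]], List[Triple]]
    mine_hidden_web: Callable[[DSDB], List[str]]
    process_results: Callable[[List[str]], List[Triple]]
    dsdb: DSDB = field(default_factory=DSDB)

    def crawl(self, seed_urls: List[str]) -> int:
        self.dsdb.add(self.build_ontology(seed_urls))      # Ontology Builder
        pages = self.mine_hidden_web(self.dsdb)            # Hidden Web Miner
        return self.dsdb.add(self.process_results(pages))  # Result Processor


# Toy stubs standing in for the real modules (assumed for illustration).
coordinator = Coordinator(
    build_ontology=lambda urls: [("Book", "hasProperty", "author")],
    mine_hidden_web=lambda dsdb: ["<html>result page</html>"],
    process_results=lambda pages: [("Book", "hasProperty", "price"),
                                   ("Book", "hasProperty", "author")],
)
newly_learned = coordinator.crawl(["http://example.org/books"])
```

Passing the modules in as callables keeps the coordinator independent of how each stage is implemented; `crawl` returns the number of triples the result processor contributed that were not already stored.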

The OPAL framework (Furche et al., 2012) specifies a related workflow: (1) fetch and render the target page to extract the DOM and CSS bounding boxes; (2) perform multi-scope form labeling (field, segment, layout) to accurately attach semantic labels; (3) interpret and annotate fields using OPAL-TL Datalog templates over the ontology-derived schema; (4) execute master query translation and form fill-in with similarity-based alignment and rewriting; (5) extract the resulting deep web content.

2. Formal Ontology Representation and Domain Schema Design

The underlying ontology model is expressed as the triple $O = (C, P, R)$, where:

  • $C = \{c_1, \ldots, c_n\}$ is the finite set of domain classes (e.g., Book, Flight),
  • $P = \{p_1, \ldots, p_m\}$ is the set of datatype properties or attributes (e.g., author, price, flightNumber),
  • $R = \{(c_i, c_j, r_k)\}$ is the set of labeled binary relations between classes.

This formalism renders the ontology as a labeled, directed graph $G = (V, E)$ with vertices $V = C \cup P$ and edge set $E \subseteq V \times V$ defined by $R$. The persistent DSDB stores each triple (subject, predicate, object) as a database row.
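
As a rough illustration of storing the graph as one row per triple, the sketch below uses an in-memory SQLite table. The table name `triples`, the helper names, and the example predicates (`hasProperty`, `writtenBy`) are assumptions for the example; the source does not specify the DSDB schema.

```python
import sqlite3


def open_dsdb():
    """Create an in-memory triple table; each row is one labeled edge."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        "subject TEXT, predicate TEXT, object TEXT, "
        "UNIQUE (subject, predicate, object))"
    )
    return conn


def add_triple(conn, subject, predicate, obj):
    """Insert an edge; duplicates are silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
        (subject, predicate, obj),
    )


def out_edges(conn, subject):
    """Return (predicate, object) pairs leaving a vertex, sorted."""
    rows = conn.execute(
        "SELECT predicate, object FROM triples "
        "WHERE subject = ? ORDER BY predicate, object",
        (subject,),
    )
    return rows.fetchall()


conn = open_dsdb()
add_triple(conn, "Book", "hasProperty", "author")
add_triple(conn, "Book", "hasProperty", "price")
add_triple(conn, "Book", "writtenBy", "Author")
add_triple(conn, "Book", "hasProperty", "price")  # duplicate, ignored
book_edges = out_edges(conn, "Book")
```

The uniqueness constraint makes incremental enrichment idempotent: re-inserting a triple found on a later crawl leaves the graph unchanged.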

In OPAL, the domain schema $\Sigma$ is specified using OPAL-TL, a Datalog-with-Templates language. The schema extends the base annotation model (types, subtype relations, constraints) to define a set of domain types $T$, part-of relations, annotation constraints, and structural type constraints. OPAL provides template families for object, free-text, categorical, and numeric types, which can be instantiated per property to automate schema specification. The schema is typically derived from an existing OWL/RDF ontology via normalization and annotation rule generation (Furche et al., 2012).

3. Semantic Labeling and Field-to-Ontology Mapping

Mapping raw HTML forms to ontology properties is a two-stage process: labels are first extracted from the form and then matched to ontology properties using semantic and statistical similarity functions.

3.1 Label Extraction

Form labeling is achieved through a cascade of scopes:

  • Field Scope: Labels are attached to fields via DOM structure, using the <label> element and nearest unique ancestors. False-positive rates are empirically low (about 0.3%).
  • Segment Scope: Collapses uninformative DOM nodes, grouping related fields and propagating shared label groups, capturing multi-field groupings (e.g., "Bedrooms: [ ] [ ] [ ]").
  • Layout Scope: Uses rendered CSS spatial adjacency (bounding boxes) to assign visual neighbors (W, NW, N) as labels, resolving ambiguities with formal overshadowing logic.
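
The simplest field-scope case, an explicit `<label for="...">` pointing at a field id, can be sketched with Python's standard `html.parser`. This covers only that one case and is not OPAL's algorithm: ancestor-based attachment, the segment scope, and the layout scope (which needs a rendering engine for bounding boxes) are out of scope here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class FieldScopeLabeler(HTMLParser):
    """Collects <label for="..."> text and the form fields it refers to."""

    def __init__(self):
        super().__init__()
        self.labels = {}          # field id -> label text
        self.fields = []          # (id, name) pairs in document order
        self._current_for = None  # id targeted by the label being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and "for" in attrs:
            self._current_for = attrs["for"]
            self._buffer = []
        elif tag in ("input", "select", "textarea"):
            self.fields.append((attrs.get("id"), attrs.get("name")))

    def handle_data(self, data):
        if self._current_for is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._current_for is not None:
            # Collapse internal whitespace in the collected label text.
            text = " ".join("".join(self._buffer).split())
            self.labels[self._current_for] = text
            self._current_for = None


def label_fields(html):
    """Map each field name to its explicit label, or None if unlabeled."""
    parser = FieldScopeLabeler()
    parser.feed(html)
    parser.close()
    return {name: parser.labels.get(fid) for fid, name in parser.fields}


sample_form = """
<form>
  <label for="a">Author:</label> <input id="a" name="author_name">
  <label for="p">Max   price</label> <input id="p" name="max_price">
  <input id="q" name="keywords">
</form>
"""
field_labels = label_fields(sample_form)
```

Fields left with `None` are the ones that the later scopes would have to resolve.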

3.2 Semantic Matching

Each form field $f$ is aligned to an ontology property $p$ using a weighted similarity measure that combines a string component and a semantic component. The string component is typically Jaccard string similarity, and the semantic component uses WordNet-based synonymy; two thresholds, one for exact-label matches and one for synonym matches, control mapping confidence.
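
A minimal sketch of such a weighted measure follows. Several choices here are assumptions rather than details from the cited work: the score is taken to be a convex combination `alpha * string + (1 - alpha) * semantic`, Jaccard similarity is computed over lowercase word tokens, a small hand-written `SYNONYMS` table stands in for WordNet, and a single acceptance threshold replaces the two thresholds described above.

```python
import re

# Hypothetical synonym pairs standing in for a WordNet lookup.
SYNONYMS = {frozenset({"author", "writer"}), frozenset({"price", "cost"})}


def tokens(text):
    """Lowercase alphanumeric word tokens of a label or property name."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def synonym_score(a, b):
    """1.0 if any token pair is identical or listed as synonyms, else 0.0."""
    for x in tokens(a):
        for y in tokens(b):
            if x == y or frozenset({x, y}) in SYNONYMS:
                return 1.0
    return 0.0


def combined(label, prop, alpha=0.5):
    """Assumed convex combination of string and semantic similarity."""
    return alpha * jaccard(label, prop) + (1 - alpha) * synonym_score(label, prop)


def best_property(label, properties, alpha=0.5, threshold=0.5):
    """Return the highest-scoring property, or None below the threshold."""
    score, prop = max((combined(label, p, alpha), p) for p in properties)
    return prop if score >= threshold else None


properties = ["author", "price", "title"]
match_writer = best_property("Writer", properties)
match_price = best_property("Book price", properties)
match_isbn = best_property("ISBN", properties)
```

With these settings, "Writer" maps to `author` through the synonym table alone, "Book price" maps to `price` through both components, and "ISBN" is left unmapped.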

In OPAL, this process yields a form model: a labeled tree whose nodes are classified into ontology types, satisfying schema constraints via a combination of gazetteer-based annotations and model repair operations in linear time (Furche et al., 2012).

4. Automated Query Generation and Integration

Once mappings are established, query generation is formalized as follows:

Given form fields $f_1, \ldots, f_k$ mapped to ontology properties $p_1, \ldots, p_k$, let $V(p_i)$ denote the set of candidate values for $p_i$ from the DSDB. All $k$-ary tuples $(v_1, \ldots, v_k) \in V(p_1) \times \cdots \times V(p_k)$ represent concrete query instantiations. Each query $q$ is assigned a priority weight $w(q)$, computed from per-value scores that reflect historical value use or relevance in the DSDB.

Queries are submitted in order of descending $w(q)$ (i.e., most relevant first), subject to a resource budget. Returned result pages are evaluated, filtered, ranked, and, when relevant, trigger ontology enrichment (Manvi et al., 2015).
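
The enumerate, weight, and truncate steps can be sketched as below. The way per-value scores are combined into a query weight is an assumption (a product is used here; the source's exact formula is not reproduced), and the function name, the candidate values, and their scores are made up for illustration.

```python
from itertools import product
from math import prod


def generate_queries(candidates, budget):
    """Enumerate all value combinations and keep the `budget` best.

    candidates: {property: {value: score}}, where each score stands for
    the value's historical use or relevance in the DSDB.
    Returns a list of (weight, {property: value}), highest weight first.
    """
    props = sorted(candidates)
    queries = []
    # Cartesian product over the candidate (value, score) pairs per property.
    for combo in product(*(candidates[p].items() for p in props)):
        weight = prod(score for _, score in combo)  # assumed combination rule
        assignment = dict(zip(props, (value for value, _ in combo)))
        queries.append((weight, assignment))
    queries.sort(key=lambda q: q[0], reverse=True)
    return queries[:budget]


candidates = {
    "author": {"Knuth": 0.9, "Dijkstra": 0.6},
    "format": {"hardcover": 0.5, "paperback": 1.0},
}
top_queries = generate_queries(candidates, budget=2)
all_queries = generate_queries(candidates, budget=10)
```

This version materializes the full Cartesian product before sorting, which is only practical for small forms; a larger query space would need lazy or pruned enumeration.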

In OPAL, master queries over the domain schema $\Sigma$ are rewritten into form fill-ins using similarity, type-based field determination, and explicit mapping rules. Free-text, select, radio, and multi-select widgets are handled by type-appropriate algorithms, and optimistic semantics drop constraints that do not correspond to form fields (Furche et al., 2012).
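
The following sketch illustrates the general idea of type-based field determination with optimistic dropping of unmatched constraints. It is not OPAL's rewriting procedure: the dictionary-based form model, the `fill_form` function, and the use of `difflib` string similarity to pick a select option are all assumptions for the example.

```python
from difflib import get_close_matches


def fill_form(master_query, form_fields):
    """Translate {type: value} constraints into {field_name: value} fill-ins.

    Returns (fill_ins, dropped); `dropped` lists constraint types with no
    usable field, which the optimistic semantics simply ignores.
    """
    fields_by_type = {f["type"]: f for f in form_fields}
    fill_ins, dropped = {}, []
    for ctype, value in master_query.items():
        form_field = fields_by_type.get(ctype)
        if form_field is None:
            dropped.append(ctype)                    # no field of this type
        elif form_field["widget"] == "text":
            fill_ins[form_field["name"]] = value     # free text: pass through
        else:
            # select / radio: choose the most similar available option
            options = {o.lower(): o for o in form_field["options"]}
            best = get_close_matches(value.lower(), list(options), n=1, cutoff=0.6)
            if best:
                fill_ins[form_field["name"]] = options[best[0]]
            else:
                dropped.append(ctype)
    return fill_ins, dropped


# Hypothetical form model and master query for a real-estate search form.
form_fields = [
    {"name": "loc", "type": "location", "widget": "text"},
    {"name": "beds", "type": "bedrooms", "widget": "select",
     "options": ["Any", "2 bedrooms", "3 bedrooms"]},
]
master_query = {"location": "Oxford", "bedrooms": "3 bedroom", "garden": "yes"}
fill_ins, dropped = fill_form(master_query, form_fields)
```

Here the `garden` constraint has no counterpart on the form, so it is dropped rather than causing the whole query to fail.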

5. Empirical Results and Evaluation Metrics

Performance is assessed using established information retrieval measures:

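  • Precision $= \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}$
  • Recall $= \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}$
  • Correct-Page Rate $C\,(\%) = 100 \times \frac{\text{Correct Pages}}{\text{Pages Retrieved}}$
  • Useful-Page Rate $U\,(\%) = 100 \times \frac{\text{Useful Pages}}{\text{Correct Pages}}$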

| Domain | Sites Visited | Forms Found | Pages Retrieved | Correct Pages | Useful Pages | C (%) | U (%) |
|---|---|---|---|---|---|---|---|
| Airline | 15 | 25 | 150 | 118 | 95 | 78.6 | 81.0 |
| Books | 12 | 20 | 190 | 120 | 102 | 63.2 | 85.0 |

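Baseline correct-page rates are 50–60% and useful-page rates 70–75%. The ontology-driven crawler exceeds both ranges in each domain, most clearly for correct pages in the Airline domain (78.6%) and useful pages in the Books domain (85.0%).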
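
The two page rates can be recomputed from the raw counts in the table. The sketch assumes C is correct pages over pages retrieved and U is useful pages over correct pages, which is the reading that best matches the reported percentages; the function name is hypothetical. The recomputed Airline values (78.7 and 80.5) differ slightly from the reported 78.6 and 81.0, which may reflect rounding in the source.

```python
def page_rates(retrieved, correct, useful):
    """Return (C, U) in percent, rounded to one decimal place.

    Assumes C = correct / retrieved and U = useful / correct.
    """
    c_rate = 100 * correct / retrieved
    u_rate = 100 * useful / correct
    return round(c_rate, 1), round(u_rate, 1)


# (pages retrieved, correct pages, useful pages) taken from the table
counts = {
    "Airline": (150, 118, 95),
    "Books": (190, 120, 102),
}
rates = {domain: page_rates(*row) for domain, row in counts.items()}
```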

In OPAL, across benchmarks such as ICQ and TEL-8, the domain-independent labeling F-score reaches 95–100%, and full schema-based classification yields precision and recall of about 97%. Query integration error rates are below 6% on all constraints, and 87% of forms are filled perfectly. Processing time for 80% of forms with large DOMs is within 30 seconds per form (Furche et al., 2012).

6. Strengths, Limitations, and Prospective Enhancements

Ontology-driven hidden web crawling demonstrates clear advantages:

  • Semantic Precision: Substantially higher coverage and useful retrieval via knowledge-driven mapping and adaptive result enrichment.
  • Domain Adaptivity: Incrementally refined DSDB supports improved future crawls.
  • Robustness and Portability: Domain schemas (OPAL-TL) can be rapidly instantiated and reused; adapting to new domains or evolving form structures entails minimal effort.

Limitations include initial ontology construction effort, sensitivity of label matching to heterogeneous vocabularies, and the combinatorial query explosion for forms with many fields.
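
To make the combinatorial growth concrete, write $V(p_i)$ for the set of candidate values of the $i$-th mapped property (notation assumed here). The number of possible queries is then the product of the candidate-set sizes. The figures below are illustrative only and do not come from the cited papers.

```latex
% Size of the query space for k mapped fields with candidate sets V(p_i)
|Q| = \prod_{i=1}^{k} |V(p_i)|

% Illustrative case (assumed numbers): k = 6 fields, 20 candidate values each
|Q| = 20^{6} = 64{,}000{,}000
```

Even a modest form can therefore exceed any realistic submission budget, which is why prioritization and pruning matter.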

Proposed improvements include more advanced semantic similarity (e.g., embedding-based models), query space pruning via active learning, scalable parallelized processing, and automated ontology enhancement via wrapper induction (Manvi et al., 2015).

A plausible implication is that as new forms and widget types (e.g., sliders, AJAX-based fields) are introduced, the modularity of these architectures supports extensibility via minor pipeline or schema updates without wholesale redesign (Furche et al., 2012).

Through formal domain modeling, robust multi-scope field interpretation, and prioritized query expansion, ontology-driven crawlers substantially advance the state of hidden web content extraction and integration.
