FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents (2010.10755v1)

Published 21 Oct 2020 in cs.CL and cs.IR

Abstract: Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully hand-crafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer-range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without requiring features over rendered pages or expensive hand-crafted features.
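
To make "combining both the text and markup information" for each DOM node concrete, here is a minimal sketch that pairs every text node in a page with the tag path leading to it. It is only an illustration of the kind of per-node input such a model could consume, not FreeDOM's actual feature extraction: the names TextNodeCollector and collect_text_nodes, the VOID_TAGS list, and the sample movie page are all assumptions introduced here, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical helper set: tags that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collect each non-empty text node with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root to the current position
        self.nodes = []  # (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def collect_text_nodes(html):
    """Return (tag_path, text) pairs for every text node in the HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Illustrative page from a "movies" vertical (made up for this sketch).
page = (
    "<html><body><h1>Inception</h1>"
    "<div><span>Director</span><span>Christopher Nolan</span></div>"
    "</body></html>"
)
nodes = collect_text_nodes(page)
for path, text in nodes:
    print(path, "|", text)
```

In this toy page the label "Director" and its value share the same tag path, so the text and markup of a single node cannot tell them apart on their own. That is the sort of case where looking at relations between nodes, as the second stage in the abstract is described as doing, would plausibly help.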


Authors (4)

Bill Yuchen Lin
Ying Sheng
Nguyen Vo
Sandeep Tata

