Text-to-Table: A New Way of Information Extraction (2109.02707v2)

Published 6 Sep 2021 in cs.CL

Abstract: We study a new problem setting of information extraction (IE), referred to as text-to-table. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We consider text-to-table as an inverse problem of the well-studied table-to-text, and make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods of using relation extraction and named entity extraction. The results also show that our method can further boost the performances of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data are available at https://github.com/shirley-wu/text_to_table.

Summary

The paper presents a novel task that reframes information extraction as generating structured tables from text using a seq2seq model.
It integrates table constraints and relation embeddings to ensure consistent table structure and effective header alignment.
Experimental results on four datasets show that it outperforms baselines built on relation extraction and named entity extraction.

An Overview of "Text-to-Table: A New Way of Information Extraction"

In the paper "Text-to-Table: A New Way of Information Extraction," Wu, Zhang, and Li propose a new information extraction (IE) task: generating structured tables from unstructured text. They formalize the task as a sequence-to-sequence (seq2seq) problem and address it with a seq2seq model fine-tuned from a pre-trained language model, supplemented by two additional techniques, table constraint and table relation embeddings, that improve table generation performance.

Problem Setting and Methodology

The text-to-table task differs significantly from traditional IE approaches like named entity recognition (NER) or relation extraction (RE). In text-to-table, tables are generated directly from texts without requiring predefined schemas. This data-driven approach permits the extraction of complex structured information, such as tables with multiple columns and rows, directly from long documents, circumventing manual schema annotations.

The authors view text-to-table as the inverse of the table-to-text task, which converts structured data into descriptive text. Text-to-table is useful wherever the main content of a document is best summarized or represented as a table, as with sports score reports and Wikipedia biographies.

To implement text-to-table, the authors apply a seq2seq model, a framework widely used in NLP for tasks such as machine translation and text summarization. The model reads the input text and generates a sequence representation of the table, in which the sequence encodes both the schema and the cell contents. The proposed seq2seq model builds on BART, a pre-trained language model well suited to seq2seq tasks.
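To make the idea of a sequence representation concrete, the following is a minimal sketch of how a table could be flattened into a single target string and recovered again. The separator strings `<cell>` and `<row>`, the function names, and the sample table are hypothetical placeholders chosen for illustration; they are not taken from the paper or its code.

```python
from typing import List

Table = List[List[str]]

# Hypothetical separator strings; a real system would pick tokens that
# its tokenizer treats as special and that never occur inside a cell.
CELL_SEP = "<cell>"
ROW_SEP = "<row>"


def linearize(table: Table) -> str:
    """Flatten a table row by row into one target sequence."""
    rows = [f" {CELL_SEP} ".join(cell.strip() for cell in row) for row in table]
    return f" {ROW_SEP} ".join(rows)


def delinearize(sequence: str) -> Table:
    """Recover rows and cells from a generated sequence."""
    return [
        [cell.strip() for cell in row.split(CELL_SEP)]
        for row in sequence.split(ROW_SEP)
    ]


# Made-up example table: the first row holds column headers,
# the first column holds row headers.
table = [
    ["", "Points", "Rebounds"],
    ["Alice", "21", "7"],
    ["Bob", "14", "11"],
]
sequence = linearize(table)
print(sequence)
print(delinearize(sequence) == table)
```

Because headers are emitted as ordinary cells, the schema is generated together with the content, which is what lets the approach work without a predefined schema. The round trip only holds as long as no cell contains a separator string.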

Advances and Techniques

The paper presents two techniques that address key challenges in table generation: table constraint and table relation embeddings. Table constraint ensures that every row of a generated table has the same number of cells, which keeps the output well-formed. Table relation embeddings encode the alignment between cells and their row and column headers in the attention mechanism of the seq2seq model, giving the model this relational information directly and promoting well-structured tables.
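The table constraint can be pictured as a rule applied at each decoding step. The sketch below is one possible reading under stated assumptions: the first generated row fixes the table width, and later rows may not end early or grow beyond that width. The token strings `<cell>`, `<row>`, and `<eos>` and both function names are hypothetical, and the sketch is not the authors' implementation.

```python
from typing import List, Set

# Hypothetical special tokens; the actual vocabulary depends on the tokenizer.
CELL_SEP = "<cell>"
ROW_SEP = "<row>"
EOS = "<eos>"


def split_rows(tokens: List[str]) -> List[List[str]]:
    """Group the generated tokens into rows, splitting on ROW_SEP."""
    rows: List[List[str]] = [[]]
    for tok in tokens:
        if tok == ROW_SEP:
            rows.append([])
        else:
            rows[-1].append(tok)
    return rows


def banned_next_tokens(prefix: List[str]) -> Set[str]:
    """Return the special tokens that must not be generated after `prefix`."""
    rows = split_rows(prefix)
    if len(rows) == 1:
        # The first row is unconstrained; its length fixes the table width.
        return set()
    width = rows[0].count(CELL_SEP) + 1
    cells_so_far = rows[-1].count(CELL_SEP) + 1
    if cells_so_far < width:
        # The current row is still too short, so it may not end yet.
        return {ROW_SEP, EOS}
    # The current row already has `width` cells, so no new cell may start.
    return {CELL_SEP}


prefix = ["Name", CELL_SEP, "Points", ROW_SEP, "Alice"]
print(banned_next_tokens(prefix))                     # second row has 1 of 2 cells
print(banned_next_tokens(prefix + [CELL_SEP, "21"]))  # second row is full
```

In a real decoder, the scores of the banned tokens would be set to negative infinity before the next token is chosen, so greedy or beam search can only produce rectangular tables.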

Experimental Results

Experiments on four datasets (Rotowire, E2E, WikiTableText, and WikiBio) show that the proposed approach outperforms conventional methods that rely on predefined IE schemas. The vanilla seq2seq model already outperformed baseline models using sentence-level and document-level RE and NER. Adding table constraint and table relation embeddings improved performance further, especially on datasets with complex tabular structures such as Rotowire.
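This overview does not state how generated tables were scored. As an illustration only, one plausible way to compare a predicted table with a gold table is an exact-match F1 over (row header, column header, value) triples for the non-header cells. The function names, the triple-based matching, and the sample tables below are assumptions made for this sketch, not a description of the paper's evaluation protocol.

```python
from collections import Counter
from typing import List

Table = List[List[str]]


def cell_triples(table: Table) -> Counter:
    """Collect (row header, column header, value) for every non-header cell."""
    triples: Counter = Counter()
    if not table:
        return triples
    header = table[0]
    for row in table[1:]:
        if not row:
            continue
        # zip stops at the shorter side, so ragged rows are tolerated.
        for column_name, value in zip(header[1:], row[1:]):
            triples[(row[0], column_name, value)] += 1
    return triples


def cell_f1(pred: Table, gold: Table) -> float:
    """Exact-match F1 between the cell triples of two tables."""
    p, g = cell_triples(pred), cell_triples(gold)
    overlap = sum((p & g).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)


# Made-up tables: the prediction gets one of four cells wrong.
gold = [["", "Points", "Rebounds"], ["Alice", "21", "7"], ["Bob", "14", "11"]]
pred = [["", "Points", "Rebounds"], ["Alice", "21", "8"], ["Bob", "14", "11"]]
print(cell_f1(pred, gold))
```

Tying each value to its headers means a correct number placed in the wrong row or column does not count as a match, which is stricter than comparing cell values alone.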

Results indicate that the approach adapts well across different datasets, underscoring the method's versatility in capturing structured information without explicit schema definitions. Table constraint was particularly beneficial on datasets with large tables, where it ensured well-structured outputs, while table relation embeddings improved the alignment of cells with their rows and columns.

Implications and Future Directions

The introduction of text-to-table has promising implications for both theory and practice in IE, challenging existing paradigms with schema-free extraction of structured data. As the methodology evolves, larger and more sophisticated pre-trained language models may further extend text-to-table applications, particularly in domains that require capturing intricate relationships among entities.

As the authors highlight, challenges remain around text diversity, redundancy, open-domain knowledge incorporation, and reasoning. These areas suggest directions for further refinement of seq2seq modeling and table generation techniques. Improving the effectiveness and robustness of text-to-table could substantially enhance automation in analytics, summarization, and knowledge representation.

In summary, this paper marks a significant step in rethinking information extraction through novel modeling techniques. The approach is promising, and future work will likely focus on addressing current limitations in text representation and reasoning while exploring auxiliary data sources and enhanced model designs.
