- The paper introduces DART, a large-scale corpus that leverages tree ontologies to enhance data-to-text generation.
- It presents a methodology that preserves semantic dependencies between table cells by extracting connected components from annotated tree ontologies.
- Evaluation shows that pretrained models such as T5-large achieve the strongest results (50.66 BLEU), and that DART is also effective for data augmentation, underscoring its impact on robust text generation.
An Academic Overview of "DART: Open-Domain Structured Data Record to Text Generation"
The paper, "DART: Open-Domain Structured Data Record to Text Generation," introduces a comprehensive and diverse corpus that addresses current shortcomings in data-to-text generation. It proposes a framework for generating human-readable text from structured data by leveraging tree-structured ontologies extracted from open-domain data sources. The corpus, named DART, comprises over 82,000 instances, incorporating semantic triples derived from tables and other structured data forms. The authors build upon and consolidate data from different sources, enhancing the heterogeneity of the dataset.
Dataset Construction and Challenges Addressed
DART is created by merging four data sources: human-annotated tables from WikiTableQuestions and WikiSQL, together with the existing WebNLG 2017 and Cleaned E2E datasets. A novel aspect of this work is the annotation of tree ontologies over table columns, which offers a semantically richer representation than previous flat-structured datasets. A connected component extraction method ensures that semantic dependencies between data elements are preserved and that the paired natural-language descriptions remain consistent with the extracted records. This methodology surfaces the inherent structure within tabular data and enables DART to challenge existing models by demanding out-of-domain generalization.
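The following Python sketch illustrates the connected-component idea under simplifying assumptions (a single parent per column, one table row, a hypothetical ontology). It is not the authors' annotation pipeline, only a minimal rendering of how keeping the minimal connected subtree preserves the links between mentioned cells.

```python
# Simplified sketch: given a tree ontology over table columns
# (child -> parent) and the columns mentioned by a sentence, keep the
# minimal connected subtree so the extracted triples preserve semantic
# dependencies. An illustration of the idea, not the authors' code.

def connected_subtree(parent: dict[str, str | None],
                      mentioned: set[str]) -> set[str]:
    """Return mentioned columns plus the ancestors linking them to the root."""
    keep: set[str] = set()
    for col in mentioned:
        while col is not None and col not in keep:
            keep.add(col)
            col = parent[col]
    return keep

def triples_for_row(row, parent, mentioned):
    """Emit (parent_value, child_column, child_value) triples along tree edges."""
    keep = connected_subtree(parent, mentioned)
    return [
        (row[parent[c]], c, row[c])
        for c in keep
        if parent[c] is not None and parent[c] in keep
    ]

# Hypothetical ontology: TEAM is the root; SEASON and STADIUM describe TEAM.
parent = {"TEAM": None, "SEASON": "TEAM", "STADIUM": "TEAM"}
row = {"TEAM": "Hawks", "SEASON": "1973", "STADIUM": "Omni Coliseum"}
print(triples_for_row(row, parent, {"SEASON", "STADIUM"}))
# e.g. [('Hawks', 'SEASON', '1973'), ('Hawks', 'STADIUM', 'Omni Coliseum')]
```

Because TEAM is pulled in as the connecting ancestor, the triples stay anchored to the entity both mentioned cells describe, rather than being emitted as disconnected fragments.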
Evaluation and Numerical Results
The systematic evaluation demonstrates that DART sets a demanding new benchmark. The authors evaluate several state-of-the-art data-to-text models, including BART and T5, comparing their performance on DART with results on traditional datasets such as WebNLG 2017. The results show that DART demands greater generalization and adaptation from models because of its diverse, open-domain nature, with T5-large achieving the highest BLEU score of 50.66. Notably, using DART for data augmentation yields significant improvements on other datasets, particularly in BLEU and METEOR, confirming DART's value for enhancing model robustness and accuracy across domains.
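As a concrete illustration of this kind of scoring, the sketch below computes corpus-level BLEU with the sacrebleu library. The hypothesis and reference strings are placeholders, and the paper's official evaluation may use different scripts and tokenization settings.

```python
# Minimal sketch of corpus-level BLEU scoring with sacrebleu, the kind
# of automatic metric used to compare models on DART. The strings below
# are placeholders, not actual model outputs from the paper.
import sacrebleu

hypotheses = [
    "A school from Mars Hill, North Carolina, joined in 1973.",
]
# sacrebleu expects one list of references per reference stream.
references = [
    ["Mars Hill College, located in Mars Hill, North Carolina, joined in 1973."],
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```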
Implications and Future Directions
The implications of this research are manifold. Practically, DART can support natural language processing applications that depend on converting structured data into text, such as automated reporting and intelligent dialogue systems. Theoretically, it raises new questions about the interplay between the richness of a data representation and the quality of the generated text. Future research may develop architectures that better exploit the hierarchical, ontological structure present in DART.
Furthermore, the authors underscore the importance of semantic accuracy in generated text, as reflected in their human evaluations of fluency and faithfulness. This opens a line of inquiry into how semantic modeling and context awareness can be integrated into neural generation architectures.
Conclusion
In conclusion, the introduction of DART marks a clear advance in the data-to-text generation landscape. The dataset broadens the operational scope of natural language generation tasks and poses new challenges that should drive innovation in model design. As data-to-text research continues to develop, DART sets a precedent for how complexity and diversity in data representation can strengthen the text-generation capabilities of computational models.