
CycleGT: Unsupervised Graph-to-Text and Text-to-Graph Generation via Cycle Training (2006.04702v3)

Published 8 Jun 2020 in cs.CL, cs.AI, and cs.LG

Abstract: Two important tasks at the intersection of knowledge graphs and natural language processing are graph-to-text (G2T) and text-to-graph (T2G) conversion. Due to the difficulty and high cost of data collection, the supervised data available in the two fields are usually on the magnitude of tens of thousands, for example, 18K in the WebNLG~2017 dataset after preprocessing, which is far fewer than the millions of data for other tasks such as machine translation. Consequently, deep learning models for G2T and T2G suffer largely from scarce training data. We present CycleGT, an unsupervised training method that can bootstrap from fully non-parallel graph and text data, and iteratively back translate between the two forms. Experiments on WebNLG datasets show that our unsupervised model trained on the same number of data achieves performance on par with several fully supervised models. Further experiments on the non-parallel GenWiki dataset verify that our method performs the best among unsupervised baselines. This validates our framework as an effective approach to overcome the data scarcity problem in the fields of G2T and T2G. Our code is available at https://github.com/QipengGuo/CycleGT.

Unsupervised Graph-to-Text and Text-to-Graph Generation via Cycle Training

The research paper by Qipeng Guo et al. presents CycleGT, a model that addresses a central challenge in natural language processing: the conversion between knowledge graphs and textual descriptions, specifically the Graph-to-Text (G2T) and Text-to-Graph (T2G) tasks. The key contribution is an unsupervised training method, termed cycle training, which mitigates the data scarcity that hampers both tasks.

Problem Context

Knowledge graphs serve as a robust mechanism for knowledge representation, extensively applied across NLP applications. G2T tasks aim to translate structured information from knowledge graphs into coherent textual descriptions, while T2G tasks extract structured relational graphs from textual data. Both tasks are crucial yet hindered by the scarcity of parallel corpora, which are costly to annotate. Current datasets, such as WebNLG with approximately 18K text-graph pairs, are significantly smaller than the datasets used for tasks like neural machine translation (NMT).
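To make the two tasks concrete, here is a toy example of the data format involved (invented for illustration; it is not taken from any dataset):

```python
# A knowledge graph is a set of (subject, relation, object) triples.
triples = [
    ("Ada_Lovelace", "birthPlace", "London"),
    ("Ada_Lovelace", "field", "Mathematics"),
]

# G2T maps the triples to a fluent description; T2G recovers them from text.
text = "Ada Lovelace, a mathematician, was born in London."
```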

Methodological Framework

This research formulates G2T and T2G as cycle-training problems, leveraging the fact that the two tasks are mutual inverses. The core proposition is a framework, cycle training, that learns both transformations from fully non-parallel text and graph data by iteratively back-translating between the two forms. It has two components:

  • G2T Component: Utilizes pretrained models like T5 to generate text sequences from linearized graph sequences (see the sketch after this list).
  • T2G Component: Implements a BiLSTM framework augmented with a multi-label classifier to infer relationships between extracted entities, thereby constructing knowledge graph triples from text.
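As an illustration of the G2T interface, the sketch below linearizes a graph and feeds it to a pretrained T5 model via the Hugging Face transformers API. The `<H>/<R>/<T>` markers and the untuned `t5-base` checkpoint are illustrative assumptions, not necessarily the paper's exact setup:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

def linearize(triples):
    # Flatten (subject, relation, object) triples into one input string.
    # The <H>/<R>/<T> markers are an illustrative convention, not
    # necessarily the paper's exact linearization scheme.
    return " ".join(f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

triples = [("Ada_Lovelace", "birthPlace", "London")]
inputs = tokenizer(linearize(triples), return_tensors="pt")

# An untuned checkpoint will not produce good descriptions; in CycleGT
# the G2T model is trained via cycle consistency before generation.
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```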

The cycle training framework is realized through iterative back-translation: each model is optimized on pseudo-parallel pairs produced by the other, with cycle-consistency losses enforcing that a graph translated to text and back (or vice versa) reconstructs the original input. This emulates a supervised setting in which non-parallel data serves as a proxy for learning both transformations.
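One training step of this back-translation scheme might look as follows. This is a schematic sketch: `g2t_model`, `t2g_model`, and their `predict`/`loss` methods are placeholder interfaces, not the paper's actual code:

```python
def cycle_train_step(g2t_model, t2g_model, graph_batch, text_batch):
    # Text -> Graph -> Text: predict graphs for unlabeled text (treated
    # as fixed pseudo-sources), then train G2T to reconstruct the text.
    pseudo_graphs = t2g_model.predict(text_batch)   # no gradient flows here
    g2t_loss = g2t_model.loss(inputs=pseudo_graphs, targets=text_batch)

    # Graph -> Text -> Graph: the symmetric direction for unlabeled graphs.
    pseudo_texts = g2t_model.predict(graph_batch)   # no gradient flows here
    t2g_loss = t2g_model.loss(inputs=pseudo_texts, targets=graph_batch)

    # Each model is updated only through its own reconstruction loss.
    return g2t_loss + t2g_loss
```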

Experimental Evaluation

The model is evaluated on the WebNLG 2017, WebNLG 2020, and GenWiki datasets. Results show that the unsupervised approach achieves near-parity with several fully supervised models on these benchmarks. Specifically, on WebNLG 2017 the model reaches a BLEU score of 55.5, closely approaching the performance of an in-domain supervised model. Performance on the GenWiki dataset, which lacks direct parallel pairings, further demonstrates the method's advantage over existing unsupervised models, with improvements of over 10 BLEU points across various dataset configurations.
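For readers reproducing such numbers, corpus-level BLEU can be computed with the sacrebleu library. This is a generic usage sketch; the paper's exact evaluation scripts and tokenization may differ:

```python
import sacrebleu

hypotheses = ["Ada Lovelace was born in London."]          # model outputs
references = [["Ada Lovelace, a mathematician, was born in London."]]

# references is a list of reference streams, each aligned with hypotheses.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```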

Implications and Future Directions

This research provides compelling evidence for the potential of unsupervised models in tasks traditionally dependent on large supervised datasets. The cycle training framework holds promise for scalability and wider applicability across domains where labeled data is sparse.

Furthermore, the paper underscores potential advancements in unsupervised learning paradigms, hinting at future research trajectories involving more sophisticated cycle frameworks or integration with other forms of domain adaptation techniques. As the field progresses, the merging of unsupervised techniques and pretrained models could redefine the boundaries of what can be achieved in resource-constrained environments.

In conclusion, the paper by Guo et al. contributes significantly to the NLP community's efforts to overcome data constraints, offering a robust framework that brings unsupervised T2G and G2T performance on par with, if not beyond, that of traditional supervised methods.

Authors (6)
  1. Qipeng Guo (72 papers)
  2. Zhijing Jin (68 papers)
  3. Xipeng Qiu (257 papers)
  4. Weinan Zhang (322 papers)
  5. David Wipf (59 papers)
  6. Zheng Zhang (486 papers)
Citations (59)