Unifying Structured Data as Graph for Data-to-Text Pre-Training (2401.01183v1)
Abstract: Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proven powerful in enhancing D2T generation, yielding impressive performance. However, previous pre-training methods either oversimplified structured data into a sequence without considering input structures or designed training objectives tailored to a specific data structure (e.g., table or knowledge graph). In this paper, we unify different types of structured data (i.e., tables, key-value data, and knowledge graphs) into a graph format and cast different data-to-text generation tasks as graph-to-text generation. To effectively exploit the structural information of the input graph, we propose a structure-enhanced pre-training method for D2T generation built on a structure-enhanced Transformer. Concretely, we devise a position matrix for the Transformer that encodes the relative positional information of connected nodes in the input graph. In addition, we propose a new attention matrix that incorporates graph structures into the original Transformer by taking the explicit connectivity structure into account. Extensive experiments on six benchmark datasets show the effectiveness of our model. Our source code is available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/unid2t.
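To make the idea concrete, below is a minimal PyTorch sketch of how a connectivity-based attention mask and a relative-position matrix could be injected into Transformer self-attention over graph nodes. This is not the paper's implementation: the shortest-path distance bucketing, the per-head learned bias, and all names (`build_matrices`, `GraphSelfAttention`) are illustrative assumptions; see the linked repository for the actual model.

```python
# Illustrative sketch only: a structure-aware self-attention layer that
# (a) masks attention to connected node pairs and (b) adds a learned bias
# indexed by clipped shortest-path distance between nodes.
import math
import torch
import torch.nn as nn


def build_matrices(num_nodes: int, edges: list, max_dist: int = 4):
    """Return (attention mask, clipped shortest-path distance matrix)."""
    dist = torch.full((num_nodes, num_nodes), float("inf"))
    dist.fill_diagonal_(0)
    for u, v in edges:
        dist[u, v] = dist[v, u] = 1.0
    # Floyd-Warshall shortest paths (fine for small graphs).
    for k in range(num_nodes):
        dist = torch.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    mask = torch.isfinite(dist)                 # True where attention is allowed
    rel_pos = dist.clamp(max=max_dist).long()   # distance buckets 0..max_dist
    rel_pos[~mask] = max_dist                   # shared bucket for disconnected pairs
    return mask, rel_pos


class GraphSelfAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, max_dist: int = 4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned scalar bias per (relative-distance bucket, head).
        self.rel_bias = nn.Embedding(max_dist + 1, n_heads)

    def forward(self, x, mask, rel_pos):
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        scores = scores + self.rel_bias(rel_pos).permute(2, 0, 1)  # structural bias
        scores = scores.masked_fill(~mask, float("-inf"))          # keep connected pairs
        attn = scores.softmax(dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(B, N, -1))


if __name__ == "__main__":
    # Tiny triple-shaped graph: (Alice) -[born_in]-> (Paris); node 1 is the relation node.
    mask, rel_pos = build_matrices(3, [(0, 1), (1, 2)])
    layer = GraphSelfAttention()
    out = layer(torch.randn(1, 3, 64), mask, rel_pos)
    print(out.shape)  # torch.Size([1, 3, 64])
```

Restricting attention to connected node pairs and adding a learned bias per relative-distance bucket are two common ways of exposing graph structure to a standard Transformer; the paper's attention matrix and position matrix play analogous roles.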