
Unifying Structured Data as Graph for Data-to-Text Pre-Training (2401.01183v1)

Published 2 Jan 2024 in cs.CL and cs.AI

Abstract: Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proved to be powerful in enhancing D2T generation and yields impressive performances. However, previous pre-training methods either oversimplified structured data into a sequence without considering input structures or designed training objectives tailored for a specific data structure (e.g., table or knowledge graph). In this paper, we unify different types of structured data (i.e., table, key-value data, knowledge graph) into the graph format and cast different data-to-text generation tasks as graph-to-text generation. To effectively exploit the structural information of the input graph, we propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer. Concretely, we devise a position matrix for the Transformer, encoding relative positional information of connected nodes in the input graph. In addition, we propose a new attention matrix to incorporate graph structures into the original Transformer by taking the available explicit connectivity structure into account. Extensive experiments on six benchmark datasets show the effectiveness of our model. Our source codes are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/unid2t.


Summary

  • The paper presents a novel approach that unifies diverse structured data as a graph for effective data-to-text pre-training.
  • It leverages a modified Transformer with custom position and attention matrices to capture graph structures and relationships.
  • Extensive experiments on six benchmark datasets demonstrate significant improvements in BLEU and PARENT scores over existing baselines.

Unifying Structured Data as Graph for Data-to-Text Pre-Training

The task of data-to-text (D2T) generation, which transforms structured data into coherent natural language text, is a core problem in natural language processing with applications in domains such as journalism, medical reporting, and finance. Previous approaches to D2T pre-training either disregarded the structure of the input data or were tailored to a single data format (e.g., tables or knowledge graphs). This paper instead unifies the various structured data types into a graph format, casting D2T tasks as graph-to-text generation problems.

The researchers introduce UniD2T, a structure-enhanced pre-training method built on a modified Transformer architecture that better captures graph structure. Concretely, a position matrix encodes the relative positions of connected nodes, and an attention matrix injects the graph's explicit connectivity into the Transformer's self-attention.
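The exact formulation of these matrices is in the source code linked in the abstract; as a rough illustration only, the sketch below shows one way a relative-position matrix and a connectivity mask could be derived from an input graph. All function and parameter names (build_graph_matrices, max_dist, and so on) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: turn graph connectivity into the two matrices described above.
import numpy as np
from collections import deque

def build_graph_matrices(num_nodes, edges, max_dist=8):
    """Return (position, mask) matrices for a graph given as undirected edges.

    position[i][j] = shortest-path distance from node i to node j (clipped at max_dist),
    mask[i][j]     = 1 if i and j are directly connected (or i == j), else 0.
    """
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    position = np.full((num_nodes, num_nodes), max_dist, dtype=np.int64)
    mask = np.eye(num_nodes, dtype=np.int64)

    for src in range(num_nodes):
        # Breadth-first search gives shortest-path distances from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for tgt, d in dist.items():
            position[src][tgt] = min(d, max_dist)
            if d == 1:
                mask[src][tgt] = 1
    return position, mask

# Example: a 3-node path graph 0 - 1 - 2
pos, msk = build_graph_matrices(3, [(0, 1), (1, 2)])
print(pos)  # node 0 is 2 hops from node 2
print(msk)  # nodes 0 and 2 are not directly connected
```

In a structure-enhanced Transformer, matrices of this kind could bias or mask the attention scores so that attention weights reflect graph connectivity rather than a flat token order.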

Contributions and Methodology

  1. Data Unification into Graph Format: The paper addresses the challenge of handling diverse structured data by converting different types (tables, key-value pairs, knowledge graphs) into a unified graph format; a toy illustration follows this list. This graph-centric representation preserves the structural relations of the original data and allows diverse input forms to be treated consistently.
  2. Structure-Enhanced Pre-Training: Building on T5, the approach augments the Transformer so that it can encode graph structure. The position and attention matrices described above adapt self-attention to the graph's connectivity, improving the representation of the structured input.
  3. Extensive Experimental Validation: The efficacy of the UniD2T model is demonstrated through a comprehensive series of experiments conducted on six benchmark datasets representing different D2T tasks. The results indicate a notable enhancement in performance over existing baselines, confirming the effectiveness of this unification strategy.
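To make the data unification in point 1 concrete, here is a small, hypothetical sketch of how a relational table and a key-value record could both be flattened into the same (head, relation, tail) triple format; it is an illustrative reconstruction, not the preprocessing code released with the paper, and all names are invented for the example.

```python
# Hypothetical sketch: express a table and a key-value record in one graph format.

def table_to_triples(header, rows, row_label="row"):
    """Turn a relational table into triples: each cell links its row node to its column name."""
    triples = []
    for i, row in enumerate(rows):
        row_node = f"{row_label}_{i}"
        for col, cell in zip(header, row):
            triples.append((row_node, col, str(cell)))
    return triples

def keyvalue_to_triples(entity, pairs):
    """Turn key-value data about one entity into triples rooted at that entity."""
    return [(entity, key, str(value)) for key, value in pairs.items()]

# Both inputs end up as lists of (head, relation, tail) edges of one graph.
table_graph = table_to_triples(
    ["team", "wins"],
    [["Hawks", 42], ["Lakers", 39]],
)
kv_graph = keyvalue_to_triples("Marie Curie", {"field": "physics", "born": 1867})
print(table_graph + kv_graph)
```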

Numerical Results and Implications

The UniD2T model consistently outperformed strong baselines across all six datasets. Modeling the input as a graph, rather than as an oversimplified linear sequence, yielded substantial improvements in BLEU and PARENT scores, underscoring the importance of structural information in D2T tasks. The results also show the benefit of aligning both pre-training and fine-tuning with the graph-based data representation.

Future Directions

The contributions of this paper lay a foundation for further advances in D2T generation systems. Unifying structured data into a graph format not only improves model performance but also suggests avenues for exploring more sophisticated graph encoding strategies, integrating larger and more diverse pre-training corpora, and expanding the set of pre-training objectives to further refine the model's understanding and generation capabilities.

The findings support the continued exploration of unified frameworks for D2T tasks, particularly those that accommodate the structural diversity inherent in real-world data. As researchers propel these methodologies forward, subsequent investigations might focus on refining the graph construction processes and exploring additional layers of semantic understanding to enhance model performance and adaptability further.

In summary, the UniD2T model exemplifies a judicious and effective way to harness structural information within D2T tasks, offering a robust framework that not only outperforms traditional methodologies but also provides a versatile foundation for future innovation in the data-to-text generation domain.