
Self-training from Self-memory in Data-to-text Generation (2401.10567v1)

Published 19 Jan 2024 in cs.CL

Abstract: This paper introduces a novel training model, self-training from self-memory (STSM) in data-to-text generation (DTG), allowing the model to self-train on subsets, including self-memory as outputs inferred directly from the trained models and/or the new data. The quality of self-memory is validated by two models, data-to-text (D2T) and text-to-data (T2D), by two pre-defined conditions: (1) the appearance of all source values in the outputs of the D2T model and (2) the ability to convert back to source data in the outputs in the T2D model. We utilize a greedy algorithm to generate shorter D2T outputs if they contain all source values. Subsequently, we use the T2D model to confirm that these outputs can capture input relationships by demonstrating their capacity to convert text back into data. With 30% of the dataset, we can train the D2T model with a competitive performance compared to full training in the same setup. We experiment with our model on two datasets, E2E NLG and DART. STSM offers the D2T model a generalization capability from its subset memory while reducing training data volume. Ultimately, we anticipate that this paper will contribute to continual learning solutions that adapt to new training data, incorporating it as a form of self-memory in DTG tasks. The curated dataset is publicly available at: https://github.com/hoangthangta/STSM.

Summary

  • The paper introduces STSM, a novel self-training paradigm that integrates self-memory to efficiently convert structured data into natural language.
  • It employs a dual-model architecture (D2T and T2D) with a greedy algorithm to ensure all source values are accurately preserved and reconstructed.
  • Experiments on the E2E NLG and DART datasets show competitive performance with reduced data requirements, highlighting robust generalization.

Self-training from Self-memory in Data-to-text Generation

Introduction

The paper "Self-training from Self-memory in Data-to-text Generation" explores a novel training paradigm in the field of data-to-text generation (DTG). This research introduces a model known as self-training from self-memory (STSM), which enables a DTG model to self-train using subsets that include both self-memory—outputs inferred directly from previously trained models—and novel data. The methodology seeks to optimize model performance while minimizing the volume of training data required.

Methodology

Self-training Model Architecture

The STSM architecture consists of two models: data-to-text (D2T) and text-to-data (T2D). The D2T model converts structured data (e.g., tables, meaning representations, knowledge graphs) into natural language, while the T2D model performs the reverse operation, converting generated text back into structured data. The architecture couples self-memory with a greedy shortening step that keeps D2T outputs concise while preserving all source values.

Two predefined conditions validate the quality of self-memory: (1) all source values must appear in the output of the D2T model, and (2) the T2D model must be able to convert that output back into the original source data.
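
In code, these checks amount to simple value matching plus a round trip through the T2D model. The sketch below is only illustrative: `t2d_generate` and `parse_mr` are hypothetical stand-ins for the trained T2D model and its output parser, not functions from the paper's released code.

```python
# Minimal sketch of the two self-memory validity checks.
# `t2d_generate` and `parse_mr` are hypothetical stand-ins, not the paper's code.

def covers_all_values(source: dict, text: str) -> bool:
    """Condition (1): every source value appears in the D2T output."""
    return all(str(value).lower() in text.lower() for value in source.values())

def round_trips(source: dict, text: str, t2d_generate, parse_mr) -> bool:
    """Condition (2): the T2D model can convert the text back into the source data."""
    reconstructed = parse_mr(t2d_generate(text))  # e.g. "name[Aromi], food[Chinese]" -> dict
    return reconstructed == source

def is_valid_self_memory(source: dict, text: str, t2d_generate, parse_mr) -> bool:
    return covers_all_values(source, text) and round_trips(source, text, t2d_generate, parse_mr)
```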

Training Process

Training proceeds by greedily generating shorter D2T outputs that still contain all source values; the T2D model then validates these outputs by confirming that the input relationships are preserved. With only 30% of the dataset, the D2T model achieves performance comparable to full-data training, mitigating the risk of catastrophic forgetting typically observed in standard retraining.
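
The paper states only that a greedy algorithm yields shorter outputs that still contain all source values; one plausible reading is sentence-level pruning, sketched below under that assumption. This is not the authors' exact procedure.

```python
# Illustrative greedy shortening: drop sentences while all source values stay covered.
# One plausible reading of the paper's greedy step, not the authors' exact algorithm.

def greedy_shorten(source: dict, text: str) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = list(sentences)
    for sentence in sentences:
        candidate = [s for s in kept if s != sentence]
        shortened = ". ".join(candidate) + "."
        # Accept the removal only if every source value is still present.
        if candidate and all(str(v).lower() in shortened.lower() for v in source.values()):
            kept = candidate
    return ". ".join(kept) + "."
```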

STSM also employs two data-management strategies, reusing a fixed subset or drawing a random subset in each epoch, which enables diverse self-training regimes without relying on a large memory or the full training set.
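
A minimal sketch of the two subset regimes follows, assuming the 30% ratio reported in the paper; the sampling function itself is illustrative rather than the authors' implementation.

```python
import random

def epoch_subset(dataset: list, ratio: float = 0.3, fixed: bool = True, seed: int = 42) -> list:
    """Return one epoch's training subset.

    fixed=True reuses the same subset every epoch (same seed);
    fixed=False draws a fresh random subset on each call.
    """
    k = int(len(dataset) * ratio)
    rng = random.Random(seed) if fixed else random.Random()
    return rng.sample(dataset, k)
```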

Experimental Results

Datasets

Experiments were conducted on the E2E NLG and DART datasets. E2E NLG targets restaurant-domain text generation from meaning representations (MRs), while DART covers more diverse domains with triplesets as inputs. Linearization transforms MRs or triplesets into strings that sequence-to-sequence models can consume, as illustrated below.
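
For concreteness, the snippet below shows one common way to flatten an E2E-style MR and a DART-style tripleset into input strings; the exact separators and templates used in the paper may differ.

```python
# Illustrative linearization; the paper's exact separators/templates may differ.

def linearize_mr(mr: dict) -> str:
    """E2E-style MR: {"name": "Aromi", "food": "Chinese"} -> "name[Aromi], food[Chinese]"."""
    return ", ".join(f"{slot}[{value}]" for slot, value in mr.items())

def linearize_triples(triples: list) -> str:
    """DART-style tripleset -> "<H> subj <R> rel <T> obj | ..." string."""
    return " | ".join(f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)

print(linearize_mr({"name": "Aromi", "food": "Chinese", "area": "city centre"}))
print(linearize_triples([("Aromi", "food", "Chinese"), ("Aromi", "area", "city centre")]))
```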

Performance Metrics

The effectiveness of the STSM model was measured with standard metrics such as BLEU, METEOR, and TER. Despite training on smaller subsets (30% of the data per epoch), output quality remained competitive with conventional full-data training approaches.
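
As an example of how such scores are typically computed, the snippet below uses the sacrebleu package for corpus-level BLEU; METEOR and TER require their own tooling and are omitted here. The hypothesis and reference strings are made up for illustration.

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu); strings are illustrative only.
import sacrebleu

hypotheses = ["Aromi is a Chinese restaurant in the city centre."]
references = [["Aromi serves Chinese food and is located in the city centre."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```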

Comparisons and Outcomes

The STSM model demonstrated robust generalization with reduced training data. Fixed data subsets yielded slightly better metric values than random subsets, suggesting greater consistency in retaining learned knowledge. Among the self-memory strategies evaluated, combining self-memory with new data gave the best results, underscoring STSM's ability to cut data requirements while maintaining performance.

Implications and Future Work

STSM offers a pathway to efficient self-training that can adapt to evolving datasets and continual learning scenarios, potentially reducing the computational demands associated with large pre-trained models. Future directions include integrating larger pre-trained models (e.g., BART-large, Llama) and extending evaluation to additional datasets such as WebNLG and ToTTo. Finding the optimal balance between self-memory and new data remains an open question that could further refine self-training techniques.

Conclusion

The STSM framework provides an innovative solution to data-to-text generation by effectively combining self-memory with new data to reduce training data volume significantly. This self-training methodology achieves competitive performance without compromising output quality, positioning it as a potent tool for future advancements in natural language generation. Through continuous refinement and expanded evaluation, the STSM model promises to contribute substantially to efficient and scalable AI systems.
