
Self-training from Self-memory in Data-to-text Generation (2401.10567v1)

Published 19 Jan 2024 in cs.CL

Abstract: This paper introduces a novel training model, self-training from self-memory (STSM) in data-to-text generation (DTG), allowing the model to self-train on subsets, including self-memory as outputs inferred directly from the trained models and/or the new data. The quality of self-memory is validated by two models, data-to-text (D2T) and text-to-data (T2D), by two pre-defined conditions: (1) the appearance of all source values in the outputs of the D2T model and (2) the ability to convert back to source data in the outputs in the T2D model. We utilize a greedy algorithm to generate shorter D2T outputs if they contain all source values. Subsequently, we use the T2D model to confirm that these outputs can capture input relationships by demonstrating their capacity to convert text back into data. With 30% of the dataset, we can train the D2T model with a competitive performance compared to full training in the same setup. We experiment with our model on two datasets, E2E NLG and DART. STSM offers the D2T model a generalization capability from its subset memory while reducing training data volume. Ultimately, we anticipate that this paper will contribute to continual learning solutions that adapt to new training data, incorporating it as a form of self-memory in DTG tasks. The curated dataset is publicly available at: https://github.com/hoangthangta/STSM.

Summary

  • The paper introduces STSM, a novel self-training paradigm that integrates self-memory to efficiently convert structured data into natural language.
  • It employs a dual-model architecture (D2T and T2D) with a greedy algorithm to ensure all source values are accurately preserved and reconstructed.
  • Experiments on the E2E NLG and DART datasets show competitive performance with reduced data requirements, highlighting robust generalization.

Self-training from Self-memory in Data-to-text Generation

Introduction

The paper "Self-training from Self-memory in Data-to-text Generation" explores a novel training paradigm in the field of data-to-text generation (DTG). This research introduces a model known as self-training from self-memory (STSM), which enables a DTG model to self-train using subsets that include both self-memory—outputs inferred directly from previously trained models—and novel data. The methodology seeks to optimize model performance while minimizing the volume of training data required.

Methodology

Self-training Model Architecture

The STSM architecture consists of two models: data-to-text (D2T) and text-to-data (T2D). The D2T model converts structured data (e.g., tables, meaning representations, knowledge graphs) into natural language, while the T2D model performs the reverse operation, converting generated text back into structured data. The architecture couples self-memory with a greedy shortening step that keeps D2T outputs concise while preserving all source values.

Two predefined conditions validate the quality of self-memory: (1) all source values must appear in the output of the D2T model, and (2) the T2D model must be able to convert that output back into the original source data.
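
In code, these checks amount to simple value matching plus a round trip through the T2D model. The sketch below is only illustrative: `t2d_generate` and `parse_mr` are hypothetical stand-ins for the trained T2D model and its output parser, not functions from the paper's released code.

```python
# Minimal sketch of the two self-memory validity checks.
# `t2d_generate` and `parse_mr` are hypothetical stand-ins, not the paper's code.

def covers_all_values(source: dict, text: str) -> bool:
    """Condition (1): every source value appears in the D2T output."""
    return all(str(value).lower() in text.lower() for value in source.values())

def round_trips(source: dict, text: str, t2d_generate, parse_mr) -> bool:
    """Condition (2): the T2D model can convert the text back into the source data."""
    reconstructed = parse_mr(t2d_generate(text))  # e.g. "name[Aromi], food[Chinese]" -> dict
    return reconstructed == source

def is_valid_self_memory(source: dict, text: str, t2d_generate, parse_mr) -> bool:
    return covers_all_values(source, text) and round_trips(source, text, t2d_generate, parse_mr)
```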

Training Process

Training proceeds by greedily generating shorter D2T outputs that still contain all source values; the T2D model then validates these outputs by confirming that the input relationships are preserved. With only 30% of the dataset, the D2T model achieves performance comparable to full-data training, mitigating the risk of catastrophic forgetting typically observed in standard retraining.
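
The paper states only that a greedy algorithm yields shorter outputs that still contain all source values; one plausible reading is sentence-level pruning, sketched below under that assumption. This is not the authors' exact procedure.

```python
# Illustrative greedy shortening: drop sentences while all source values stay covered.
# One plausible reading of the paper's greedy step, not the authors' exact algorithm.

def greedy_shorten(source: dict, text: str) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = list(sentences)
    for sentence in sentences:
        candidate = [s for s in kept if s != sentence]
        shortened = ". ".join(candidate) + "."
        # Accept the removal only if every source value is still present.
        if candidate and all(str(v).lower() in shortened.lower() for v in source.values()):
            kept = candidate
    return ". ".join(kept) + "."
```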

STSM also employs two data-management strategies, reusing a fixed subset or drawing a random subset in each epoch, which enables diverse self-training regimes without relying on a large memory or the full training set.
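
A minimal sketch of the two subset regimes follows, assuming the 30% ratio reported in the paper; the sampling function itself is illustrative rather than the authors' implementation.

```python
import random

def epoch_subset(dataset: list, ratio: float = 0.3, fixed: bool = True, seed: int = 42) -> list:
    """Return one epoch's training subset.

    fixed=True reuses the same subset every epoch (same seed);
    fixed=False draws a fresh random subset on each call.
    """
    k = int(len(dataset) * ratio)
    rng = random.Random(seed) if fixed else random.Random()
    return rng.sample(dataset, k)
```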

Experimental Results

Datasets

Experiments were conducted on the E2E NLG and DART datasets. E2E NLG targets restaurant-domain text generation from meaning representations (MRs), while DART covers more diverse domains with triplesets as inputs. Linearization transforms MRs or triplesets into strings that sequence-to-sequence models can consume, as illustrated below.
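
For concreteness, the snippet below shows one common way to flatten an E2E-style MR and a DART-style tripleset into input strings; the exact separators and templates used in the paper may differ.

```python
# Illustrative linearization; the paper's exact separators/templates may differ.

def linearize_mr(mr: dict) -> str:
    """E2E-style MR: {"name": "Aromi", "food": "Chinese"} -> "name[Aromi], food[Chinese]"."""
    return ", ".join(f"{slot}[{value}]" for slot, value in mr.items())

def linearize_triples(triples: list) -> str:
    """DART-style tripleset -> "<H> subj <R> rel <T> obj | ..." string."""
    return " | ".join(f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)

print(linearize_mr({"name": "Aromi", "food": "Chinese", "area": "city centre"}))
print(linearize_triples([("Aromi", "food", "Chinese"), ("Aromi", "area", "city centre")]))
```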

Performance Metrics

The effectiveness of the STSM model was measured with standard metrics such as BLEU, METEOR, and TER. Despite training on smaller subsets (30% of the data per epoch), output quality remained competitive with conventional full-data training approaches.
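
As an example of how such scores are typically computed, the snippet below uses the sacrebleu package for corpus-level BLEU; METEOR and TER require their own tooling and are omitted here. The hypothesis and reference strings are made up for illustration.

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu); strings are illustrative only.
import sacrebleu

hypotheses = ["Aromi is a Chinese restaurant in the city centre."]
references = [["Aromi serves Chinese food and is located in the city centre."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```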

Comparisons and Outcomes

The STSM model demonstrated robust generalization with reduced training data. Fixed data subsets yielded slightly better metric values than random subsets, suggesting greater consistency in retaining learned knowledge. Among the self-memory strategies evaluated, combining self-memory with new data gave the best results, underscoring STSM's ability to cut data requirements while maintaining performance.

Implications and Future Work

STSM offers a pathway to efficient self-training that can adapt to evolving datasets and continual learning scenarios, potentially reducing the computational demands associated with large pre-trained models. Future directions include integrating larger pre-trained models (e.g., BART-large, Llama) and extending evaluation to additional datasets such as WebNLG and ToTTo. Finding the optimal balance between self-memory and new data remains an open question that could further refine self-training techniques.

Conclusion

The STSM framework provides an innovative solution to data-to-text generation by effectively combining self-memory with new data to reduce training data volume significantly. This self-training methodology achieves competitive performance without compromising output quality, positioning it as a potent tool for future advancements in natural language generation. Through continuous refinement and expanded evaluation, the STSM model promises to contribute substantially to efficient and scalable AI systems.
