Linguistic Diversity Impact from Training LLMs on Synthetic Text
The paper "The Curious Decline of Linguistic Diversity: Training LLMs on Synthetic Text" presents a focused analysis on the implications of employing synthetic data for training LLMs, scrutinizing this practice through the lens of linguistic diversity. This work explores the gradual degradation of linguistic diversity—encompassing lexical, syntactic, and semantic dimensions—when LLMs are recursively trained with synthetic text generated by their predecessors. This paper shifts away from traditional model performance metrics, concentrating instead on the broader linguistic implications.
Methodology
The authors simulate a recursive training process using synthetic data, a scenario likely to become more prevalent as the demand for massive data volumes in machine learning grows. Given the finite supply of high-quality natural text, synthetic text emerges as a solution despite its potential drawbacks. The paper applies a recursive fine-tuning method across three natural language generation (NLG) tasks that vary in how tightly they constrain creativity: news summarization, scientific abstract generation, and story generation. This setup lets the authors systematically observe changes in linguistic output over multiple training iterations, with each iteration trained on synthetic data generated by earlier models in the sequence.
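The recursive setup can be pictured with a short sketch. The `fine_tune` and `generate_synthetic` helpers below are placeholder stubs standing in for a real training pipeline, not the authors' code, and details such as whether each generation restarts from the base checkpoint are glossed over.

```python
# Illustrative sketch of recursive fine-tuning on self-generated text.
# fine_tune() and generate_synthetic() are placeholder stubs, not the
# paper's actual training or generation code.

def fine_tune(model, corpus):
    """Placeholder: fine-tune `model` on `corpus` and return the result."""
    return {"base": model, "trained_on_docs": len(corpus)}

def generate_synthetic(model, prompts):
    """Placeholder: produce one synthetic document per prompt."""
    return [f"synthetic text for: {p}" for p in prompts]

def recursive_training(base_model, human_corpus, prompts, generations=5):
    """Each generation is trained on text produced by the previous one."""
    corpus, models = human_corpus, []
    for _ in range(generations):
        model = fine_tune(base_model, corpus)        # train on current corpus
        corpus = generate_synthetic(model, prompts)  # replace it with model output
        models.append(model)
    return models

models = recursive_training("base-llm", ["a human-written article"], ["summarize the news"])
```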
Metrics are introduced to quantify linguistic diversity at several levels. Lexical diversity is measured with the Type-Token Ratio (TTR) and Distinct-n at different n-gram levels, while Self-BLEU gauges repetitive patterns in generation. Semantic diversity is assessed with sentence embeddings, using pairwise cosine distances between them. Syntactic diversity is evaluated over dependency parse trees with the Weisfeiler-Lehman graph kernel, giving a quantitative measure of syntactic variance.
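As a rough illustration, the lexical and semantic measures can be sketched as follows. The whitespace tokenization, the random stand-in embeddings, and the function names are assumptions for illustration, not the paper's implementation; Self-BLEU and the Weisfeiler-Lehman kernel would require additional libraries (e.g. nltk and a graph-kernel package) and are omitted here.

```python
# Sketch of the lexical and semantic diversity measures described above.
# Tokenization is naive whitespace splitting; embeddings are random
# stand-ins rather than real sentence-encoder output.
from itertools import combinations

import numpy as np


def type_token_ratio(tokens):
    """TTR: number of unique token types divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def distinct_n(tokens, n):
    """Distinct-n: unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)


def mean_pairwise_cosine_distance(embeddings):
    """Average (1 - cosine similarity) over all pairs of sentence embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = [1.0 - float(u @ v) for u, v in combinations(normed, 2)]
    return sum(dists) / len(dists)


tokens = "the model repeats the same phrase and the same phrase again".split()
print(type_token_ratio(tokens), distinct_n(tokens, 2))
print(mean_pairwise_cosine_distance(np.random.rand(5, 384)))  # stand-in embeddings
```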
Key Findings
The results consistently show a decline in linguistic diversity across all three tasks over successive iterations, even though each generation uses the same model architecture and base data. The degradation is most pronounced in high-entropy tasks such as story generation, where greater creative freedom yields higher initial diversity and therefore a larger drop under recursive synthesis. Lexical diversity also diminishes in the lower-entropy tasks, showing that task constraints shape generation but do not prevent diversity loss. Notably, syntactic diversity decreases significantly as well, suggesting that recursive training particularly erodes structural facets of language.
Implications and Future Directions
This research underscores a potentially concerning trajectory: while synthetic data can mitigate data scarcity, it risks eroding linguistic variation over time, particularly within recursive training frameworks. The findings point to an urgent need for methodologies that balance data-volume requirements with the preservation of linguistic richness. Beyond metrics of task accuracy and fluency, future investigations should incorporate comprehensive linguistic diversity assessments to better understand how training practices shape LLM capabilities. The paper also highlights the need for diverse, high-quality natural datasets, possibly supplemented by strategies such as data augmentation or adversarial training, to counteract the limitations of synthetic data.
In the broader context of AI development, these insights prompt a re-evaluation of training paradigms in pursuit of balancing scale with the multifaceted dimensions of language. This work serves as an impetus for future research to strategically align model objectives with linguistic integrity, ensuring LLMs evolve with robust and varied language understanding capabilities.