
LMTransplant: LLM-Driven Text Augmentation

Updated 22 August 2025
  • LMTransplant is a prompt-driven paradigm that employs a two-phase transplant-then-regenerate process to create semantically enriched, diverse text variants while preserving core attributes.
  • It leverages LLMs' knowledge emergence by embedding seed text into contextually expanded passages before regenerating enhanced content, outperforming traditional lexical-only augmentation methods.
  • Empirical evaluations show higher lexical diversity and improved downstream task metrics, underlining its scalability and effectiveness in content-level augmentation.

LMTransplant denotes a prompt-driven, large-language-model (LLM)-centric paradigm for text data augmentation whose foundation is a two-stage "transplant-then-regenerate" process. Unlike traditional augmentation (e.g., back-translation or lexical rephrasing) that only weakly perturbs the original text, LMTransplant first embeds ("transplants") the seed text into a contextually expanded passage generated by an LLM, then prompts the LLM to regenerate a new variant of the original portion using that enriched context. This mechanism exploits both the knowledge emergence and generative capacity of LLMs, enabling the creation of diverse, creative, and semantically consistent content-level variants that preserve core attributes essential for downstream tasks. LMTransplant demonstrates robust performance and pronounced scalability compared to prior methods (Wang et al., 20 Aug 2025).

1. Transplant-Then-Regenerate Process

The LMTransplant workflow is structured into two sequential phases:

Transplant Phase

The seed text is treated as a fragment embedded in a larger passage. The LLM is prompted to generate bidirectional context:

  • Forward continuation: generates a natural sentence that follows the seed text.
  • Backward continuation: generates a preceding sentence that sets up the seed text.

The resulting prompt comprises:

    Preceding Sentence: [generated]
    Original Text: [seed]
    Subsequent Sentence: [generated]

This yields a contextually expanded passage, with the seed text now situated among logically coherent adjacent content.
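The transplant phase above can be sketched as plain prompt construction; `call_llm` is a hypothetical stand-in for any chat-completion API, and the prompt wording is illustrative rather than the paper's exact template:

```python
# Sketch of the transplant phase: wrap a seed text with LLM-generated
# bidirectional context. `call_llm` is a hypothetical placeholder for any
# chat-completion API; the prompts are illustrative assumptions.

def call_llm(prompt: str) -> str:
    # Placeholder: route `prompt` to your LLM of choice and return its reply.
    raise NotImplementedError

def transplant(seed: str, llm=call_llm) -> str:
    """Return a contextually expanded passage: [preceding] seed [subsequent]."""
    preceding = llm(
        f"Write one natural sentence that could immediately precede this text:\n{seed}"
    )
    subsequent = llm(
        f"Write one natural sentence that could immediately follow this text:\n{seed}"
    )
    return f"{preceding}\n{seed}\n{subsequent}"
```

Passing the LLM callable as a parameter keeps the sketch testable with a stub and agnostic to any particular provider.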

Regeneration Phase

The seed text, within its new passage, is replaced by an LLM-generated variant. The LLM is prompted to produce a new “middle” text segment. Constraints in the prompt ensure:

  • The regenerated content fits the surrounding passage logically.
  • Length, format, and style align with the original.
  • Label/sentiment (e.g., Positive, Numeric) is preserved for structured tasks.
  • Vocabulary and syntactic structure introduce diversity beyond basic rewording.
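A regeneration prompt encoding these constraints might be assembled as follows; the wording and the optional `label` parameter are assumptions for illustration, not the paper's exact template:

```python
# Sketch of the regeneration phase: ask the LLM to rewrite the middle segment
# under explicit constraints. Prompt wording and the `label` parameter are
# illustrative assumptions, not the paper's exact template.

def build_regeneration_prompt(preceding, seed, subsequent, label=None):
    constraints = [
        "fit logically between the surrounding sentences",
        "match the original's length, format, and style",
        "use different vocabulary and sentence structure than the original",
    ]
    if label is not None:
        constraints.append(f"preserve the original label: {label}")
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Preceding sentence: {preceding}\n"
        f"Original text: {seed}\n"
        f"Subsequent sentence: {subsequent}\n\n"
        f"Rewrite the original text so that it:\n{bullet_list}\n"
        f"Return only the rewritten text."
    )
```

The label constraint is only appended for structured tasks, mirroring the conditional preservation requirement above.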

This two-stage framework enables LLMs to synthesize content-level transformations by using the expanded context as an anchor, allowing knowledge emergence unobtainable via simple lexical or syntactic manipulations.

2. Leveraging LLM Knowledge Emergence for Diversity

LMTransplant capitalizes on LLMs' capacity to produce content-rich, semantically novel language:

  • During regeneration, context-rich prompts induce the generation of new facts, expressions, or even domain-specific terms.
  • This process can yield outputs where, for example, "our galaxy" becomes "the Milky Way" or a movie review is extended to reflect broader positive sentiment, outperforming models that simply rephrase the original.

Traditional methods such as back-translation mostly yield trivial permutations ("How big is our galaxy?" to "How great is our galaxy?"); LMTransplant achieves creative recombinations driven by world knowledge encoded in LLMs, while prompt constraints enforce semantic and label consistency.

3. Quantitative Evaluation of Diversity and Fidelity

To measure the intrinsic diversity and controllable fidelity of augmented texts, LMTransplant adopts:

  • Distinct-N:

$$\text{Distinct-N} = \frac{\text{Number of unique } n\text{-grams}}{\text{Total number of } n\text{-grams}}$$

Higher values indicate greater lexical diversity.
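The metric can be computed directly from a corpus of augmented texts; whitespace tokenization here is a simplifying assumption:

```python
# Distinct-N as defined above: unique n-grams divided by total n-grams.
# Whitespace tokenization is a simplifying assumption for this sketch.

def distinct_n(texts, n=2):
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For example, `distinct_n(["a a a a"], n=2)` yields 1/3: three bigrams, all identical.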

  • Semantic Variability:

$$\text{Semantic Variability} = 1 - \text{BERTScore}$$

Computed via BERTScore (Wang et al., 20 Aug 2025), this quantifies how much new semantic content the variant introduces; lower values indicate closer similarity to the original.
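Computing the true metric requires the `bert-score` package and a pretrained model; as a dependency-free illustration of the formula's shape, the sketch below substitutes a crude bag-of-words F1 overlap for BERTScore (an assumption, far weaker than embedding-based similarity):

```python
# Illustration of Semantic Variability = 1 - similarity. The paper uses
# BERTScore; here a bag-of-words F1 overlap stands in as a crude,
# dependency-free proxy so the formula's behavior is visible.

from collections import Counter

def overlap_f1(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.split()), Counter(reference.split())
    common = sum((c & r).values())  # multiset intersection of token counts
    if not common:
        return 0.0
    precision = common / sum(c.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def semantic_variability(candidate: str, reference: str) -> float:
    return 1.0 - overlap_f1(candidate, reference)
```

Identical texts score 0.0 variability and fully disjoint texts score 1.0, matching the intended direction of the metric.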

  • Label consistency is explicitly monitored during generation because prompts enforce the preservation of label or classification attributes.

Reported results show LMTransplant achieves higher Distinct-N and increased semantic variability than baseline approaches, while preserving core structural and label attributes.

4. Scalability and Downstream Task Performance

LMTransplant demonstrates exceptional scalability:

  • As the volume of seed texts and the number of variants per seed grow, downstream model performance (accuracy, macro F1 score, etc.) continues to improve.
  • For classification, QA, and NER, the gain is particularly pronounced in low-resource scenarios.
  • The prompt-driven architecture means adaptation to new domains or new label constraints is straightforward, without needing to retrain the entire augmentation pipeline.
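Operationally, this scaling is a simple loop over seeds and variants; `augment_once` is a hypothetical wrapper around the transplant-then-regenerate phases described in Section 1:

```python
# Sketch of scaling augmentation: k variants per seed. `augment_once` is a
# hypothetical wrapper around the transplant-then-regenerate phases.

def augment_dataset(seeds, k, augment_once):
    """Return (text, label) pairs: each seed contributes k augmented variants."""
    augmented = []
    for text, label in seeds:
        for _ in range(k):
            augmented.append((augment_once(text, label), label))
    return augmented
```

Labels pass through unchanged, since label preservation is enforced inside the regeneration prompt rather than by post-hoc relabeling.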

Empirical evaluations indicate augmented datasets built with LMTransplant enhance model generalization, outperforming those produced with traditional augmentation.

5. Impact on Content-Level Augmentation and Core Attribute Preservation

LMTransplant substantially improves content-level diversity, providing augmentation that enriches model training in ways not possible with lexical-rephrasing schemes. It balances the introduction of knowledge-rich and creative output with mechanisms for retaining label and semantic fidelity.

  • Augmented text can incorporate logical or factual details directly related to but not explicit in the original (e.g., domain expansion, synonym substitution, narrative enrichment).
  • Prompt-level constraints and context-aware regeneration ensure that the augmented samples do not inadvertently shift class or sentiment, which is a common failure in naive augmentation strategies.

This suggests LMTransplant is particularly suited for training models in tasks requiring robust semantic coverage (e.g., question answering, entity recognition, and nuanced sentiment classification).

6. Methodological Considerations and Limitations

  • Prompt engineering is fundamental: carefully constructed prompts ensure the desired variety and attribute preservation.
  • The approach relies on LLMs’ underlying pretrained knowledge; performance depends on the expressiveness and knowledge coverage of the base LLM.
  • Although demonstrated to scale efficiently, computational cost increases with model size and augmented sample volume; batching and sampling strategies may be required in production.

A plausible implication is that further optimization in prompt construction and automatic constraint satisfaction can yield yet greater augmentation diversity without loss of fidelity.

7. Summary

LMTransplant is a prompt-based, LLM-powered data augmentation paradigm that executes a transplant-then-regenerate sequence to produce diverse, content-rich text variants. By harnessing knowledge emergence in LLMs, it surpasses traditional lexical-level augmentation: it systematically generates contextually and semantically novel text, robustly preserves essential attributes, and scales across domains and dataset sizes. Empirical validation confirms superior performance and practical applicability in contemporary NLP pipelines (Wang et al., 20 Aug 2025).

