PromDA: Prompt-Based Data Augmentation
- PromDA is a prompt-based data augmentation framework for low-resource natural language understanding that uses soft prompts atop a frozen pre-trained language model.
- It employs dual-view prompt-tuning with cross-feed conditional generation to create diverse and high-quality synthetic data without relying on additional unlabeled text.
- The framework enhances accuracy through iterative NLU filtering, achieving significant performance gains over classical and semi-supervised augmentation strategies.
PromDA is a prompt-based data augmentation framework designed for low-resource natural language understanding (NLU) tasks. By training only a small set of soft prompts atop a frozen pre-trained language model (PLM), PromDA generates large-scale, high-quality synthetic data without any additional unlabeled in-domain text. The approach combines joint prompt-tuning with dual-view conditional generation, and filters the output with a supervised NLU model to ensure quality and task consistency. This leads to statistically significant performance gains over both classical and semi-supervised augmentation strategies, particularly in the few-shot regime (Wang et al., 2022).
1. Soft Prompt Model Architecture
PromDA relies on a frozen auto-regressive or sequence-to-sequence PLM (e.g., T5-Large with $d$-dimensional hidden states and $L$ layers), augmented with trainable continuous prompt vectors. At each transformer layer $j$, a set of prompt vectors $P_j$ is prepended to the input sequence. During the forward pass, these prompt vectors are concatenated ahead of the ordinary token hidden states $H_j$, so the attention and output computation at layer $j$ flows through $[P_j; H_j]$. Only the prompt vectors are updated during training, while the PLM parameters remain fixed. The number of trainable parameters per view is therefore small compared with the full PLM.
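As a rough illustration of the layer-wise prompt mechanism, the sketch below uses plain Python lists in place of tensors. The function names (`init_prompts`, `prepend_prompt`) and the toy sizes are assumptions made for this example and do not come from the paper; a real implementation would operate on the PLM's hidden states inside each attention layer.

```python
import random

def init_prompts(num_layers, prompt_len, hidden_dim, seed=0):
    """Create one prompt matrix (prompt_len x hidden_dim) per transformer layer."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)] for _ in range(prompt_len)]
        for _ in range(num_layers)
    ]

def prepend_prompt(layer_prompt, token_states):
    """Concatenate a layer's prompt vectors ahead of the token hidden states."""
    return layer_prompt + token_states

# Toy sizes (hypothetical, far smaller than T5-Large).
num_layers, prompt_len, hidden_dim, seq_len = 2, 3, 4, 5
prompts = init_prompts(num_layers, prompt_len, hidden_dim)
tokens = [[0.0] * hidden_dim for _ in range(seq_len)]

# The layer sees prompt_len + seq_len positions; the PLM weights stay untouched.
layer_input = prepend_prompt(prompts[0], tokens)

# Only the prompt entries would receive gradient updates.
trainable = num_layers * prompt_len * hidden_dim
```

In this toy setup the trainable count grows with the number of layers, the prompt length, and the hidden size, and is independent of the size of the frozen model's own weights.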
Prompt initialization uses a dedicated “Synonym Keywords → Sentence” pre-training task. With the PLM kept frozen, the prompts are trained to reconstruct an original text snippet from its RAKE-extracted keywords, some of which are randomly replaced by synonyms. This pre-training initializes the prompt parameters before task-specific tuning, mitigating overfitting in extreme low-shot scenarios.
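The construction of one pre-training pair can be sketched as follows. This is a minimal illustration under stated assumptions: `extract_keywords` is a crude stopword filter standing in for RAKE, `SYNONYMS` is a hypothetical toy table, and the comma-joined source format is chosen for readability rather than taken from the paper.

```python
import random

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for"}
# Hypothetical toy synonym table; a real setup would use a lexical resource.
SYNONYMS = {"film": ["movie"], "great": ["excellent"]}

def extract_keywords(text, max_keywords=3):
    """Stand-in for RAKE: keep the first few distinct non-stopword tokens."""
    keywords = []
    for raw in text.split():
        word = raw.strip(".,!?").lower()
        if word and word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords[:max_keywords]

def randomize_keywords(keywords, rng, replace_prob=0.5):
    """Swap a keyword for a synonym with probability replace_prob, if one exists."""
    out = []
    for word in keywords:
        if word in SYNONYMS and rng.random() < replace_prob:
            out.append(rng.choice(SYNONYMS[word]))
        else:
            out.append(word)
    return out

def make_pretraining_pair(text, rng):
    """Return (source, target): noisy keywords as input, original text as output."""
    keywords = randomize_keywords(extract_keywords(text), rng)
    return ", ".join(keywords), text

source, target = make_pretraining_pair("The film was great fun.", random.Random(0))
```

The target is always the untouched snippet, so the prompts learn to expand a sparse, slightly perturbed keyword set into fluent text.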
2. Dual-View Synthetic Data Generation
PromDA employs two distinct prompt-tuning pathways—Output View (OV) and Input View (IV)—to maximize generative diversity:
- Output View (OV): Conditioned on output tags or class labels. The OV prompts are trained so that a label sequence or class token generates a plausible utterance reflecting those semantics.
- Input View (IV): Conditioned on a minimal set of salient keywords, extracted from real text via RAKE. The IV prompts generate a naturalistic utterance given a keyword bag.
The synthetic data generation procedure involves:
- Prompt-tune both IV and OV models on the few-shot dataset.
- For each original instance, generate multiple synthetic outputs conditioned on both views via nucleus (top-p) sampling.
- Cross-feed outputs to introduce further diversity: e.g., run OV prompts on IV outputs and vice versa.
- Aggregate all generated sets as a raw synthetic pool.
This dual-view plus cross-feed yields hundreds of unique synthetic data points from each few-shot seed example.
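The generation steps above can be sketched as a small driver function. Everything here is hypothetical scaffolding: `ov_generate` and `iv_generate` stand for the prompt-tuned generators of the two views, `extract_keywords` for the keyword extractor, and the toy stubs only exist so the example runs without a language model.

```python
import itertools

def generate_pool(seeds, ov_generate, iv_generate, extract_keywords, samples_per_seed=2):
    """Build the raw synthetic pool from both views plus cross-feeding.

    seeds: list of (text, label) pairs from the few-shot set.
    ov_generate(label) -> (text, label); iv_generate(keywords) -> (text, label).
    """
    ov_out, iv_out = [], []
    for text, label in seeds:
        for _ in range(samples_per_seed):
            ov_out.append(ov_generate(label))
            iv_out.append(iv_generate(extract_keywords(text)))

    # Cross-feed: outputs of one view become conditioning input for the other.
    iv_from_ov = [iv_generate(extract_keywords(text)) for text, _ in ov_out]
    ov_from_iv = [ov_generate(label) for _, label in iv_out]

    pool = ov_out + iv_out + iv_from_ov + ov_from_iv
    return list(dict.fromkeys(pool))  # drop exact duplicates, keep order

# Toy stand-ins for the two generators (not real models).
_counter = itertools.count()

def toy_ov(label):
    return (f"{label} sentence {next(_counter)}", label)

def toy_iv(keywords):
    return (f"text about {' '.join(keywords)} {next(_counter)}", "positive")

pool = generate_pool(
    [("a great film", "positive")], toy_ov, toy_iv, lambda t: t.split()[:2]
)
```

With one seed and two samples per view, the sketch produces four direct generations and four cross-fed ones, which shows how the pool grows multiplicatively with the sampling budget.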
3. Quality Filtering via Iterative Consistency Check
The synthetic sample pool exhibits significant variability, with potential for unfaithful or label-mismatched instances. PromDA mitigates this via iterative NLU-based filtering:
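- At round $r$, an NLU model $M_r$ predicts a label for each generated $(x, y)$ pair.
- Only those samples where $M_r$’s highest-probability label $\hat{y}$ matches the generator-assigned label $y$ are retained, with $\mathcal{D}_{\text{syn}}$ denoting the raw synthetic pool:
$$\mathcal{D}_r = \{(x, y) \in \mathcal{D}_{\text{syn}} : \hat{y} = y\}, \qquad \hat{y} = \arg\max_{y'} p_{M_r}(y' \mid x)$$
- $M_{r+1}$ is fine-tuned on the few-shot data together with $\mathcal{D}_r$, and the process iterates (a few rounds typically suffice).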
This mechanism rapidly elevates the average quality and label consistency of synthetic data.
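A minimal sketch of the filter-and-retrain loop is given below. The helper names, the default number of rounds, and the choice to retrain on the few-shot data plus the retained samples are assumptions for illustration; `toy_train` is a word-voting stand-in for an actual NLU model.

```python
def consistency_filter(pool, predict):
    """Keep (text, label) pairs whose label equals the model's top prediction."""
    return [(text, label) for text, label in pool if predict(text) == label]

def iterative_filter(few_shot, pool, train, rounds=2):
    """Alternate between filtering the pool and retraining the NLU model.

    train(examples) -> predict function; rounds is a hypothetical default.
    """
    predict = train(few_shot)
    kept = []
    for _ in range(rounds):
        kept = consistency_filter(pool, predict)
        predict = train(few_shot + kept)  # assumption: retrain on real + kept data
    return predict, kept

def toy_train(examples):
    """Toy 'NLU model': each word votes for the first label it was seen with."""
    vocab = {}
    for text, label in examples:
        for word in text.split():
            vocab.setdefault(word, label)

    def predict(text):
        votes = [vocab[w] for w in text.split() if w in vocab]
        if not votes:
            return None
        # sorted() makes tie-breaking deterministic.
        return max(sorted(set(votes)), key=votes.count)

    return predict

few_shot = [("great film", "pos"), ("awful film", "neg")]
pool = [("great story", "pos"), ("awful story", "pos"), ("awful plot", "neg")]
model, kept = iterative_filter(few_shot, pool, toy_train, rounds=2)
```

Here the mislabeled sample `("awful story", "pos")` is rejected in both rounds, while the two consistent samples survive and enlarge the training set for the next model.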
4. Experimental Setup and Benchmarks
PromDA is evaluated on low-resource sequence labeling (CoNLL-03 NER, Wikiann NER) and sentence classification (SST-2, RT) benchmarks. A shot-$k$ regime denotes $k$ labeled examples per output label, with metrics (micro-F1) averaged across 5 random seeds. The generation backbone is T5-Large (frozen); NLU filtering uses BERT-BASE.
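For sentence classification, a few-shot subset with a fixed number of examples per label can be drawn as in the sketch below. The function name `sample_k_shot` and the seeding scheme are assumptions for illustration; sequence labeling needs a more careful procedure because a single sentence can carry several entity types.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed):
    """Draw up to k (text, label) examples per label, reproducibly for a seed."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):  # fixed label order keeps runs reproducible
        items = by_label[label]
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset

data = [("a", "pos"), ("b", "pos"), ("c", "pos"), ("d", "neg"), ("e", "neg")]
subset = sample_k_shot(data, k=2, seed=0)
```

Repeating the draw with several seeds and averaging the downstream metric mirrors the multi-seed evaluation described above.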
The overall training pipeline includes:
- Prompt pre-training for initial soft prompt vectors (about 100k steps on C4).
- Few-shot fine-tuning for prompt models (1k–5k steps).
- Synthetic data generation with heavy duplication and nucleus sampling.
- Downstream BERT NLU training with Adam optimizer.
- Consistency-based iterative filtering of generated pool.
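The generation step relies on nucleus (top-p) sampling, which can be sketched for a single decoding step as follows. This is a generic illustration of the technique over a toy probability vector, not code from PromDA; the `top_p` value in the usage line is a hypothetical choice.

```python
import random

def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest top-ranked set whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for index in ranked:
        nucleus.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    # Sample within the nucleus in proportion to the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
draws = [nucleus_sample(probs, top_p=0.7, rng=rng) for _ in range(200)]
```

With `top_p=0.7` only the two most likely tokens can be drawn, so low-probability tokens are excluded while some randomness is preserved for diversity.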
5. Performance and Comparative Analysis
Tables below summarize PromDA's gains:
| Benchmark | CoNLL-03 (shot-10) | Wikiann (shot-10) | SST-2 (shot-10) | RT (shot-10) |
|---|---|---|---|---|
| Baseline | 72.7 | 50.8 | 66.1 | 57.8 |
| PromDA | 77.5 | 58.3 | 81.4 | 73.4 |
| Δ | +4.8 | +7.5 | +15.3 | +15.6 |

| Benchmark | CoNLL-03 (shot-100) | Wikiann (shot-100) | SST-2 (shot-100) | RT (shot-100) |
|---|---|---|---|---|
| Baseline | 77.8 | 56.1 | 71.7 | 65.4 |
| PromDA | 80.1 | 65.1 | 83.2 | 75.4 |
| Δ | +2.3 | +9.0 | +11.5 | +10.0 |
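The gain rows can be recomputed directly from the baseline and PromDA scores in the tables. The short check below only restates the reported numbers; the dictionary keys are labels chosen for this example.

```python
# Scores in table order: CoNLL-03, Wikiann, SST-2, RT.
baseline = {"shot-10": [72.7, 50.8, 66.1, 57.8], "shot-100": [77.8, 56.1, 71.7, 65.4]}
promda = {"shot-10": [77.5, 58.3, 81.4, 73.4], "shot-100": [80.1, 65.1, 83.2, 75.4]}

def gains(before, after):
    """Per-benchmark improvement, rounded to one decimal like the tables."""
    return [round(a - b, 1) for b, a in zip(before, after)]

deltas = {shot: gains(baseline[shot], promda[shot]) for shot in baseline}
```

The recomputed values match the gain rows, and they show the sentence classification gains exceeding the NER gains in both regimes.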
Compared to classical augmentation, full PLM fine-tuning (LAMBADA), and state-of-the-art semi-supervised methods (MetaST), PromDA achieves higher accuracy without any actual unlabeled in-domain data. When used in tandem with self-training on real data, PromDA yields additive performance gains.
6. Synthetic Output Quality and Complementarity
PromDA-generated samples are semantically and structurally novel relative to the original data, especially after cross-view generation. In sequence labeling, generated utterances include new entity permutations and surface forms; in sentiment classification, new compositional and syntactic variants arise.
Combining PromDA synthetic data with self-training on unlabeled in-domain samples produces further F1 increases, indicating that the diversity and structure of prompt-generated synthetic examples are not subsumed by standard self-training methods.
7. Strengths, Limitations, and Outlook
Strengths:
- No requirement for external (real) unlabeled data; all augmentation is derived from a frozen PLM and learned prompts.
- Parameter-efficient: only a small number of prompt parameters are trained per prompt set.
- Effective dual-view, cross-fed generation strategy leads to high diversity and coverage.
- Iterative NLU filtering rapidly eliminates spurious generations.
- Synthetic and real-data-augmented signals can be combined for maximal performance.
Limitations:
- Substantial pre-training cost to initialize prompt vectors (about 100k steps on C4).
- Focused exclusively on T5-Large (generation) and BERT-BASE (NLU); extension to other architectures remains unstudied.
- Scope limited to NER and sentiment classification, not covering more complex NLU or QA; unknown transferability.
- The iterative scheme depends on the accuracy of the filtering NLU model in early rounds, and might be less robust if this model is weak.
In summary, PromDA shows that soft prompt-tuned generation, combined with multi-view conditioning, consistency filtering, and parameter-efficient training, can deliver strong and diverse data augmentation for NLU, outperforming both classical and recent semi-supervised approaches, particularly in low-resource and few-shot settings (Wang et al., 2022).