This paper investigates how the linguistic properties of prompts affect the performance of LLMs in zero-shot task execution (Leidinger et al., 2023). The core goal is to determine whether performance correlates with intuitive linguistic features such as simplicity, low perplexity, or high word frequency, which are often assumed to be beneficial.
Methodology
To achieve this, the researchers conducted a controlled study using semantically equivalent prompts that varied systematically in their linguistic structure.
- Linguistic Variations: They manually created parallel sets of 550 prompts for three tasks, altering one linguistic property at a time:
- Grammatical Mood: Interrogative (questions), Imperative (orders), Indicative (statements).
- Grammatical Aspect: Active vs. Passive voice.
- Grammatical Tense: Past, Present, Future.
- Modality: Using different modal verbs (can, could, may, might, must, should, would).
- Lexico-Semantic (Synonymy): Replacing content words (e.g., 'review' in sentiment analysis, 'correct'/'answer' in QA) with synonyms of varying frequency and ambiguity.
- Models: Five decoder-only LLMs of different sizes and types were evaluated:
- Pretrained-only: LLaMA 30b, OPT 1.3b, OPT 30b.
- Instruction-tuned: OPT-IML 1.3b, OPT-IML 30b (tuned on tasks including some used in the evaluation).
- Tasks & Datasets: Experiments covered three NLP tasks using six datasets:
- Sentiment Classification: SST-2, IMDB.
- Natural Language Inference (NLI): RTE, CB.
- Question Answering (QA): BoolQ, ARC-E.
- These datasets represented seen (supervised), cross-dataset, and unseen (cross-task) scenarios for the instruction-tuned models.
- Prompting Setup:
- Zero-Shot: No examples were provided in the prompt.
- Label Mapping: Task labels were mapped to simple target words (e.g., "yes"/"no" for sentiment/NLI, "A"/"B"/"C"/"D" for multiple-choice QA).
- Prediction: The model's prediction was determined by the target word assigned the highest log probability, regardless of its overall rank in the vocabulary.
- Postamble: A fixed phrase like "Choices: yes or no? Answer:" was added to each prompt to aid performance and reduce surface-form competition.
- Evaluation: Accuracy was measured on 500 random samples per dataset for each prompt (a code sketch of this setup follows).
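The following is a minimal sketch of this zero-shot setup, assuming a Hugging Face causal LM; the model name, prompt wordings, and label-to-word mapping are illustrative placeholders rather than the paper's exact artifacts.

```python
# A minimal sketch of the zero-shot setup described above, assuming a Hugging Face
# causal LM. The model name, prompt wordings, and label-word mapping below are
# illustrative placeholders, not the paper's exact artifacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-1.3b"  # stand-in for one of the evaluated models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Semantically equivalent prompt variants differing in a single linguistic
# property (here: grammatical mood), each ending in the fixed postamble.
POSTAMBLE = " Choices: yes or no? Answer:"
PROMPT_VARIANTS = {
    "interrogative": "{review} Is the sentiment of this review positive?" + POSTAMBLE,
    "imperative":    "{review} Decide whether the sentiment of this review is positive." + POSTAMBLE,
    "indicative":    "{review} This review expresses a positive sentiment." + POSTAMBLE,
}
LABEL_WORDS = {"positive": " yes", "negative": " no"}  # label-to-target-word mapping


@torch.no_grad()
def predict(prompt: str) -> str:
    """Return the label whose target word receives the highest next-token log probability."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    next_token_logits = model(input_ids).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    scores = {
        label: log_probs[tokenizer(word, add_special_tokens=False).input_ids[0]].item()
        for label, word in LABEL_WORDS.items()
    }
    return max(scores, key=scores.get)  # only the target words compete; vocabulary rank is ignored


def accuracy(prompt_template: str, samples: list[tuple[str, str]]) -> float:
    """Accuracy of one prompt variant over (review_text, gold_label) pairs."""
    hits = sum(predict(prompt_template.format(review=text)) == gold for text, gold in samples)
    return hits / len(samples)
```

Evaluating every variant in PROMPT_VARIANTS on the same 500 sampled examples yields the kind of per-prompt accuracies compared in the findings below.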
Key Findings
The paper revealed significant performance instability and contradicted several common assumptions about prompt engineering:
- High Performance Variability: Minor linguistic changes in prompts led to substantial and often unpredictable variations in accuracy across all models and tasks (e.g., differences of 10–17 percentage points from changing just a modal verb or synonym).
- No Consistent Linguistic Preference:
- Models did not universally prefer interrogative prompts (questions) over imperative prompts (orders), or vice versa; indicative prompts sometimes performed best.
- Active voice was not consistently better than passive voice; passive prompts sometimes yielded superior results.
- Present tense was not universally optimal, even for instruction-tuned models on seen tasks; past or future tense prompts occasionally performed much better.
- Replacing common words (like 'review' or 'answer') with rarer or more complex synonyms (like 'appraisal' or 'appropriate') often improved performance, even for models instruction-tuned on prompts containing the common words.
- Poor Prompt Transferability: Prompts optimized for one model/dataset combination performed poorly when transferred to other models or datasets, often showing drops of over 20 percentage points. This undermines the idea of "universal" best prompts.
- Instruction-Tuning is Not a Panacea: While instruction-tuning generally improved average performance and reduced variability compared to base models of the same size, significant instability remained (e.g., ranges of 5-12 percentage points on seen tasks for OPT-IML 30b).
- Model Size Doesn't Guarantee Stability: Larger models did not consistently show less performance variation due to prompt changes than smaller models.
- Performance Not Explained by Simple Metrics: The observed accuracy variations did not significantly correlate with:
- Prompt Perplexity: Lower-perplexity prompts did not consistently lead to higher accuracy, contradicting some prior findings; in some cases, higher perplexity correlated with better performance (a sketch of how prompt perplexity can be computed follows this list).
- Word Frequency: Using more frequent synonyms did not reliably improve results.
- Word Sense Ambiguity: Using less ambiguous synonyms did not reliably improve results.
- Prompt Length: No consistent correlation was found between prompt length (in tokens) and accuracy.
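For concreteness, here is one minimal way to compute the prompt perplexity referred to above, assuming a Hugging Face causal LM; the model name and example prompt are illustrative placeholders.

```python
# A minimal sketch of prompt-perplexity computation with a causal LM; the model
# name and example prompt are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").eval()


@torch.no_grad()
def prompt_perplexity(prompt: str) -> float:
    """Exponential of the mean negative log-likelihood the LM assigns to the prompt tokens."""
    encoded = tokenizer(prompt, return_tensors="pt")
    # With labels == input_ids, the model returns the mean cross-entropy over the (shifted) tokens.
    loss = model(input_ids=encoded.input_ids, labels=encoded.input_ids).loss
    return torch.exp(loss).item()


print(prompt_perplexity("Does the review express a positive sentiment? Choices: yes or no? Answer:"))
```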
Practical Implications & Recommendations
The findings have significant implications for practitioners using LLMs:
- Prompting is Unstable: Relying on a single prompt for evaluation or deployment is risky, as performance can be highly sensitive to subtle wording changes.
- Challenge Assumptions: Do not assume that simpler, lower-perplexity, or more frequent language will automatically yield the best prompts. Empirical testing is crucial.
- Current Benchmarking is Flawed: Evaluating models using single, often undisclosed prompts makes results unreliable, difficult to reproduce, and comparisons unfair.
- Need for Robust Evaluation: The paper proposes a more rigorous evaluation framework:
- Use Diverse Prompt Sets: Evaluate models on a large set of prompts covering linguistic variations (e.g., using controlled paraphrasing, synonym replacement).
- Report Variability: Report mean and variance (or range) of performance across the prompt set to give a clearer picture of model robustness.
- Treat Prompts as Hyperparameters: Select the best prompt(s) for a specific model/dataset combination using a held-out development set.
- Standardize Reporting: Include metrics characterizing the prompt set (perplexity, ambiguity distributions) and analyze their correlation with performance (e.g., using correlation coefficients or mixed-effects models); a minimal sketch of this reporting scheme follows.
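The sketch below aggregates accuracy over a prompt set, selects a prompt on a held-out dev split, and correlates accuracy with prompt perplexity. All numbers and prompt identifiers are made-up placeholders, and the specific choices (Spearman correlation, standard deviation) are one reasonable instantiation rather than the paper's prescribed recipe.

```python
# A minimal sketch of the proposed evaluation reporting, assuming per-prompt
# accuracies and perplexities are already available (e.g., from the sketches above).
# All values and prompt identifiers below are made-up placeholders.
import statistics
from scipy.stats import spearmanr

dev_accuracy  = {"p1": 0.81, "p2": 0.64, "p3": 0.77, "p4": 0.70}   # accuracy on a held-out dev set
test_accuracy = {"p1": 0.79, "p2": 0.66, "p3": 0.74, "p4": 0.71}   # accuracy on the test set
perplexity    = {"p1": 41.2, "p2": 18.5, "p3": 27.9, "p4": 33.0}   # prompt perplexity per prompt

# 1) Report variability across the whole prompt set, not a single number.
accs = list(test_accuracy.values())
print(f"mean={statistics.mean(accs):.3f}  std={statistics.stdev(accs):.3f}  "
      f"range={max(accs) - min(accs):.3f}")

# 2) Treat the prompt as a hyperparameter: select it on the dev set, report its test accuracy.
best_prompt = max(dev_accuracy, key=dev_accuracy.get)
print(f"best prompt on dev: {best_prompt}, test accuracy: {test_accuracy[best_prompt]:.3f}")

# 3) Characterize the prompt set: correlate accuracy with prompt-level metrics such as
#    perplexity (a mixed-effects model would be the heavier-weight alternative).
prompts = sorted(test_accuracy)
rho, p_value = spearmanr([perplexity[p] for p in prompts],
                         [test_accuracy[p] for p in prompts])
print(f"Spearman rho(perplexity, accuracy) = {rho:.2f} (p = {p_value:.2f})")
```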
Conclusion
LLM performance is highly sensitive to the linguistic structure of prompts in complex ways not captured by simple metrics like perplexity or word frequency. This highlights the instability of current prompting practices and underscores the need for more comprehensive and transparent evaluation methodologies that account for linguistic variability to reliably assess and compare LLM capabilities.