This paper investigates how the linguistic properties of prompts affect the performance of LLMs in zero-shot task execution (Leidinger et al., 2023). The core goal is to determine whether performance correlates with intuitive linguistic features such as simplicity, low perplexity, or high word frequency, which are often assumed to be beneficial.
Methodology
To achieve this, the researchers conducted a controlled study using semantically equivalent prompts that varied systematically in their linguistic structure.
- Linguistic Variations: They manually created parallel sets of 550 prompts for three tasks, altering one linguistic property at a time:
- Grammatical Mood: Interrogative (questions), Imperative (orders), Indicative (statements).
- Grammatical Aspect: Active vs. Passive voice.
- Grammatical Tense: Past, Present, Future.
- Modality: Using different modal verbs (can, could, may, might, must, should, would).
- Lexico-Semantic (Synonymy): Replacing content words (e.g., 'review' in sentiment analysis, 'correct'/'answer' in QA) with synonyms of varying frequency and ambiguity.
- Models: Five decoder-only LLMs of different sizes and types were evaluated:
- Pretrained-only: LLaMA 30b, OPT 1.3b, OPT 30b.
- Instruction-tuned: OPT-IML 1.3b, OPT-IML 30b (tuned on tasks including some used in the evaluation).
- Tasks & Datasets: Experiments covered three NLP tasks using six datasets:
- Sentiment Classification: SST-2, IMDB.
- Natural Language Inference (NLI): RTE, CB.
- Question Answering (QA): BoolQ, ARC-E.
- These datasets represented seen (supervised), cross-dataset, and unseen (cross-task) scenarios for the instruction-tuned models.
- Prompting Setup:
- Zero-Shot: No examples were provided in the prompt.
- Label Mapping: Task labels were mapped to simple target words (e.g., "yes"/"no" for sentiment/NLI, "A"/"B"/"C"/"D" for multiple-choice QA).
- Prediction: The model's prediction was determined by the target word assigned the highest log probability, regardless of its overall rank in the vocabulary.
- Postamble: A fixed phrase like "Choices: yes or no? Answer:" was added to each prompt to aid performance and reduce surface-form competition.
- Evaluation: Accuracy was measured on 500 random samples per dataset for each prompt (a code sketch of this setup follows).
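The following is a minimal sketch of this zero-shot setup, assuming a Hugging Face causal LM; the model name, prompt wordings, and label-to-word mapping are illustrative placeholders rather than the paper's exact artifacts.

```python
# A minimal sketch of the zero-shot setup described above, assuming a Hugging Face
# causal LM. The model name, prompt wordings, and label-word mapping below are
# illustrative placeholders, not the paper's exact artifacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-1.3b"  # stand-in for one of the evaluated models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Semantically equivalent prompt variants differing in a single linguistic
# property (here: grammatical mood), each ending in the fixed postamble.
POSTAMBLE = " Choices: yes or no? Answer:"
PROMPT_VARIANTS = {
    "interrogative": "{review} Is the sentiment of this review positive?" + POSTAMBLE,
    "imperative":    "{review} Decide whether the sentiment of this review is positive." + POSTAMBLE,
    "indicative":    "{review} This review expresses a positive sentiment." + POSTAMBLE,
}
LABEL_WORDS = {"positive": " yes", "negative": " no"}  # label-to-target-word mapping


@torch.no_grad()
def predict(prompt: str) -> str:
    """Return the label whose target word receives the highest next-token log probability."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    next_token_logits = model(input_ids).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    scores = {
        label: log_probs[tokenizer(word, add_special_tokens=False).input_ids[0]].item()
        for label, word in LABEL_WORDS.items()
    }
    return max(scores, key=scores.get)  # only the target words compete; vocabulary rank is ignored


def accuracy(prompt_template: str, samples: list[tuple[str, str]]) -> float:
    """Accuracy of one prompt variant over (review_text, gold_label) pairs."""
    hits = sum(predict(prompt_template.format(review=text)) == gold for text, gold in samples)
    return hits / len(samples)
```

Evaluating every variant in PROMPT_VARIANTS on the same 500 sampled examples yields the kind of per-prompt accuracies compared in the findings below.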
Key Findings
The paper revealed significant performance instability and contradicted several common assumptions about prompt engineering:
- High Performance Variability: Minor linguistic changes in prompts led to substantial and often unpredictable variations in accuracy across all models and tasks (e.g., differences of 10–17 percentage points from changing just a modal verb or synonym).
- No Consistent Linguistic Preference:
- Models did not universally prefer interrogative prompts (questions) over imperative prompts (orders), or vice versa; indicative prompts sometimes performed best.
- Active voice was not consistently better than passive voice; passive prompts sometimes yielded superior results.
- Present tense was not universally optimal, even for instruction-tuned models on seen tasks; past or future tense prompts occasionally performed much better.
- Replacing common words (like 'review' or 'answer') with rarer or more complex synonyms (like 'appraisal' or 'appropriate') often improved performance, even for models instruction-tuned on prompts containing the common words.
- Poor Prompt Transferability: Prompts optimized for one model/dataset combination performed poorly when transferred to other models or datasets, often showing drops of over 20 percentage points. This undermines the idea of "universal" best prompts.
- Instruction-Tuning is Not a Panacea: While instruction-tuning generally improved average performance and reduced variability compared to base models of the same size, significant instability remained (e.g., ranges of 5-12 percentage points on seen tasks for OPT-IML 30b).
- Model Size Doesn't Guarantee Stability: Larger models did not consistently show less performance variation due to prompt changes than smaller models.
- Performance Not Explained by Simple Metrics: The observed accuracy variations did not significantly correlate with:
- Prompt Perplexity: Lower-perplexity prompts did not consistently lead to higher accuracy, contradicting some prior findings; in some cases, higher perplexity correlated with better performance (a sketch of how prompt perplexity can be computed follows this list).
- Word Frequency: Using more frequent synonyms did not reliably improve results.
- Word Sense Ambiguity: Using less ambiguous synonyms did not reliably improve results.
- Prompt Length: No consistent correlation was found between prompt length (in tokens) and accuracy.
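For concreteness, here is one minimal way to compute the prompt perplexity referred to above, assuming a Hugging Face causal LM; the model name and example prompt are illustrative placeholders.

```python
# A minimal sketch of prompt-perplexity computation with a causal LM; the model
# name and example prompt are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").eval()


@torch.no_grad()
def prompt_perplexity(prompt: str) -> float:
    """Exponential of the mean negative log-likelihood the LM assigns to the prompt tokens."""
    encoded = tokenizer(prompt, return_tensors="pt")
    # With labels == input_ids, the model returns the mean cross-entropy over the (shifted) tokens.
    loss = model(input_ids=encoded.input_ids, labels=encoded.input_ids).loss
    return torch.exp(loss).item()


print(prompt_perplexity("Does the review express a positive sentiment? Choices: yes or no? Answer:"))
```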
Practical Implications & Recommendations
The findings have significant implications for practitioners using LLMs:
- Prompting is Unstable: Relying on a single prompt for evaluation or deployment is risky, as performance can be highly sensitive to subtle wording changes.
- Challenge Assumptions: Do not assume that simpler, lower-perplexity, or more frequent language will automatically yield the best prompts. Empirical testing is crucial.
- Current Benchmarking is Flawed: Evaluating models using single, often undisclosed prompts makes results unreliable, difficult to reproduce, and comparisons unfair.
- Need for Robust Evaluation: The paper proposes a more rigorous evaluation framework:
- Use Diverse Prompt Sets: Evaluate models on a large set of prompts covering linguistic variations (e.g., using controlled paraphrasing, synonym replacement).
- Report Variability: Report mean and variance (or range) of performance across the prompt set to give a clearer picture of model robustness.
- Treat Prompts as Hyperparameters: Select the best prompt(s) for a specific model/dataset combination using a held-out development set.
- Standardize Reporting: Include metrics characterizing the prompt set (perplexity, ambiguity distributions) and analyze their correlation with performance (e.g., using correlation coefficients or mixed-effects models); a minimal sketch of this reporting scheme follows.
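The sketch below aggregates accuracy over a prompt set, selects a prompt on a held-out dev split, and correlates accuracy with prompt perplexity. All numbers and prompt identifiers are made-up placeholders, and the specific choices (Spearman correlation, standard deviation) are one reasonable instantiation rather than the paper's prescribed recipe.

```python
# A minimal sketch of the proposed evaluation reporting, assuming per-prompt
# accuracies and perplexities are already available (e.g., from the sketches above).
# All values and prompt identifiers below are made-up placeholders.
import statistics
from scipy.stats import spearmanr

dev_accuracy  = {"p1": 0.81, "p2": 0.64, "p3": 0.77, "p4": 0.70}   # accuracy on a held-out dev set
test_accuracy = {"p1": 0.79, "p2": 0.66, "p3": 0.74, "p4": 0.71}   # accuracy on the test set
perplexity    = {"p1": 41.2, "p2": 18.5, "p3": 27.9, "p4": 33.0}   # prompt perplexity per prompt

# 1) Report variability across the whole prompt set, not a single number.
accs = list(test_accuracy.values())
print(f"mean={statistics.mean(accs):.3f}  std={statistics.stdev(accs):.3f}  "
      f"range={max(accs) - min(accs):.3f}")

# 2) Treat the prompt as a hyperparameter: select it on the dev set, report its test accuracy.
best_prompt = max(dev_accuracy, key=dev_accuracy.get)
print(f"best prompt on dev: {best_prompt}, test accuracy: {test_accuracy[best_prompt]:.3f}")

# 3) Characterize the prompt set: correlate accuracy with prompt-level metrics such as
#    perplexity (a mixed-effects model would be the heavier-weight alternative).
prompts = sorted(test_accuracy)
rho, p_value = spearmanr([perplexity[p] for p in prompts],
                         [test_accuracy[p] for p in prompts])
print(f"Spearman rho(perplexity, accuracy) = {rho:.2f} (p = {p_value:.2f})")
```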
Conclusion
LLM performance is highly sensitive to the linguistic structure of prompts in complex ways not captured by simple metrics like perplexity or word frequency. This highlights the instability of current prompting practices and underscores the need for more comprehensive and transparent evaluation methodologies that account for linguistic variability to reliably assess and compare LLM capabilities.