Analyzing Prompt Performance in LLMs Through Perplexity
The paper "Demystifying Prompts in LLMs via Perplexity Estimation" presents a detailed empirical investigation into the factors affecting the performance variance of prompts used in LLMs (LMs). The authors challenge a pivotal aspect of natural language processing, which is understanding why similar prompts can yield significantly different performance outcomes when used for zero- and few-shot learning tasks.
The paper's core hypothesis is that a prompt's effectiveness correlates with its perplexity, a measure of how expected the prompt's wording is under the model's own probability distribution. The authors argue that lower perplexity signals language closer to what the model frequently encountered during training, and that such familiar phrasing yields better task performance across a range of tasks and model architectures.
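To make the metric concrete, here is a minimal sketch of how prompt perplexity can be estimated with an off-the-shelf causal language model from Hugging Face transformers. GPT-2 stands in for the far larger OPT and Bloom models studied in the paper, and the `prompt_perplexity` helper is an illustrative name of ours, not the authors' code.

```python
# Minimal sketch: estimating prompt perplexity with a small causal LM.
# GPT-2 is a stand-in for the much larger OPT/Bloom models used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Perplexity = exp(average negative log-likelihood of the prompt's tokens)."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Lower values mean the wording is more "expected" by the model.
for p in ["Give the antonym of the following word:",
          "Produce the lexical opposite of the subsequent token:"]:
    print(f"{prompt_perplexity(p):8.2f}  {p}")
```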
Key findings from the paper include:
- Correlation Analysis: The paper reports statistically significant negative correlations between prompt perplexity and performance for multiple models, OPT (1.3B, 30B, and 175B parameters) and Bloom (176B parameters), on tasks including antonym prediction, word-level translation, and several classification tasks. Notably, OPT 175B showed correlations as strong as -0.81 on some tasks, indicating a robust link between lower perplexity and better task performance.
- Prompt Expansion Method: To evaluate the perplexity hypothesis systematically, the authors automatically expand a small set of manually written prompts using GPT-3 paraphrasing and backtranslation. The larger, more diverse candidate pool makes the analysis more robust and reveals clear gains when prompts are selected by perplexity (a backtranslation sketch follows this list).
- SPELL Method: The paper introduces SPELL (Selecting Prompts by Estimating LM Likelihood), which ranks candidate prompts by their perplexity and keeps the lowest-perplexity ones. Because it requires neither human intervention nor labeled data, it streamlines the search for high-performing prompts (see the selection sketch after this list).
- Experimental Validation: Experiments show that SPELL improves average performance across tasks by 1.8 accuracy points with OPT and 3.6 points with Bloom over manually written prompts, demonstrating its practical value. The selected prompts also exhibit lower performance variance, making results more stable.
- Implications and Future Directions: The findings suggest that perplexity can serve as an intrinsic, label-free signal for prompt engineering, enabling model-specific prompt optimization without large labeled datasets. This is especially valuable in real-world applications where hand-crafting prompts for each model and task is impractical.
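The prompt-expansion step can be approximated with open translation models. The sketch below backtranslates a prompt through one pivot language to obtain a paraphrase; the Marian models named here are stand-ins of our choosing, and the paper's GPT-3 paraphrasing step is omitted.

```python
# Sketch of backtranslation-based prompt expansion through one pivot language.
# The Helsinki-NLP Marian models are illustrative stand-ins; the paper additionally
# paraphrases prompts with GPT-3, which is not shown here.
from transformers import pipeline

to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def backtranslate(prompt: str) -> str:
    """English -> French -> English, yielding a (hopefully) fluent paraphrase."""
    pivot = to_fr(prompt)[0]["translation_text"]
    return to_en(pivot)[0]["translation_text"]

seed = "Give the antonym of the following word:"
candidates = {seed, backtranslate(seed)}  # a set drops trivial round-trips
print(candidates)
```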
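The selection step referenced in the SPELL bullet can be sketched as follows: average each candidate template's perplexity over a sample of task inputs and keep the template the model finds least surprising. The `{x}` placeholder, the helper names, and the optional Pearson check against a labeled development set are assumptions made for illustration; the code reuses `prompt_perplexity` from the first snippet and is not the authors' released implementation.

```python
# Sketch of SPELL-style selection: lower average perplexity -> preferred prompt.
# Reuses prompt_perplexity() from the first snippet; templates use a hypothetical
# "{x}" placeholder for the task input, e.g. "Give the antonym of {x}:".
from scipy.stats import pearsonr

def average_perplexity(template: str, inputs: list[str]) -> float:
    """Mean perplexity of the template instantiated with a sample of task inputs."""
    return sum(prompt_perplexity(template.format(x=x)) for x in inputs) / len(inputs)

def select_prompt(templates: list[str], inputs: list[str]) -> str:
    """No labels needed: pick the template the model finds least surprising."""
    return min(templates, key=lambda t: average_perplexity(t, inputs))

def perplexity_accuracy_correlation(perplexities: list[float],
                                    accuracies: list[float]) -> float:
    """Optional sanity check mirroring the paper's analysis; expect a negative value."""
    r, _ = pearsonr(perplexities, accuracies)
    return r
```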
In conclusion, the research clarifies a central question of prompt engineering: low perplexity is a useful predictor of prompt effectiveness in LLMs. By validating this hypothesis across diverse tasks and models, the paper offers practical guidance for choosing prompts and motivates further work on automated prompt design and its theoretical underpinnings in computational linguistics.