Abstract
Universal Self-Adaptive Prompting (USP) is an approach designed to strengthen the zero-shot capabilities of LLMs. LLMs show remarkable zero-shot abilities, but zero-shot prompting typically lags behind few-shot in-context learning (ICL) because it lacks the guidance that labeled demonstrations provide, and real-world labels are often scarce. USP addresses this limitation by using the model's own outputs as pseudo-demonstrations for ICL. It requires only a small amount of unlabeled data, operates entirely at inference time, and applies across a wide variety of NLP tasks. Notably, USP distinguishes between three task types, Classification (CLS), Short-form Generation (SFG), and Long-form Generation (LFG), and applies a tailored selection mechanism to each to derive high-quality pseudo-demonstrations. Its performance was assessed across multiple tasks with the PaLM and PaLM 2 models.
Preliminaries
USP builds on ICL principles by using the model's own generated outputs as pseudo-demonstrations for zero-shot ICL, augmenting the test query with these demonstrations. The process involves two stages: an initial zero-shot prompting pass that produces candidate pseudo-demos, followed by a second pass that prepends the selected demos to the query to improve the prediction. The method is also closely related to self-consistency, in which an LLM decodes the same query multiple times to generate diverse predictions and takes the majority answer as the final output.
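A minimal sketch of this two-stage flow and of self-consistency voting is given below, assuming a hypothetical `llm_generate` callable (query string in, answer string out) and a black-box `score_fn`; the actual scoring functions are described in the next section, and this is an illustration rather than the paper's exact implementation.

```python
from collections import Counter


def self_consistency(llm_generate, query, num_samples=5):
    """Decode the same query several times and return the majority answer."""
    answers = [llm_generate(query) for _ in range(num_samples)]
    majority, _ = Counter(answers).most_common(1)[0]
    return majority


def usp_two_stage(llm_generate, score_fn, unlabeled_queries, test_query, k=3):
    """Stage 1: zero-shot answers on unlabeled queries become candidate
    pseudo-demos, ranked by a confidence score.
    Stage 2: the top-k pseudo-demos are prepended to the test query."""
    candidates = []
    for q in unlabeled_queries:
        answer = llm_generate(q)                 # plain zero-shot prediction
        candidates.append((score_fn(q, answer), q, answer))

    # Keep the k most confident (query, answer) pairs as pseudo-demonstrations.
    top_k = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    demo_block = "\n\n".join(f"Q: {q}\nA: {a}" for _, q, a in top_k)

    # Stage 2: a few-shot-style prompt built entirely from model outputs.
    prompt = f"{demo_block}\n\nQ: {test_query}\nA:"
    return llm_generate(prompt)
```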
Universal Self-Adaptive Prompting
USP applies a task-specific heuristic for pseudo-demo selection, motivated by the need to adapt confidence metrics to different task objectives. Because the nature of correct responses differs across task types, USP introduces category-specific scoring functions. For CLS tasks, USP uses the negative entropy of the probability distribution over class labels to estimate confidence. SFG tasks, which admit multiple valid short answers, require several decoding rounds and an entropy-based score over the sampled answers, while LFG tasks rely on pairwise metric evaluations over multiple sampled responses, given the high variability of long-form outputs. Importantly, USP handles these differences while remaining efficient, requiring only a small subset of the test set for pseudo-demo generation.
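The following hedged sketches illustrate the three category-specific confidence scores described above; exact normalizations follow the paper only loosely, and the pairwise similarity for LFG is left as a caller-supplied `pairwise_sim` function (an overlap metric such as ROUGE is a natural choice).

```python
import math
from collections import Counter


def cls_score(label_probs):
    """CLS: negative entropy of the predicted distribution over class labels.
    A value closer to zero means a more peaked, confident prediction."""
    return sum(p * math.log(p) for p in label_probs if p > 0)


def sfg_score(sampled_answers):
    """SFG: decode several times and score by the negative entropy of the
    empirical answer distribution; repeated answers signal confidence."""
    counts = Counter(sampled_answers)
    total = sum(counts.values())
    return sum((c / total) * math.log(c / total) for c in counts.values())


def lfg_score(sampled_answers, pairwise_sim):
    """LFG: average pairwise similarity between decoded samples; mutually
    consistent long-form outputs score higher."""
    pairs = [(a, b) for i, a in enumerate(sampled_answers)
             for b in sampled_answers[i + 1:]]
    if not pairs:
        return 0.0
    return sum(pairwise_sim(a, b) for a, b in pairs) / len(pairs)
```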
Evaluation and Results
Comparative analysis against standard baselines shows that USP not only surpasses standard zero-shot prompting on more than 40 tasks but also competes favorably with few-shot baselines, demonstrating its efficacy in improving zero-shot generalization with modest amounts of unlabeled data.
Key findings include superior performance on generative tasks and consistent gains with larger or more capable LLMs. The USP score correlates positively with ground-truth performance, indicating that it effectively identifies high-quality pseudo-demos, despite occasional underperformance relative to zero-shot baselines. The paper argues that the magnitude of USP's benefit grows with the model's uncertainty under zero-shot conditions, where added guidance is most needed. Overall, the findings suggest that USP is a cost-effective way to enhance zero-shot learning across a wide range of NLP tasks.