UniPrompt: Task Facet Learning for LLM Prompt Optimization
- The paper introduces UniPrompt, a method that decomposes prompts into distinct semantic facets, achieving notable accuracy gains over both human-tuned and prior automated prompt-optimization approaches.
- It employs clustering techniques and a two-tier feedback mechanism to iteratively refine task-specific prompt sections for optimal performance.
- Empirical results across benchmarks like Ethos, ARC-Challenge, and GSM8K demonstrate UniPrompt’s effectiveness in enhancing LLM accuracy.
Task Facet Learning (UniPrompt) is a structured approach to prompt optimization for LLMs that decomposes the generation of effective prompts into the synthesis of multiple, loosely-coupled semantic facets. Rather than relying solely on iterative editing or automatic selection of in-context examples, Task Facet Learning explicitly discovers, organizes, and refines task-relevant sections within the prompt through clustering of training examples and a two-tiered LLM feedback mechanism. The UniPrompt algorithm implements this paradigm and has demonstrated superior accuracy compared to both human-tuned and state-of-the-art automated methods, especially for complex and long prompts (Juneja et al., 15 Jun 2024).
1. Formal Framework and Optimization Objective
The prompt optimization problem in Task Facet Learning is formalized as follows. Given:
- A black-box LLM solver function $f$, mapping any prompt $p$ to its accuracy $f(p)$ on a validation set $D_{\mathrm{val}}$.
- A small, labeled training set $D_{\mathrm{train}} = \{(x_i, y_i)\}_{i=1}^{n}$.
- A one-line high-level task description $d$.
The objective is to construct a prompt $p^{*}$ that maximizes validation accuracy:
$$p^{*} = \arg\max_{p} f(p).$$
UniPrompt posits that effective prompts can be written as a concatenation of loosely coupled semantic sections or facets, $p = s_1 \oplus s_2 \oplus \cdots \oplus s_k$, where each section $s_i$ addresses a distinct aspect of the target task (e.g., definitions, edge cases, analogies, reasoning strategies). The algorithm's optimization problem is then
$$\max_{s_1, \ldots, s_k} f(s_1 \oplus s_2 \oplus \cdots \oplus s_k).$$
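The objective can be made concrete with a minimal sketch in Python; the names here (`solver`, `accuracy`) are illustrative assumptions, not the paper's code, and the solver LLM is treated as a black box whose accuracy plays the role of $f(p)$.

```python
# Minimal sketch of the optimization objective (illustrative names, not the paper's code).
# The solver LLM is a black box; accuracy(...) plays the role of f(p) on the validation set.

from typing import Callable, Sequence, Tuple

def accuracy(solver: Callable[[str, str], str], prompt: str,
             val_set: Sequence[Tuple[str, str]]) -> float:
    """f(p): fraction of validation questions answered correctly under prompt p."""
    return sum(solver(prompt, x) == y for x, y in val_set) / len(val_set)

# Objective: p* = argmax_p f(p), searched over prompts built from facet sections s_1 ⊕ ... ⊕ s_k.
```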
2. Semantic Structure and Facet Discovery
UniPrompt imposes a structured format on candidate prompts, organizing them into semantic blocks that represent discovered task facets. The types of sections commonly realized and refined within UniPrompt include:
- Introduction or Task Description
- Definitions or Formal Criteria
- Key Examples or Illustrations
- Common Pitfalls / Counter-Examples
- Reasoning Strategies (e.g., “chain of thought”)
- Analogies or Intuitions
The precise set of facets is not fixed but is discovered and evolved automatically based on the clustering and feedback cycle. Empirically, constructing prompts via this structure shows diminishing marginal utility with respect to additional content, in contrast to simply appending more in-context examples.
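For illustration, a candidate prompt under this format can be viewed as a set of named sections concatenated into a single prompt string. The sketch below is hypothetical: the section names echo the list above, and `assemble_prompt` is an assumed helper, not UniPrompt's actual serialization.

```python
# Illustrative only: a candidate prompt as named facet sections (headers are assumptions).
facets = {
    "Introduction": "Determine whether the following text is hate speech or not.",
    "Definition": "Hate speech targets a protected group with intent to demean or incite violence.",
    "Common Pitfalls": "Beware of satire or irony that may not actually target a group.",
    "Reasoning Strategy": "Think step by step before giving the final label.",
}

def assemble_prompt(sections: dict[str, str]) -> str:
    """Concatenate loosely coupled sections into a single prompt p = s_1 ⊕ s_2 ⊕ ... ⊕ s_k."""
    return "\n\n".join(f"## {name}\n{text}" for name, text in sections.items())
```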
3. Clustering Mechanisms for Facet Elicitation
Facets are uncovered by clustering the training examples such that each cluster reflects shared failure modes or subtopics likely corresponding to distinct semantic facets. UniPrompt supports two clustering mechanisms:
- Topic-Based Clustering: An “expert” LLM, such as GPT-4, assigns each question $x_i$ to a rough sub-topic $t_i$. The set of sub-topics is then partitioned by the LLM into clusters $C_1, \ldots, C_m$.
- Feedback-Based Clustering: The current prompt is evaluated on all $(x_i, y_i) \in D_{\mathrm{train}}$. For each prediction error, the expert LLM generates a short feedback string $g_i$, which is then clustered into groups, yielding clusters aggregating examples with similar failure feedback. Mathematically, $C_j = \{x_i : g_i \text{ belongs to feedback group } j\}$.
Each cluster $C_j$ may be further subdivided into mini-batches to better localize feedback and aggregate common missing facets.
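The following sketch illustrates feedback-based clustering under simple assumptions: `expert_llm` is a callable returning a feedback string for each error, and TF-IDF plus k-means stands in for whatever grouping of feedback strings the expert LLM performs in practice.

```python
# Hedged sketch of feedback-based clustering (names and the TF-IDF/k-means grouping are
# illustrative assumptions, not the paper's exact pipeline).

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_by_feedback(errors, expert_llm, n_clusters=4, minibatch_size=4):
    """Group mispredicted (question, prediction, gold) triples by similarity of expert feedback."""
    # 1. Ask the expert LLM why each prediction failed.
    feedback = [expert_llm(f"Why is this answer wrong?\nQ: {x}\nPredicted: {pred}\nGold: {y}")
                for x, pred, y in errors]
    # 2. Embed the feedback strings and cluster them.
    n_clusters = min(n_clusters, len(feedback))
    vectors = TfidfVectorizer().fit_transform(feedback)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vectors)
    # 3. Collect examples per cluster and split each cluster into mini-batches.
    clusters = [[e for e, l in zip(errors, labels) if l == j] for j in range(n_clusters)]
    return [[c[i:i + minibatch_size] for i in range(0, len(c), minibatch_size)] for c in clusters]
```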
4. Two-Tier Feedback and Iterative Prompt Synthesis
The feedback mechanism in UniPrompt consists of a two-tier process within each cluster $C_j$:
- Mini-Batch Feedback: The solver LLM generates answers and chains of thought for a mini-batch $b \subset C_j$ using the current prompt $p$. All incorrect cases are provided to the expert LLM, which proposes a single edit or new section meant to address the shared failure (allowed operations: add, edit, or delete a section/subsection).
- Batch-Level Aggregation: The mini-batch feedback proposals are aggregated by the expert LLM into a concise cluster-level edit $e_j$. To encourage generalizability, aggregation includes a small number of random out-of-cluster errors.
Edits are applied to the prompt, $p' = \mathrm{apply}(p, e_j)$, generating new candidate prompts that are evaluated on a held-out set or the originating cluster. The prompt update rule uses either a greedy strategy, $p \leftarrow \arg\max_{q \in \{p, p'\}} f(q)$, or a beam search maintaining the top-$k$ prompts.
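A hedged sketch of one cluster-level refinement step follows; the prompt templates sent to the expert LLM and the `refine_on_cluster` signature are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of one cluster-level refinement: tier-1 mini-batch feedback, tier-2 aggregation,
# edit application, and greedy/beam re-ranking by the validation objective f.

def refine_on_cluster(beam, minibatches, solver, expert_llm, f, beam_size=1):
    """Refine every prompt in the beam against one cluster's mini-batches and keep the top-k."""
    candidates = list(beam)
    for prompt in beam:
        # Tier 1: one edit proposal per mini-batch, derived from its incorrect answers.
        proposals = []
        for batch in minibatches:
            preds = [(x, solver(prompt, x), y) for x, _, y in batch]
            wrong = [(x, p, y) for x, p, y in preds if p != y]
            if wrong:
                proposals.append(expert_llm(
                    f"Prompt:\n{prompt}\nFailures:\n{wrong}\n"
                    "Propose ONE section-level edit (add/edit/delete) fixing the shared failure."))
        if not proposals:
            continue
        # Tier 2: aggregate mini-batch proposals into a single concise cluster-level edit e_j.
        edit = expert_llm("Merge these proposals into one concise edit:\n" + "\n".join(proposals))
        candidates.append(expert_llm(
            f"Apply this edit to the prompt and return the full revised prompt.\n"
            f"Edit: {edit}\nPrompt:\n{prompt}"))
    # Greedy update when beam_size == 1; otherwise keep the top-k prompts by validation score f.
    return sorted(set(candidates), key=f, reverse=True)[:beam_size]
```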
5. End-to-End UniPrompt Workflow
UniPrompt’s workflow is as follows:
- Initialize a baseline prompt $p_0$ from the task description $d$ or a finetuned “Structured LM.”
- Cluster the training set $D_{\mathrm{train}}$ into clusters $C_1, \ldots, C_m$.
- Maintain a candidate beam of size $k$, initially containing only $p_0$.
- For each cluster $C_j$, in a round-robin or epoch manner:
  a. Partition $C_j$ into mini-batches.
  b. For each mini-batch, gather feedback and aggregate it into a cluster-level edit.
  c. Apply the aggregated edit to all prompts in the beam, generate new candidates, evaluate them, and keep the top-$k$.
  d. Optionally, stop early if there is no validation improvement within a fixed patience interval.
- After a predetermined number of epochs or convergence, return the highest-performing prompt in the beam.
A typical UniPrompt run (mini-batch size 4, 20 epochs) requires 5–7 hours of LLM calls at 0.5 queries per second.
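Putting the pieces together, a minimal end-to-end loop might look as follows, reusing the hypothetical helpers from the earlier sketches (`accuracy`, `cluster_by_feedback`, `refine_on_cluster`); the hyperparameter defaults are placeholders, not the paper's settings.

```python
# End-to-end workflow sketch wiring together the earlier hypothetical helpers.

def uniprompt(task_description, train_set, val_set, solver, expert_llm,
              beam_size=2, epochs=20, patience=3):
    f = lambda p: accuracy(solver, p, val_set)      # validation objective f(p)
    beam = [task_description]                       # p_0: the one-line task description
    best_score, stale = f(beam[0]), 0
    for _ in range(epochs):
        # Collect current errors on the training set and cluster them by failure feedback.
        preds = [(x, solver(beam[0], x), y) for x, y in train_set]
        errors = [(x, p, y) for x, p, y in preds if p != y]
        if not errors:
            break
        for minibatches in cluster_by_feedback(errors, expert_llm):   # round-robin over clusters
            beam = refine_on_cluster(beam, minibatches, solver, expert_llm, f, beam_size)
        score = f(beam[0])
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
            if stale >= patience:                   # early stop: no validation improvement
                break
    return beam[0]                                  # highest-scoring prompt in the beam
```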
6. Empirical Results and Performance Comparison
UniPrompt has been evaluated in both zero-shot and few-shot regimes on the following benchmarks:
- Ethos (hate speech classification)
- ARC-Challenge (science question answering)
- MedQA (medical multiple-choice)
- GSM8K (grade-school math word problems)
Using GPT-3.5-Turbo as the solver LLM (with chain-of-thought) and GPT-4 for expert feedback, UniPrompt yielded these zero-shot results (accuracy in %):
| Method | Ethos | ARC Ch. | MedQA | GSM8K |
|---|---|---|---|---|
| Expert Prompt | 74.1 | 78.4 | 53.1 | 78.9 |
| CoT | 72.0 | 79.4 | 50.3 | 76.3 |
| ProTeGi | 76.0 | 78.8 | 52.9 | 77.3 |
| Evoke | 63.5 | 89.0 | 52.8 | 81.0 |
| EvoPrompt | 81.6 | 89.9 | 50.3 | 81.4 |
| UniPrompt (greedy) | 93.7 | 90.5 | 55.5 | 82.3 |
| UniPrompt (beam) | 92.3 | 86.0 | 57.1 | 82.4 |
On Ethos, UniPrompt achieved an absolute improvement of 19.6 percentage points over the expert-tuned prompt (74.1% → 93.7%). On ARC-Challenge and GSM8K, UniPrompt matched or outperformed the best prior automated prompting methods.
In few-shot settings (using GPT-4 solver) against MedPrompt, UniPrompt matched MedPrompt’s performance without additional inference calls. With added kNN in-context selection, chain-of-thought, and ensembling, UniPrompt exceeded MedPrompt’s accuracy by up to 4 points on average on MedQA, PubMedQA, MedMCQA, and MMLU-MG.
A real-world search query intent task further demonstrated UniPrompt’s capability, where the learned prompt improved negative-class accuracy by 5.8 percentage points and overall accuracy by 1.9 percentage points compared to a highly-tuned manual prompt.
7. Illustrative Prompt Excerpts and Facet Genesis
Examples of UniPrompt-discovered prompt sections:
- Ethos (Hate Speech) Prompt:
- Introduction: “Determine whether the following text is hate speech or not.”
- Definition: “Hate speech targets a protected group with intent to demean or incite violence. It differs from mere profanity or insult…”
- Key Facets / Pitfalls:
- “Beware of satire/irony that may not actually target a group.”
- “Differentiate between negative opinions (‘I dislike X’) and calls to harm.”
- “Check context: sometimes neutral words become hateful if used in a broader hateful narrative.”
- GSM8K (Math Word Problems) Prompt:
- Introduction: “You are given a math word problem; solve it step by step.”
- Understanding Quantities: “Read each sentence carefully; identify whether quantities are static or per-unit over time.”
- Averaging Strategy: “To compute averages, sum all relevant values then divide by the number of periods; confirm units.”
- Past vs Future Events: “Distinguish starting vs ending points to compute durations correctly.”
These prompt sections originated from clusters of training examples exhibiting shared failure modes, as identified by clustered feedback. For instance, the “Pitfalls” section in the Ethos prompt arose due to a cluster consistently yielding incorrect predictions; aggregation of feedback from the corresponding mini-batches resulted in this new task facet. Similarly, the math-strategy blocks in GSM8K stemmed from clusters where the model failed on problems relating to temporal quantities and averaging, driving the creation of explicit sections addressing these facets.
Together, these results support the efficacy of Task Facet Learning through structured prompt sections, example clustering, and iterative LLM-guided feedback. The approach enables the synthesis of complex, interpretable prompts that are empirically superior to both manual and automated baselines, especially for challenging tasks where multiple semantic facets are critical (Juneja et al., 15 Jun 2024).