UniPrompt: Task Facet Learning for LLM Prompt Optimization
- The paper introduces UniPrompt, a method that decomposes prompts into distinct semantic facets, achieving notable accuracy gains over both human-tuned and prior automated prompt-optimization approaches.
- It employs clustering techniques and a two-tier feedback mechanism to iteratively refine task-specific prompt sections for optimal performance.
- Empirical results across benchmarks like Ethos, ARC-Challenge, and GSM8K demonstrate UniPrompt’s effectiveness in enhancing LLM accuracy.
Task Facet Learning (UniPrompt) is a structured approach to prompt optimization for LLMs that decomposes the generation of effective prompts into the synthesis of multiple, loosely-coupled semantic facets. Rather than relying solely on iterative editing or automatic selection of in-context examples, Task Facet Learning explicitly discovers, organizes, and refines task-relevant sections within the prompt through clustering of training examples and a two-tiered LLM feedback mechanism. The UniPrompt algorithm implements this paradigm and has demonstrated superior accuracy compared to both human-tuned and state-of-the-art automated methods, especially for complex and long prompts (Juneja et al., 15 Jun 2024).
1. Formal Framework and Optimization Objective
The prompt optimization problem in Task Facet Learning is formalized as follows. Given:
- A black-box LLM solver function $f$, mapping any prompt $p$ to its accuracy $f(p)$ on a validation set $D_{\mathrm{val}}$.
- A small, labeled training set $D_{\mathrm{train}} = \{(x_i, y_i)\}_{i=1}^{n}$.
- A one-line high-level task description $d$.
The objective is to construct a prompt $p^{*}$ that maximizes validation accuracy:
$$p^{*} = \arg\max_{p} f(p).$$
UniPrompt posits that effective prompts can be written as a concatenation of loosely coupled semantic sections or facets, $p = s_1 \oplus s_2 \oplus \cdots \oplus s_k$, where each section $s_i$ addresses a distinct aspect of the target task (e.g., definitions, edge cases, analogies, reasoning strategies). The algorithm's optimization problem is then
$$\max_{s_1, \ldots, s_k} f(s_1 \oplus s_2 \oplus \cdots \oplus s_k).$$
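The objective can be made concrete with a minimal sketch in Python; the names here (`solver`, `accuracy`) are illustrative assumptions, not the paper's code, and the solver LLM is treated as a black box whose accuracy plays the role of $f(p)$.

```python
# Minimal sketch of the optimization objective (illustrative names, not the paper's code).
# The solver LLM is a black box; accuracy(...) plays the role of f(p) on the validation set.

from typing import Callable, Sequence, Tuple

def accuracy(solver: Callable[[str, str], str], prompt: str,
             val_set: Sequence[Tuple[str, str]]) -> float:
    """f(p): fraction of validation questions answered correctly under prompt p."""
    return sum(solver(prompt, x) == y for x, y in val_set) / len(val_set)

# Objective: p* = argmax_p f(p), searched over prompts built from facet sections s_1 ⊕ ... ⊕ s_k.
```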
2. Semantic Structure and Facet Discovery
UniPrompt imposes a structured format on candidate prompts, organizing them into semantic blocks that represent discovered task facets. The types of sections commonly realized and refined within UniPrompt include:
- Introduction or Task Description
- Definitions or Formal Criteria
- Key Examples or Illustrations
- Common Pitfalls / Counter-Examples
- Reasoning Strategies (e.g., “chain of thought”)
- Analogies or Intuitions
The precise set of facets is not fixed but is discovered and evolved automatically based on the clustering and feedback cycle. Empirically, constructing prompts via this structure shows diminishing marginal utility with respect to additional content, in contrast to simply appending more in-context examples.
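For illustration, a candidate prompt under this format can be viewed as a set of named sections concatenated into a single prompt string. The sketch below is hypothetical: the section names echo the list above, and `assemble_prompt` is an assumed helper, not UniPrompt's actual serialization.

```python
# Illustrative only: a candidate prompt as named facet sections (headers are assumptions).
facets = {
    "Introduction": "Determine whether the following text is hate speech or not.",
    "Definition": "Hate speech targets a protected group with intent to demean or incite violence.",
    "Common Pitfalls": "Beware of satire or irony that may not actually target a group.",
    "Reasoning Strategy": "Think step by step before giving the final label.",
}

def assemble_prompt(sections: dict[str, str]) -> str:
    """Concatenate loosely coupled sections into a single prompt p = s_1 ⊕ s_2 ⊕ ... ⊕ s_k."""
    return "\n\n".join(f"## {name}\n{text}" for name, text in sections.items())
```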
3. Clustering Mechanisms for Facet Elicitation
Facets are uncovered by clustering the training examples such that each cluster reflects shared failure modes or subtopics likely corresponding to distinct semantic facets. UniPrompt supports two clustering mechanisms:
- Topic-Based Clustering: An “expert” LLM, such as GPT-4, assigns each question $x_i$ to a rough sub-topic $t_i$. The set of sub-topics is then partitioned by the LLM into clusters $C_1, \ldots, C_m$.
- Feedback-Based Clustering: The current prompt is evaluated on all $(x_i, y_i) \in D_{\mathrm{train}}$. For each prediction error, the expert LLM generates a short feedback string $g_i$, which is then clustered into groups, yielding clusters aggregating examples with similar failure feedback. Mathematically, $C_j = \{x_i : g_i \text{ belongs to feedback group } j\}$.
Each cluster $C_j$ may be further subdivided into mini-batches to better localize feedback and aggregate common missing facets.
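The following sketch illustrates feedback-based clustering under simple assumptions: `expert_llm` is a callable returning a feedback string for each error, and TF-IDF plus k-means stands in for whatever grouping of feedback strings the expert LLM performs in practice.

```python
# Hedged sketch of feedback-based clustering (names and the TF-IDF/k-means grouping are
# illustrative assumptions, not the paper's exact pipeline).

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_by_feedback(errors, expert_llm, n_clusters=4, minibatch_size=4):
    """Group mispredicted (question, prediction, gold) triples by similarity of expert feedback."""
    # 1. Ask the expert LLM why each prediction failed.
    feedback = [expert_llm(f"Why is this answer wrong?\nQ: {x}\nPredicted: {pred}\nGold: {y}")
                for x, pred, y in errors]
    # 2. Embed the feedback strings and cluster them.
    n_clusters = min(n_clusters, len(feedback))
    vectors = TfidfVectorizer().fit_transform(feedback)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vectors)
    # 3. Collect examples per cluster and split each cluster into mini-batches.
    clusters = [[e for e, l in zip(errors, labels) if l == j] for j in range(n_clusters)]
    return [[c[i:i + minibatch_size] for i in range(0, len(c), minibatch_size)] for c in clusters]
```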
4. Two-Tier Feedback and Iterative Prompt Synthesis
The feedback mechanism in UniPrompt consists of a two-tier process within each cluster $C_j$:
- Mini-Batch Feedback: The solver LLM generates answers and chains of thought for a mini-batch $b \subset C_j$ using the current prompt $p$. All incorrect cases are provided to the expert LLM, which proposes a single edit or new section meant to address the shared failure (allowed operations: add, edit, or delete a section/subsection).
- Batch-Level Aggregation: The mini-batch feedback proposals are aggregated by the expert LLM into a concise cluster-level edit $e_j$. To encourage generalizability, aggregation includes a small number of random out-of-cluster errors.
Edits are applied to the prompt, $p' = \mathrm{apply}(p, e_j)$, generating new candidate prompts that are evaluated on a held-out set or the originating cluster. The prompt update rule uses either a greedy strategy, $p \leftarrow \arg\max_{q \in \{p, p'\}} f(q)$, or a beam search maintaining the top-$k$ prompts.
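A hedged sketch of one cluster-level refinement step follows; the prompt templates sent to the expert LLM and the `refine_on_cluster` signature are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of one cluster-level refinement: tier-1 mini-batch feedback, tier-2 aggregation,
# edit application, and greedy/beam re-ranking by the validation objective f.

def refine_on_cluster(beam, minibatches, solver, expert_llm, f, beam_size=1):
    """Refine every prompt in the beam against one cluster's mini-batches and keep the top-k."""
    candidates = list(beam)
    for prompt in beam:
        # Tier 1: one edit proposal per mini-batch, derived from its incorrect answers.
        proposals = []
        for batch in minibatches:
            preds = [(x, solver(prompt, x), y) for x, _, y in batch]
            wrong = [(x, p, y) for x, p, y in preds if p != y]
            if wrong:
                proposals.append(expert_llm(
                    f"Prompt:\n{prompt}\nFailures:\n{wrong}\n"
                    "Propose ONE section-level edit (add/edit/delete) fixing the shared failure."))
        if not proposals:
            continue
        # Tier 2: aggregate mini-batch proposals into a single concise cluster-level edit e_j.
        edit = expert_llm("Merge these proposals into one concise edit:\n" + "\n".join(proposals))
        candidates.append(expert_llm(
            f"Apply this edit to the prompt and return the full revised prompt.\n"
            f"Edit: {edit}\nPrompt:\n{prompt}"))
    # Greedy update when beam_size == 1; otherwise keep the top-k prompts by validation score f.
    return sorted(set(candidates), key=f, reverse=True)[:beam_size]
```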
5. End-to-End UniPrompt Workflow
UniPrompt’s workflow is as follows:
- Initialize a baseline prompt $p_0$ from the task description $d$ or a finetuned “Structured LM.”
- Cluster the training set $D_{\mathrm{train}}$ into clusters $C_1, \ldots, C_m$.
- Maintain a candidate beam of size $k$, initially containing only $p_0$.
- For each cluster $C_j$, in a round-robin or epoch manner:
  a. Partition $C_j$ into mini-batches.
  b. For each mini-batch, gather feedback and aggregate it into a cluster-level edit.
  c. Apply the aggregated edit to all prompts in the beam, generate new candidates, evaluate them, and keep the top-$k$.
  d. Optionally, stop early if there is no validation improvement within a fixed patience interval.
- After a predetermined number of epochs or convergence, return the highest-performing prompt in the beam.
A typical UniPrompt run (mini-batch size 4, 20 epochs) requires 5–7 hours of LLM calls at 0.5 queries per second.
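Putting the pieces together, a minimal end-to-end loop might look as follows, reusing the hypothetical helpers from the earlier sketches (`accuracy`, `cluster_by_feedback`, `refine_on_cluster`); the hyperparameter defaults are placeholders, not the paper's settings.

```python
# End-to-end workflow sketch wiring together the earlier hypothetical helpers.

def uniprompt(task_description, train_set, val_set, solver, expert_llm,
              beam_size=2, epochs=20, patience=3):
    f = lambda p: accuracy(solver, p, val_set)      # validation objective f(p)
    beam = [task_description]                       # p_0: the one-line task description
    best_score, stale = f(beam[0]), 0
    for _ in range(epochs):
        # Collect current errors on the training set and cluster them by failure feedback.
        preds = [(x, solver(beam[0], x), y) for x, y in train_set]
        errors = [(x, p, y) for x, p, y in preds if p != y]
        if not errors:
            break
        for minibatches in cluster_by_feedback(errors, expert_llm):   # round-robin over clusters
            beam = refine_on_cluster(beam, minibatches, solver, expert_llm, f, beam_size)
        score = f(beam[0])
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
            if stale >= patience:                   # early stop: no validation improvement
                break
    return beam[0]                                  # highest-scoring prompt in the beam
```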
6. Empirical Results and Performance Comparison
UniPrompt has been evaluated in both zero-shot and few-shot regimes on the following benchmarks:
- Ethos (hate speech classification)
- ARC-Challenge (science question answering)
- MedQA (medical multiple-choice)
- GSM8K (grade-school math word problems)
Using GPT-3.5-Turbo as the solver LLM (with chain-of-thought) and GPT-4 for expert feedback, UniPrompt yielded these zero-shot results (accuracy in %):
| Method | Ethos | ARC Ch. | MedQA | GSM8K |
|---|---|---|---|---|
| Expert Prompt | 74.1 | 78.4 | 53.1 | 78.9 |
| CoT | 72.0 | 79.4 | 50.3 | 76.3 |
| ProTeGi | 76.0 | 78.8 | 52.9 | 77.3 |
| Evoke | 63.5 | 89.0 | 52.8 | 81.0 |
| EvoPrompt | 81.6 | 89.9 | 50.3 | 81.4 |
| UniPrompt (greedy) | 93.7 | 90.5 | 55.5 | 82.3 |
| UniPrompt (beam) | 92.3 | 86.0 | 57.1 | 82.4 |
On Ethos, UniPrompt achieved an absolute improvement of 19.6 percentage points over the expert-tuned prompt (74.1% → 93.7%). On ARC-Challenge and GSM8K, UniPrompt matched or outperformed the best prior automated prompting methods.
In few-shot settings (using GPT-4 solver) against MedPrompt, UniPrompt matched MedPrompt’s performance without additional inference calls. With added kNN in-context selection, chain-of-thought, and ensembling, UniPrompt exceeded MedPrompt’s accuracy by up to 4 points on average on MedQA, PubMedQA, MedMCQA, and MMLU-MG.
A real-world search query intent task further demonstrated UniPrompt’s capability, where the learned prompt improved negative-class accuracy by 5.8 percentage points and overall accuracy by 1.9 percentage points compared to a highly-tuned manual prompt.
7. Illustrative Prompt Excerpts and Facet Genesis
Examples of UniPrompt-discovered prompt sections:
- Ethos (Hate Speech) Prompt:
- Introduction: “Determine whether the following text is hate speech or not.”
- Definition: “Hate speech targets a protected group with intent to demean or incite violence. It differs from mere profanity or insult…”
- Key Facets / Pitfalls:
- “Beware of satire/irony that may not actually target a group.”
- “Differentiate between negative opinions (‘I dislike X’) and calls to harm.”
- “Check context: sometimes neutral words become hateful if used in a broader hateful narrative.”
- GSM8K (Math Word Problems) Prompt:
- Introduction: “You are given a math word problem; solve it step by step.”
- Understanding Quantities: “Read each sentence carefully; identify whether quantities are static or per-unit over time.”
- Averaging Strategy: “To compute averages, sum all relevant values then divide by the number of periods; confirm units.”
- Past vs Future Events: “Distinguish starting vs ending points to compute durations correctly.”
These prompt sections originated from clusters of training examples exhibiting shared failure modes, as identified by clustered feedback. For instance, the “Pitfalls” section in the Ethos prompt arose due to a cluster consistently yielding incorrect predictions; aggregation of feedback from the corresponding mini-batches resulted in this new task facet. Similarly, the math-strategy blocks in GSM8K stemmed from clusters where the model failed on problems relating to temporal quantities and averaging, driving the creation of explicit sections addressing these facets.
Together, these results support the efficacy of Task Facet Learning through structured prompt sections, example clustering, and iterative LLM-guided feedback. The approach enables the synthesis of complex, interpretable prompts that are empirically superior to both manual and automated baselines, especially for challenging tasks where multiple semantic facets are critical (Juneja et al., 15 Jun 2024).