UniPrompt: Task Facet Learning for LLM Prompt Optimization

  • The paper introduces UniPrompt, a method that decomposes prompts into distinct semantic facets, achieving significant accuracy gains over traditional approaches.
  • It employs clustering techniques and a two-tier feedback mechanism to iteratively refine task-specific prompt sections for optimal performance.
  • Empirical results across benchmarks like Ethos, ARC-Challenge, and GSM8K demonstrate UniPrompt’s effectiveness in enhancing LLM accuracy.

Task Facet Learning (UniPrompt) is a structured approach to prompt optimization for LLMs that decomposes the generation of effective prompts into the synthesis of multiple, loosely-coupled semantic facets. Rather than relying solely on iterative editing or automatic selection of in-context examples, Task Facet Learning explicitly discovers, organizes, and refines task-relevant sections within the prompt through clustering of training examples and a two-tiered LLM feedback mechanism. The UniPrompt algorithm implements this paradigm and has demonstrated superior accuracy compared to both human-tuned and state-of-the-art automated methods, especially for complex and long prompts (Juneja et al., 15 Jun 2024).

1. Formal Framework and Optimization Objective

The prompt optimization problem in Task Facet Learning is formalized as follows. Given:

  • A black-box LLM solver function $f_{\rm LLM}\colon \mathcal{X} \to [0,1]$, mapping any prompt $P \in \mathcal{X}$ to its accuracy on a validation set $D_v$.
  • A small, labeled training set $D_t = \{(q_i, a_i)\}_{i=1}^{N}$.
  • A one-line high-level task description $T$.

The objective is to construct a prompt $P$ that maximizes validation accuracy:

$$P^* = \arg\max_{P \in \mathcal{X}} f_{\rm LLM}(P; D_v)$$

UniPrompt posits that effective prompts can be written as a concatenation of $m$ loosely coupled semantic sections or facets:

$$P = [S_1 \,\|\, S_2 \,\|\, \cdots \,\|\, S_m]$$

where each section $S_j$ addresses a distinct aspect of the target task (e.g., definitions, edge cases, analogies, reasoning strategies). The algorithm’s optimization problem is then:

$$\max_{m,\; S_1, \dots, S_m} f_{\rm LLM}([S_1 \| \cdots \| S_m];\ D_v)$$
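To make the objective concrete, the following is a minimal Python sketch of the sectioned-prompt representation and the validation-accuracy objective $f_{\rm LLM}(P; D_v)$. The `solver` callable is a placeholder assumption standing in for any black-box LLM that maps a prompt and question to an answer string; it is not part of the published implementation.

```python
from typing import Callable, List, Tuple

def assemble_prompt(sections: List[str]) -> str:
    """Concatenate loosely coupled facet sections S_1 || ... || S_m into one prompt."""
    return "\n\n".join(sections)

def prompt_accuracy(sections: List[str],
                    validation_set: List[Tuple[str, str]],
                    solver: Callable[[str, str], str]) -> float:
    """Estimate f_LLM(P; D_v): the fraction of validation questions the
    hypothetical black-box solver answers correctly under prompt P."""
    prompt = assemble_prompt(sections)
    correct = sum(solver(prompt, q).strip() == a.strip() for q, a in validation_set)
    return correct / max(len(validation_set), 1)
```

Under this framing, prompt optimization amounts to searching over the number and content of `sections` to maximize `prompt_accuracy`.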

2. Semantic Structure and Facet Discovery

UniPrompt imposes a structured format on candidate prompts, organizing them into semantic blocks that represent discovered task facets. The types of sections commonly realized and refined within UniPrompt include:

  • Introduction or Task Description
  • Definitions or Formal Criteria
  • Key Examples or Illustrations
  • Common Pitfalls / Counter-Examples
  • Reasoning Strategies (e.g., “chain of thought”)
  • Analogies or Intuitions

The precise set of facets is not fixed but is discovered and evolved automatically based on the clustering and feedback cycle. Empirically, constructing prompts via this structure shows diminishing marginal utility with respect to additional content, in contrast to simply appending more in-context examples.
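Purely as an illustration (the section titles and contents below are adapted from the facet list above and the excerpts in Section 7, not prescribed by UniPrompt), a structured prompt can be rendered from an ordered mapping of facet titles to facet bodies:

```python
from collections import OrderedDict

def render_structured_prompt(facets: "OrderedDict[str, str]") -> str:
    """Render named facet sections as titled blocks of a single prompt."""
    return "\n\n".join(f"## {title}\n{body}" for title, body in facets.items())

# Illustrative facet content adapted from the Ethos excerpts in Section 7.
facets = OrderedDict([
    ("Task Description", "Determine whether the following text is hate speech or not."),
    ("Definition", "Hate speech targets a protected group with intent to demean or incite violence."),
    ("Common Pitfalls", "Beware of satire or irony that may not actually target a group."),
])
print(render_structured_prompt(facets))
```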

3. Clustering Mechanisms for Facet Elicitation

Facets are uncovered by clustering the training examples such that each cluster reflects shared failure modes or subtopics likely corresponding to distinct semantic facets. UniPrompt supports two clustering mechanisms:

  • Topic-Based Clustering: An “expert” LLM, such as GPT-4, assigns each question $q_i$ to a rough sub-topic $t_i$. The set $\{t_i\}$ is then partitioned by the LLM into $l$ clusters $\{C_1, \dots, C_l\}$.
  • Feedback-Based Clustering: The current prompt $P$ is evaluated on all $(q_i, a_i)$. For each prediction error, the expert LLM generates a short feedback string $F_i$, which is then clustered into groups, yielding clusters $C_j$ aggregating examples with similar failure feedback. Mathematically,

$$C = \{C_1, \ldots, C_l\}, \quad \bigcup_j C_j = D_t, \quad C_j \cap C_{j'} = \emptyset \;\; (j \neq j')$$

Each cluster $C_j$ may be further subdivided into mini-batches $m_{j,1}, m_{j,2}, \ldots$ to better localize feedback and aggregate common missing facets.
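The control flow of feedback-based clustering can be sketched as follows. The callables `solve`, `explain_error`, and `cluster_feedback` are hypothetical stand-ins for the solver LLM, the expert LLM’s per-error feedback, and the expert LLM’s grouping of feedback strings; only the surrounding bookkeeping is shown.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question q_i, gold answer a_i)

def feedback_clusters(prompt: str,
                      train_set: List[Example],
                      solve: Callable[[str, str], str],
                      explain_error: Callable[[str, str, str], str],
                      cluster_feedback: Callable[[List[str]], List[int]],
                      minibatch_size: int = 4) -> Dict[int, List[List[Example]]]:
    """Group training errors by similar failure feedback, then split each
    cluster C_j into mini-batches m_{j,1}, m_{j,2}, ..."""
    errors: List[Example] = []
    feedback: List[str] = []
    for q, a in train_set:
        pred = solve(prompt, q)
        if pred.strip() != a.strip():
            errors.append((q, a))
            feedback.append(explain_error(q, a, pred))  # short feedback string F_i
    labels = cluster_feedback(feedback)                 # one cluster id per error
    clusters: Dict[int, List[Example]] = {}
    for example, j in zip(errors, labels):
        clusters.setdefault(j, []).append(example)
    # Subdivide each cluster into fixed-size mini-batches.
    return {j: [members[i:i + minibatch_size]
                for i in range(0, len(members), minibatch_size)]
            for j, members in clusters.items()}
```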

4. Two-Tier Feedback and Iterative Prompt Synthesis

The feedback mechanism in UniPrompt consists of a two-tier process within each cluster $C_j$:

  • Mini-Batch Feedback: The solver LLM generates answers and chains of thought for a mini-batch $m \subseteq C_j$ using the current prompt $P_t$. All incorrect cases are provided to the expert LLM, which proposes a single edit or new section $f_m$ meant to address the shared failure (allowed operations: add, edit, or delete a section/subsection).
  • Batch-Level Aggregation: The mini-batch feedback proposals $\{f_m\}$ are aggregated by the expert LLM into a concise cluster-level edit $F_j$. To encourage generalizability, aggregation includes a small number of random out-of-cluster errors.

Edits are applied to the prompt ($\texttt{apply}(P_t, F_j)$), generating new candidate prompts $\{Q^{(1)}, Q^{(2)}, \dots\}$, which are evaluated on a held-out set or the originating cluster. The prompt update rule uses either a greedy strategy,

$$P_{t+1} = \arg\max_{Q \in \{\texttt{apply}(P_t, F_j)\} \cup \{P_t\}} f_{\rm LLM}(Q; D_v),$$

or a beam search maintaining the top-$k$ prompts.
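A compact sketch of this update step, assuming hypothetical `apply_edit` (the expert LLM applying a cluster-level edit to a prompt) and `score` (the validation objective from Section 1) callables:

```python
from typing import Callable, List

def update_beam(beam: List[str],
                cluster_edit: str,
                apply_edit: Callable[[str, str], str],
                score: Callable[[str], float],
                k: int = 2) -> List[str]:
    """Keep the top-k prompts among the current beam and their edited variants.
    With k = 1 this reduces to the greedy rule over {apply(P_t, F_j)} ∪ {P_t}."""
    candidates = list(beam) + [apply_edit(p, cluster_edit) for p in beam]
    return sorted(candidates, key=score, reverse=True)[:k]
```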

5. End-to-End UniPrompt Workflow

UniPrompt’s workflow is as follows:

  1. Initialize a baseline prompt $P_0$ from the task description $T$ or a finetuned “Structured LM.”
  2. Cluster the training set $D_t$ into $\{C_1, \dots, C_l\}$.
  3. Maintain a candidate beam of size $k$, initially $\{P_0\}$.
  4. For each cluster $C_j$, in a round-robin or epoch manner:
     a. Partition $C_j$ into mini-batches $\{m_{j,1}, \dots\}$.
     b. For each mini-batch, gather feedback and aggregate it into a cluster-level edit.
     c. Apply the aggregated edit $F_j$ to all prompts in the beam, generate new candidates, evaluate them, and keep the top-$k$.
     d. Optionally, stop early if validation accuracy does not improve within a fixed patience interval.
  5. After a predetermined number of epochs or convergence, return the highest-performing prompt $P^*$ in the beam.

A typical UniPrompt run with $k = 2$, mini-batch size 4, $l = 10$ clusters, and 20 epochs requires 5–7 hours of LLM calls at 0.5 queries per second.
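Putting the pieces together, the loop below is a condensed, non-authoritative sketch of this workflow under the same assumptions as the earlier snippets; `feedback_clusters`, `propose_edit`, `aggregate_edits`, `apply_edit`, and `score` are all hypothetical callables with the solver- and expert-LLM plumbing already bound in (e.g., via `functools.partial`).

```python
def uniprompt(task_description, train_set, epochs=20, k=2, patience=3, *,
              feedback_clusters, propose_edit, aggregate_edits, apply_edit, score):
    """Condensed sketch of the UniPrompt loop; all keyword-only arguments are
    assumed callables that wrap the solver and expert LLMs."""
    beam = [task_description]                                      # step 1: baseline P_0
    best, stale = score(beam[0]), 0
    for _ in range(epochs):
        clusters = feedback_clusters(beam[0], train_set)           # step 2: {C_1, ..., C_l}
        for minibatches in clusters.values():                      # step 4: round-robin over clusters
            proposals = [propose_edit(beam[0], mb) for mb in minibatches]    # 4a, 4b: mini-batch feedback
            cluster_edit = aggregate_edits(proposals)                        # 4b: cluster-level edit F_j
            candidates = beam + [apply_edit(p, cluster_edit) for p in beam]  # 4c: new candidates
            beam = sorted(candidates, key=score, reverse=True)[:k]
            top = score(beam[0])
            best, stale = (top, 0) if top > best else (best, stale + 1)
            if stale >= patience:                                  # 4d: early stopping
                return beam[0]
    return beam[0]                                                 # step 5: best prompt P*
```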

6. Empirical Results and Performance Comparison

UniPrompt has been evaluated in both zero-shot and few-shot regimes on the following benchmarks:

  • Ethos (hate speech classification)
  • ARC-Challenge (science question answering)
  • MedQA (medical multiple-choice)
  • GSM8K (grade-school math word problems)

Using GPT-3.5-Turbo as the solver LLM (with chain-of-thought) and GPT-4 for expert feedback, UniPrompt yielded these zero-shot results (accuracy in %):

| Method             | Ethos | ARC-Challenge | MedQA | GSM8K |
|--------------------|-------|---------------|-------|-------|
| Expert Prompt      | 74.1  | 78.4          | 53.1  | 78.9  |
| CoT                | 72.0  | 79.4          | 50.3  | 76.3  |
| ProTeGi            | 76.0  | 78.8          | 52.9  | 77.3  |
| Evoke              | 63.5  | 89.0          | 52.8  | 81.0  |
| EvoPrompt          | 81.6  | 89.9          | 50.3  | 81.4  |
| UniPrompt (greedy) | 93.7  | 90.5          | 55.5  | 82.3  |
| UniPrompt (beam)   | 92.3  | 86.0          | 57.1  | 82.4  |

On Ethos, UniPrompt achieved an absolute improvement of +18% over expert-tuned prompts. On ARC-Challenge and GSM8K, UniPrompt matched or outperformed the best prior automated prompting methods.

In few-shot settings (using GPT-4 solver) against MedPrompt, UniPrompt matched MedPrompt’s performance without additional inference calls. With added kNN in-context selection, chain-of-thought, and ensembling, UniPrompt exceeded MedPrompt’s accuracy by up to 4 points on average on MedQA, PubMedQA, MedMCQA, and MMLU-MG.

A real-world search query intent task further demonstrated UniPrompt’s capability, where the learned prompt improved negative-class accuracy by 5.8 percentage points and overall accuracy by 1.9 percentage points compared to a highly-tuned manual prompt.

7. Illustrative Prompt Excerpts and Facet Genesis

Examples of UniPrompt-discovered prompt sections:

  • Ethos (Hate Speech) Prompt:
    • Introduction: “Determine whether the following text is hate speech or not.”
    • Definition: “Hate speech targets a protected group with intent to demean or incite violence. It differs from mere profanity or insult…”
    • Key Facets / Pitfalls:
      • “Beware of satire/irony that may not actually target a group.”
      • “Differentiate between negative opinions (‘I dislike X’) and calls to harm.”
      • “Check context: sometimes neutral words become hateful if used in a broader hateful narrative.”
  • GSM8K (Math Word Problems) Prompt:
    • Introduction: “You are given a math word problem; solve it step by step.”
    • Understanding Quantities: “Read each sentence carefully; identify whether quantities are static or per-unit over time.”
    • Averaging Strategy: “To compute averages, sum all relevant values then divide by the number of periods; confirm units.”
    • Past vs Future Events: “Distinguish starting vs ending points to compute durations correctly.”

These prompt sections originated from clusters of training examples exhibiting shared failure modes, as identified by clustered feedback. For instance, the “Pitfalls” section in the Ethos prompt arose due to a cluster consistently yielding incorrect predictions; aggregation of feedback from the corresponding mini-batches resulted in this new task facet. Similarly, the math-strategy blocks in GSM8K stemmed from clusters where the model failed on problems relating to temporal quantities and averaging, driving the creation of explicit sections addressing these facets.

Together, these results support the efficacy of Task Facet Learning through structured prompt sections, example clustering, and iterative LLM-guided feedback. The approach enables the synthesis of complex, interpretable prompts that are empirically superior to both manual and automated baselines, especially for challenging tasks where multiple semantic facets are critical (Juneja et al., 15 Jun 2024).
