
Momentum-Aided Prompt Optimization (MAPO)

  • MAPO is a framework for automated prompt engineering that iteratively refines prompts using positive natural language gradients and momentum-based updates.
  • It employs beam search candidate expansion and UCB bandit selection to efficiently explore and select high-performing prompt trajectories.
  • MAPO significantly reduces convergence time and API calls while enhancing F1 scores, as demonstrated on fake-news and hate-speech detection tasks.

Momentum-Aided Prompt Optimization (MAPO) is a framework for automated prompt engineering that enhances the efficiency and efficacy of prompt optimization for LLMs. By employing positive textual gradients and a momentum mechanism, MAPO refines prompts iteratively, leveraging beam search and Upper Confidence Bound (UCB) bandit selection to balance candidate expansion and selection. This results in faster convergence, reduced API call requirements, and improved downstream task performance relative to prior art such as ProTeGi (Cui et al., 25 Oct 2024).

1. Positive Natural Language Gradients

MAPO conceptualizes prompt optimization as an iterative process in which the prompt $p$ (a natural language string) is continuously refined using feedback from an LLM. At each iteration $t$, given a minibatch $D_t = \{(x_i, y_i)\}_{i=1}^m$ of input–label pairs, the current prompt $p_t$ is applied to generate predictions $\hat{y}_i = \mathrm{LLM}(p_t, x_i)$. The subset $S_t = \{s_j\}$ of predictions matching the true labels (correct outputs) is extracted.

MAPO then invokes the LLM with a fixed gradient-eliciting template $\tau$ to "praise and improve" $p_t$ using each $s_j \in S_t$, producing textual guidance $\nabla p_t^{(j)}$. These gradients are interpreted as semantically directed improvements, not numeric derivatives. The set $\{\nabla p_t^{(j)}\}$ is passed back into the LLM via a static template $\alpha$, which aggregates and applies these gradients to yield a new candidate prompt. Formally, one imagines a mapping $\Phi$ from text to semantic space, and the average semantic direction is

$$g_t = \frac{1}{|S_t|} \sum_{j=1}^{|S_t|} \Phi(\nabla p_t^{(j)})$$

though $\Phi$ is not computed explicitly; aggregation occurs entirely in language space using the LLM.
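
As a concrete illustration of the gradient-elicitation step, consider the sketch below. It assumes a hypothetical `call_llm` wrapper around a black-box chat-completion API, and the template is a paraphrase of the "praise and improve" idea, not the paper's exact $\tau$.

```python
def call_llm(message: str, temperature: float = 0.0) -> str:
    """Placeholder for a black-box chat-completion call; plug in a real client here."""
    raise NotImplementedError

# Paraphrase of the gradient-eliciting template τ (assumed wording, not the original).
GRADIENT_TEMPLATE = (
    "The prompt below produced the following CORRECT output.\n"
    "Prompt: {prompt}\n"
    "Correct output: {output}\n"
    "Briefly state what the prompt did well and how to strengthen those qualities."
)

def positive_gradients(prompt: str, correct_outputs: list[str]) -> list[str]:
    """One textual gradient ∇p_t^(j) per correct prediction s_j in S_t."""
    return [
        call_llm(GRADIENT_TEMPLATE.format(prompt=prompt, output=s))
        for s in correct_outputs
    ]
```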

2. Momentum-Based Optimization

MAPO introduces a textual momentum buffer $v_t$, analogous to the velocity vector in momentum SGD, to stabilize optimization and mitigate oscillations and local minima. At each round,

$$v_t = \mu\,v_{t-1} + \eta\,g_t, \qquad p_{t+1} = p_t \oplus v_t$$

where $\eta$ is the learning rate controlling semantic step size and $\mu$ ($0 \leq \mu < 1$, typically $0.9$) is the momentum coefficient. The operation $\oplus$ denotes the semantic-update step, implemented by feeding $p_t$ and $v_t$ into the LLM via the template $\alpha$. To ensure diversity in updates, $g_t$ is typically sampled randomly from the pool of positive gradients in the top-$k$ beam.
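
In language space the update involves no literal arithmetic; a minimal sketch of how it might be realized (reusing the hypothetical `call_llm` stub from the previous snippet, with assumed template wording) is:

```python
import random

def momentum_step(prev_momentum: str, gradients: list[str], mu: float = 0.9) -> str:
    """Blend the momentum buffer v_{t-1} with a freshly sampled gradient g_t.

    μ and η have no numeric effect here; they are conveyed to the LLM as an
    instruction about how strongly to weigh history versus new feedback.
    """
    g_t = random.choice(gradients)  # random pick from the positive-gradient pool
    return call_llm(
        f"Accumulated improvement directions (weight {mu}):\n{prev_momentum}\n\n"
        f"New improvement direction:\n{g_t}\n\n"
        "Merge these into one concise improvement instruction, favouring the "
        "accumulated directions according to their weight."
    )

def apply_update(prompt: str, momentum: str) -> str:
    """The ⊕ step: rewrite the prompt following the momentum instruction (template α)."""
    return call_llm(
        f"Prompt:\n{prompt}\n\nImprovement instruction:\n{momentum}\n\n"
        "Rewrite the prompt applying the instruction. Return only the new prompt."
    )
```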

Ablation analysis reveals that disabling momentum slows convergence by approximately 54%, though final F1 scores remain nearly unchanged. This suggests momentum primarily accelerates and stabilizes optimization rather than raising absolute peak performance.

3. Beam Search Candidate Expansion

MAPO maintains a beam of size $k$ (default $k = 4$) containing the top-performing prompt candidates at each round. For each beam member $p_t^i$, a momentum-augmented gradient step generates $b$ new continuations (default $b = 3$), yielding up to $k \cdot b$ new candidates. Each candidate $p_c$ is evaluated on a separate validation minibatch using a task-specific metric (typically F1 score), and the top-$k$ scoring candidates form the beam for the next iteration.

This beam-based expansion ensures exploration of multiple prompt trajectories, increasing the likelihood of discovering high-performing semantic configurations.
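
Continuing the earlier sketches, one expansion round might look as follows. Here `correct_outputs_for` and `prev_momentum_for` are hypothetical bookkeeping helpers (minibatch evaluation and per-member momentum storage) that the paper leaves implicit.

```python
def correct_outputs_for(prompt: str) -> list[str]:
    """Hypothetical: run `prompt` on a fresh minibatch and return the correct predictions S_t."""
    raise NotImplementedError

def prev_momentum_for(prompt: str) -> str:
    """Hypothetical: look up the momentum text carried by this beam member."""
    raise NotImplementedError

def expand_beam(beam: list[str], b: int = 3) -> list[str]:
    """Generate up to len(beam) * b candidates via momentum-updated continuations."""
    candidates: list[str] = []
    for p in beam:
        grads = positive_gradients(p, correct_outputs_for(p))
        for _ in range(b):
            # Resampling a gradient per expansion keeps the b continuations diverse.
            v_t = momentum_step(prev_momentum_for(p), grads)
            candidates.append(apply_update(p, v_t))
    return candidates
```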

4. Upper Confidence Bound (UCB) Bandit Selection

To balance exploration and exploitation when selecting beam prompts, MAPO employs a UCB score. For candidate $i$, the score is

$$\mathrm{UCB}_i = \bar{s}_i + c\,\sqrt{\frac{2 \ln N}{n_i}}$$

where $\bar{s}_i$ is the running mean minibatch score (F1) for candidate $i$, $n_i$ the number of times candidate $i$ has been selected and evaluated, $N = \sum_j n_j$ the total number of evaluations, and $c > 0$ (default $1.0$) the exploration constant.

At each round, UCB scores are computed for all candidates, and the $k$ highest-scoring prompts are retained. This mechanism prevents premature pruning of promising under-explored candidates and supports efficient search over the prompt landscape.
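
The scoring and selection step is straightforward to implement. A minimal sketch follows, with candidate bookkeeping simplified to a dict (our assumption rather than the paper's data structure):

```python
import math

def ucb_score(mean_f1: float, n_i: int, n_total: int, c: float = 1.0) -> float:
    """UCB_i = mean F1 + c * sqrt(2 ln N / n_i); unvisited candidates score +inf."""
    if n_i == 0:
        return float("inf")  # force at least one evaluation of every candidate
    return mean_f1 + c * math.sqrt(2.0 * math.log(n_total) / n_i)

def select_beam(stats: dict[str, tuple[float, int]], k: int = 4) -> list[str]:
    """Retain the k prompts with the highest UCB scores.

    `stats` maps each candidate prompt to (mean F1, evaluation count n_i).
    """
    n_total = sum(n for _, n in stats.values())
    ranked = sorted(
        stats,
        key=lambda p: ucb_score(stats[p][0], stats[p][1], n_total),
        reverse=True,
    )
    return ranked[:k]

# Example: the unevaluated candidate C ranks first, and lightly explored B
# beats A despite a lower mean, illustrating the exploration bonus.
beam = select_beam({"A": (0.61, 3), "B": (0.58, 1), "C": (0.0, 0)}, k=2)  # ["C", "B"]
```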

5. Formal Framework and Algorithm

MAPO’s overall workflow is summarized by the following pseudocode:

INPUTS: p_0, m, k, b, η, μ, c, T (max rounds) or δ (convergence threshold)
INITIALIZE: v_0 ← 0; beam B ← {p_0}; for each i in B: n_i ← 0, s̄_i ← 0
FOR t = 1…T:
    candidates C ← ∅
    FOR each p ∈ B:
        sample minibatch D_t
        compute correct outputs S_t ⊆ D_t
        generate positive gradients {∇p^{(j)}} via LLM(τ, p, S_t)
        form g_t by random pick from (or average over) {∇p^{(j)}}
        v_t ← μ·v_{t−1} + η·g_t
        FOR e = 1…b:
            p_c ← LLM(α, p, v_t)            # momentum-based semantic update
            s ← Evaluate(p_c)               # F1 on D_t or a held-out minibatch
            if p_c is new: n_c ← 0, s̄_c ← 0
            n_c ← n_c + 1
            s̄_c ← s̄_c + (s − s̄_c) / n_c    # running mean score
            C ← C ∪ {p_c}
    N ← ∑_{i∈C} n_i
    compute UCB_i = s̄_i + c·√(2 ln N / n_i) for all i ∈ C
    B ← top-k candidates from C ranked by UCB_i
    if improvement in max_i s̄_i < δ: break
OUTPUT: best prompt in B

Recommended defaults include minibatch size $m = 64$, beam width $k = 4$, expansions per beam $b = 3$, gradient learning rate $\eta = 0.1$–$0.2$, momentum $\mu = 0.9$, exploration constant $c = 1.0$, and LLM temperature set to $0$ for deterministic outputs.
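
These defaults can be collected into a small configuration object; the sketch below is an illustrative convenience (the field names are ours, not the paper's), with values mirroring the recommendations above.

```python
from dataclasses import dataclass

@dataclass
class MAPOConfig:
    # Recommended defaults from the text; names are illustrative, not canonical.
    minibatch_size: int = 64    # m
    beam_width: int = 4         # k
    expansions: int = 3         # b, continuations per beam member
    eta: float = 0.1            # semantic learning rate (0.1–0.2 suggested)
    mu: float = 0.9             # momentum coefficient
    c: float = 1.0              # UCB exploration constant (tunable 0.5–2.0)
    max_rounds: int = 6         # T
    temperature: float = 0.0    # deterministic LLM outputs
```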

6. Experimental Protocol and Benchmarking

MAPO has been evaluated on fake-news detection (the Liar dataset) and hate-speech detection (the Ethos dataset), using 200 held-out test samples per dataset. Each iteration generates two positive gradients, in contrast to ProTeGi's four negative gradients. Metrics are reported as the best F1 over beam candidates, averaged over three runs; total API calls and wall-clock time per run are tracked until convergence, defined as no further F1 improvement or reaching the maximum round count.

Empirical results are presented in the table below:

| Dataset | Method | Wall-Clock (s) | API Calls | Peak F1 |
|---------|---------|----------------|-----------|---------------|
| Liar | ProTeGi | 735 | 417 | 0.581 |
| Liar | MAPO | 285 (−61%) | 77 (−81%) | 0.612 (+5.3%) |
| Ethos | ProTeGi | 686 | 402 | 0.905 |
| Ethos | MAPO | 136 (−80%) | 62 (−85%) | 0.951 (+5.2%) |

MAPO reduces convergence time by $\approx 72.7\%$ and API calls by $\approx 81.5\%$ overall, while improving peak F1 by $\approx 5.37\%$. Convergence curves show fewer oscillations and plateau at higher F1 values.

7. Implementation Guidelines and Practical Considerations

MAPO is designed for black-box LLM APIs and can be implemented with the following recommendations:

  • Hyperparameters: $\eta = 0.1$–$0.2$, $\mu = 0.9$, $k = 4$, $b = 3$, $T = 6$, $c = 1.0$ (tunable $0.5$–$2.0$), minibatch size $64$.
  • LLM outputs: Use temperature $0$ for deterministic generations.
  • Variance: Average results over at least three runs to account for LLM variability.
  • Reproducibility: Fix the model version (e.g., the October 2024 GPT-3.5-turbo snapshot) and random seeds for batching/sampling.
  • Efficiency: Cache evaluations of identical prompts to avoid redundant API calls; see the sketch after this list.
  • Early Stopping: Monitor per-round F1 gains; convergence often occurs by round 4.
  • Resource Constraints: Reduce beam width or gradients per round to trade efficacy for computational savings.
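
The caching recommendation above amounts to standard memoisation. A minimal sketch, where `evaluate_f1` is a hypothetical task-specific scorer keyed by a minibatch identifier:

```python
from functools import lru_cache

def evaluate_f1(prompt: str, minibatch_id: int) -> float:
    """Hypothetical scorer: run `prompt` on the identified minibatch and return F1."""
    raise NotImplementedError

@lru_cache(maxsize=None)
def cached_score(prompt: str, minibatch_id: int) -> float:
    # Each unique (prompt, minibatch) pair costs exactly one evaluation round-trip;
    # re-selected beam members are subsequently served from the cache.
    return evaluate_f1(prompt, minibatch_id)
```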

MAPO's modularity thus allows adaptation to a range of budget and deployment scenarios, provided the recommended practices are followed.

By integrating positive natural language gradients, momentum-based semantic steps, beam search candidate generation, and UCB-based bandit selection, MAPO constructs a robust and scalable framework for automated prompt engineering, surpassing previous methods in stability, efficiency, and effectiveness for optimizing prompts in LLM-driven applications (Cui et al., 25 Oct 2024).
