Textual SGD with Momentum (TSGD-M)

Updated 24 January 2026

TSGD-M is a prompt optimization framework that integrates momentum-based sampling with textual gradient descent to refine natural language prompts for LLMs.
The method aggregates historical prompts through an exponential moving average to reduce variance and improve prediction accuracy across diverse NLP benchmarks.
Empirical evaluations show that TSGD-M increases test accuracy by up to 5 percentage points and reduces variance by as much as 30%, particularly benefiting smaller models.

Textual Stochastic Gradient Descent with Momentum (TSGD-M) is a prompt optimization framework designed for LLMs, extending the Textual Gradient Descent (TGD) method by integrating sampling-based momentum mechanisms. TSGD-M facilitates scalable in-context learning via token-level reweighting of prompt generation, providing enhanced stability and performance on diverse NLP tasks, particularly as the volume of training data and the complexity of downstream problems grow (Ding et al., 31 May 2025).

1. Formalization and Problem Setup

The prompt optimization problem is formulated by treating a natural language prompt $p$ (the meta-instruction to the LLM) as the parameter to be optimized. Given a labeled training set $D = \{(x_i, y_i)\}_{i=1}^N$, each input $x_i$ is concatenated with the current prompt $p$ and processed by the LLM to produce a prediction $\hat{y}$. Task performance is measured by a metric $\mathrm{Perf}(\hat{y}, y)$, commonly accuracy. The objective, with $V^*$ denoting the set of finite token sequences over the vocabulary $V$, is:

$p^* = \arg\max_{p \in V^*} \mathbb{E}_{(x, y) \sim D}[\mathrm{Perf}(\mathrm{LLM}([p, x]), y)]$
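In practice the expectation is estimated on a finite sample. The following Python sketch shows one way to compute that estimate for a single prompt. The names `empirical_perf` and `toy_llm`, the use of exact-match accuracy as $\mathrm{Perf}$, and the newline used to join $p$ and $x$ are all illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, Sequence, Tuple


def empirical_perf(
    llm: Callable[[str], str],
    prompt: str,
    data: Sequence[Tuple[str, str]],
) -> float:
    """Mean exact-match accuracy of `prompt` over labeled (x, y) pairs."""
    if not data:
        raise ValueError("data must be non-empty")
    correct = 0
    for x, y in data:
        # [p, x]: the prompt is concatenated with the input (newline is an assumption).
        y_hat = llm(f"{prompt}\n{x}")
        correct += int(y_hat.strip() == y)
    return correct / len(data)


def toy_llm(text: str) -> str:
    """Hypothetical stand-in for a real LLM call, used only for illustration."""
    return "positive" if "good" in text else "negative"
```

Prompt optimization then amounts to searching for the string that maximizes this score, for instance by comparing `empirical_perf(toy_llm, p, data)` across candidate prompts `p`.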

Standard TGD employs minibatch-based iterative refinement: the LLM analyzes prediction errors over a subset $B_t$ to generate a textual gradient $g_t$, which is then used to update $p_t$. TSGD-M enhances this process by keeping all past prompts $\{p_0, \ldots, p_t\}$ in a momentum buffer and using this history to inform the sampling procedure for the next prompt generation.

2. Momentum Sampling: Algorithmic Mechanics

TSGD-M adapts classic momentum-based SGD to the textual domain, where explicit vector parameters and learning rates are absent. The method implements an exponential moving average over prompt sources using a decay/momentum parameter $\alpha \in (0,1)$.

At iteration $t$, the weight for each previous prompt $p_\tau$, $\tau \in \{0, \ldots, t\}$, is defined as:

$w_\tau = \frac{\alpha^{t-\tau}}{\sum_{i=0}^t \alpha^{t-i}}$
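As a quick numerical check of this formula, the weights can be computed directly. The function name `momentum_weights` is an illustrative choice rather than an identifier from the paper.

```python
from typing import List


def momentum_weights(t: int, alpha: float) -> List[float]:
    """Return [w_0, ..., w_t] with w_tau proportional to alpha ** (t - tau)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    raw = [alpha ** (t - tau) for tau in range(t + 1)]
    total = sum(raw)
    # Normalizing makes the weights a probability distribution over tau.
    return [r / total for r in raw]
```

For $t = 2$ and $\alpha = 0.5$, the unnormalized terms are $0.25, 0.5, 1$, so the weights are $1/7, 2/7, 4/7$: the most recent prompt $p_2$ receives the largest share.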

At each token position $i$ while generating a candidate prompt, an index $\tau \in \{0, \ldots, t\}$ is sampled according to $P(\tau) = w_\tau$, and the LLM’s next-token distribution is conditioned on $p_\tau$ (optionally using the associated feedback $g_\tau$):

$P(\mathrm{Token}_{i+1} \mid \mathrm{Token}_{1:i}, \{p_\tau\}_{\tau=0}^t) \propto \sum_{\tau=0}^t w_\tau\, P(\mathrm{Token}_{i+1} \mid \mathrm{Token}_{1:i}, p_\tau, [g_\tau])$

This approach is analogous to momentum in numerical SGD: because $w_\tau$ decays geometrically with the age $t - \tau$ of a prompt, recent prompts have the most influence while older ones still contribute. Concretely, the candidate prompt is synthesized token by token via stochastic sampling from the weighted mixture of historical prompts.
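The equivalence between "sample a source index, then sample a token from that source" and "sample from the weighted mixture" can be illustrated on toy distributions. In the sketch below, each source's next-token distribution is a plain dictionary standing in for the LLM's conditional distribution given $p_\tau$; the function names and the dictionary representation are assumptions made for illustration only.

```python
import random
from typing import Dict, List


def mixture_distribution(
    weights: List[float], per_source: List[Dict[str, float]]
) -> Dict[str, float]:
    """Explicit mixture: sum over tau of w_tau * P(token | source tau)."""
    mixed: Dict[str, float] = {}
    for w, dist in zip(weights, per_source):
        for token, prob in dist.items():
            mixed[token] = mixed.get(token, 0.0) + w * prob
    return mixed


def sample_next_token(
    weights: List[float], per_source: List[Dict[str, float]], rng: random.Random
) -> str:
    """Two-stage sampling: draw tau with probability w_tau, then a token from source tau."""
    tau = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    dist = per_source[tau]
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[tok] for tok in tokens], k=1)[0]
```

With weights `[0.25, 0.75]` and sources `{"a": 1.0}` and `{"a": 0.5, "b": 0.5}`, the mixture assigns probability $0.25 + 0.75 \cdot 0.5 = 0.625$ to `"a"` and $0.375$ to `"b"`, which is the distribution that repeated calls to `sample_next_token` draw from.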

3. Implementation and Pseudocode

A two-level update scheme is defined: the standard Update and the momentum-based Update-Mom. The method runs for $T$ iterations, each on a minibatch $B_t$; Update-Mom is invoked if use_mom is set, and Update otherwise.

High-Level TSGD-M Algorithm:

Input: LLM, initial prompt $p_0$, data $D$, batch size $m$, iterations $T$, coefficient $\alpha$, candidates $k$, max tokens $T_{max}$, scoring function $S(\cdot)$, refinement template $p_{refine}$.
For $t = 0$ to $T - 1$:
1. Draw minibatch $B_t$.
2. Compute outputs $\hat{y}$.
3. If use_mom: $p_{t+1} \leftarrow \mathrm{Update\text{-}Mom}(\cdots)$. Else: $p_{t+1} \leftarrow \mathrm{Update}(\cdots)$.
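The outer loop above can be sketched in Python as follows. This is a minimal skeleton under stated assumptions: `update` and `update_mom` are hypothetical callables supplied by the caller, and computing the outputs $\hat{y}$ (step 2) is assumed to happen inside them, since the arguments elided by $(\cdots)$ are not specified here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input x, label y)


def tsgd_m_loop(
    p0: str,
    data: Sequence[Example],
    m: int,
    T: int,
    use_mom: bool,
    update: Callable[[str, List[Example]], str],
    update_mom: Callable[[List[str], List[Example]], str],
    rng: random.Random,
) -> List[str]:
    """Run T iterations and return the prompt history [p_0, ..., p_T]."""
    history = [p0]
    for _ in range(T):
        # Step 1: draw a minibatch without replacement.
        batch = rng.sample(list(data), k=min(m, len(data)))
        # Steps 2-3: the update callables evaluate the batch and propose p_{t+1}.
        if use_mom:
            p_next = update_mom(history, batch)  # sees the whole prompt buffer
        else:
            p_next = update(history[-1], batch)  # sees only the current prompt
        history.append(p_next)
    return history
```

The only structural difference between the two branches is what the update sees: the momentum variant receives the full buffer, while the standard variant receives only the latest prompt.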

Subroutine Update-Mom:

Compute weights $w_\tau$ over prompts.
For $k$ candidates, generate prompts by sampling $\tau$ and performing LLM token generation conditioned on $p_{refine}\,\Vert\,p_\tau\,[\Vert\,g_\tau]\,\Vert\,z$.
Select the best candidate by $S(\cdot)$.

Empirically, generating in chunks (e.g., every 10 tokens) instead of strictly token-by-token preserves most momentum effects, a pragmatic adaptation for LLM APIs lacking fine-grained token control.
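A chunked variant of Update-Mom might look like the sketch below. It is an illustration under assumptions, not the paper's implementation: `generate_chunk` is a hypothetical callable that wraps the LLM API, takes the sampled source prompt and the partial candidate written so far, and returns the next chunk of text (or an empty string when the candidate is complete). Assembling the refinement template and optional feedback into the actual request is left to that callable.

```python
import random
from typing import Callable, List, Sequence


def update_mom_chunked(
    prompts: Sequence[str],
    weights: Sequence[float],
    generate_chunk: Callable[[str, str], str],
    score: Callable[[str], float],
    k: int,
    max_chunks: int,
    rng: random.Random,
) -> str:
    """Generate k candidates chunk by chunk and return the best-scoring one."""
    candidates: List[str] = []
    for _ in range(k):
        partial = ""
        for _ in range(max_chunks):
            # Resample the source prompt once per chunk instead of once per token.
            tau = rng.choices(range(len(prompts)), weights=list(weights), k=1)[0]
            chunk = generate_chunk(prompts[tau], partial)
            if not chunk:  # empty chunk signals that generation has finished
                break
            partial += chunk
        candidates.append(partial)
    return max(candidates, key=score)
```

Setting the chunk length to a single token recovers the strictly token-level procedure; larger chunks trade some mixing granularity for fewer API calls.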

4. Computational Complexity and Memory

Each iteration requires:

$m$ forward passes (one per batch example).
$k \times T_{max}$ single-token generations.
Overhead $O(t + T_{max} + k)$ for weight computation and sampling.

Memory grows linearly with $t$ to store the prompt buffer $\{p_\tau\}_{\tau=0}^t$; in practical setups, the buffer remains small ($t \lesssim 20$).
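To make these counts concrete, the sketch below tallies LLM calls per iteration from the two dominant terms above. The chunked count, $k \cdot \lceil T_{max} / c \rceil$ for chunk length $c$, is an assumption derived from the chunking remark in Section 3 rather than a figure reported by the paper, and the function name is illustrative.

```python
import math


def calls_per_iteration(m: int, k: int, t_max: int, chunk_size: int = 1) -> int:
    """Forward passes on the batch plus generation calls for k candidates."""
    if min(m, k, t_max, chunk_size) < 1:
        raise ValueError("all arguments must be positive integers")
    # chunk_size=1 corresponds to strictly token-by-token generation.
    return m + k * math.ceil(t_max / chunk_size)
```

With $m = 10$, $k = 20$, and $T_{max} = 100$, token-level generation gives $10 + 20 \cdot 100 = 2010$ calls per iteration, while chunks of 10 tokens give $10 + 20 \cdot 10 = 210$.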

5. Hyperparameter Regimes and Tuning

The core hyperparameters include:

Momentum coefficient $\alpha \in \{0.0,\,0.3,\,0.6,\,0.9\}$; $\alpha \approx 0.6$ balances variance reduction against responsiveness. Setting $\alpha = 0$ places all weight on the most recent prompt (taking $0^0 = 1$ in the weight formula), which corresponds to no momentum.
Batch size $m \in [3, 25]$; $10 \leq m \leq 20$ is preferred due to context limits.
Number of candidate prompts $k \approx 20$.
Max prompt length $T_{max} \approx 100$ tokens; longer for tasks such as GSM8K.
Early stopping: halt after 2 (conservative, $H_0$) or up to 5 (exploratory, $H_1$) iterations without improvement.
LLM temperature: 0.7 for $H_0$; 1.1 for $H_1$.

These settings are chosen to moderate the variance–diversity tradeoff and model-context constraints.
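One way to keep the two regimes side by side is a small configuration object. The sketch below is illustrative: the class and field names are hypothetical, the batch size of 10 is simply a value inside the preferred range, and $\alpha$ for $H_1$ is not stated above, so the sketch reuses 0.6 as a placeholder assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TSGDMConfig:
    """Hypothetical container for the hyperparameters listed above."""

    alpha: float = 0.6            # momentum coefficient
    batch_size: int = 10          # m, inside the preferred 10-20 range
    num_candidates: int = 20      # k
    max_prompt_tokens: int = 100  # T_max
    patience: int = 2             # iterations without improvement before stopping
    temperature: float = 0.7      # LLM sampling temperature

    def __post_init__(self) -> None:
        if not 0.0 <= self.alpha < 1.0:
            raise ValueError("alpha must lie in [0, 1)")
        if not 3 <= self.batch_size <= 25:
            raise ValueError("batch_size must lie in [3, 25]")
        if self.num_candidates < 1 or self.max_prompt_tokens < 1 or self.patience < 1:
            raise ValueError("counts must be positive")


# Conservative regime H_0 and exploratory regime H_1.
H0 = TSGDMConfig()
H1 = TSGDMConfig(patience=5, temperature=1.1)  # alpha=0.6 here is a placeholder
```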

6. Empirical Performance Across Tasks and Models

TSGD-M was validated across nine NLP benchmarks: BIG-Bench Hard (e.g., Hyperbaton, Navigate), natural language understanding (MPQA, Trec, Subj, Disaster, Airline, SST2), and math reasoning (GSM8K). Models included Llama3-8B, Mistral-7B, Deepseek-1.5B, and GPT-4/GPT-3.5.

Empirical outcomes demonstrate:

Under $H_0$ (conservative regime, $\alpha=0.6$), test accuracy improved by 1–4 pp (percentage points) over vanilla TSGD. For Llama3-8B (DLN1): Subj: 69.03→71.20; Hyperbaton: 83.07→85.53; GSM8K (dev): 76.53→79.80.
$H_1$ (higher temperature, more iterations) further enlarged gains, by 1–2 pp (notably for reasoning).
Smaller models (Deepseek-1.5B) responded more strongly, with lifts up to 5 pp; this suggests momentum sampling stabilizes updates in less expressive LMs.
Standard deviation analysis over 10 runs yielded variance reductions of up to 30% versus standard TSGD.

7. Theoretical Insights, Limitations, and Future Directions

Theoretical analysis (Appendix of Ding et al., 31 May 2025) uses a scalar mean-squared error (MSE) model to show that the exponential moving average in momentum sampling reduces variance: $\mathrm{Var}(Y_t) < \mathrm{Var}(X_t)$.
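The intuition behind this inequality can be seen in a simplified setting. The derivation below is an illustrative sketch, not the paper's proof: it assumes that $X_0, \ldots, X_t$ are independent scalars with a common variance $\sigma^2 > 0$, that $Y_t$ is their average under the weights $w_\tau$ from Section 2, and that $t \geq 1$.

```latex
\begin{align*}
Y_t &= \sum_{\tau=0}^{t} w_\tau X_\tau,
\qquad w_\tau = \frac{\alpha^{t-\tau}}{\sum_{i=0}^{t} \alpha^{t-i}},
\qquad \sum_{\tau=0}^{t} w_\tau = 1, \\
\mathrm{Var}(Y_t) &= \sum_{\tau=0}^{t} w_\tau^2 \, \mathrm{Var}(X_\tau)
 = \sigma^2 \sum_{\tau=0}^{t} w_\tau^2 \\
 &< \sigma^2 \sum_{\tau=0}^{t} w_\tau = \sigma^2 = \mathrm{Var}(X_t)
 \qquad \text{since } 0 < w_\tau < 1 \text{ implies } w_\tau^2 < w_\tau, \\
\sum_{\tau=0}^{t} w_\tau^2
 &= \frac{(1-\alpha)\,(1+\alpha^{t+1})}{(1+\alpha)\,(1-\alpha^{t+1})}
 \;\xrightarrow[t \to \infty]{}\; \frac{1-\alpha}{1+\alpha}.
\end{align*}
```

Under these assumptions, the limiting variance ratio at $\alpha = 0.6$ is $0.4 / 1.6 = 0.25$. Real textual updates are neither independent nor scalar, so this figure illustrates the mechanism only and is not a prediction of the empirical variance reductions reported in Section 6.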

Limitations include increased generation overhead per LM call and the need to store all historical prompts, though buffer size is typically restricted. Additionally, some APIs necessitate chunked rather than token-by-token generation.

Potential extensions involve integrating TSGD-M with two-stage prompt refinement schemes (e.g., analyze-refine), more efficient buffer management (e.g., sliding windows), adaptive momentum scheduling, and combining with synthetic data pipelines.

TSGD-M serves as a lightweight, modular augmentation to established prompt optimization techniques. By reweighting prompt sources during candidate synthesis, it improves accuracy, reduces stochasticity, and affords scalability across data scales and architectural variants (Ding et al., 31 May 2025).

References

Ding et al. (31 May 2025). Scaling Textual Gradients via Sampling-Based Momentum.
