
Variator Agent: Boosting LLM Pass@k

Updated 21 November 2025
  • Variator agent is a task-agnostic strategy that generates diverse yet semantically equivalent prompt variants to enhance Pass@k success rates in LLM evaluations.
  • It employs a formal probabilistic model that perturbs input success probabilities using a uniform spread, ensuring systematic diversity.
  • Empirical results in coding and cybersecurity challenges show a consistent performance gain over traditional repetition-based sampling methods.

The Variator agent is a task-agnostic orchestration strategy for LLMs that systematically exploits model inconsistency to enhance Pass@k performance. By generating $k$ equivalent variants of a challenge—each differing in superficial aspects but strictly preserving the input/output structure—the Variator agent increases the probability that at least one generated candidate will succeed, outperforming traditional repetition-based sampling approaches in multi-candidate evaluation settings such as automated coding and cybersecurity tasks (Dalal et al., 19 May 2025).

1. Motivation and Key Concept

LLMs exhibit notable “inconsistency,” where semantically equivalent prompts can elicit widely varying response success rates. Traditional LLM deployment paradigms view this as a reliability deficit and attempt to minimize such variability. In Pass@k settings, where one is permitted up to $k$ distinct submissions and judged on whether any response is correct, the optimal strategy changes. Instead of sampling $k$ independent solutions to the same prompt (“Repeater agent”), the Variator agent produces $k$ diverse prompt variants, generating and submitting one solution for each. These variants retain the I/O semantics but introduce diversity in aspects such as wording, problem narrative, variable names, and description order. This diversity systematically increases the model’s coverage of response space and leverages the nonlinear regime of the Pass@k success metric, converting inconsistency into a tangible benefit even when mean success probability across variants equals that of the original prompt.

2. Formal Model and Theoretical Guarantees

The probabilistic modeling of the Variator agent revolves around perturbing the original prompt’s success probability $p_0$. For each variant, the model posits:

P_v = [p_0 + W]_{0}^{1} \quad \text{with} \quad W \sim \mathrm{Uniform}([-w, w])

where $[\cdot]_{0}^{1}$ denotes clipping to $[0, 1]$, and $w$ is the “spread” reflecting the degree of variant-induced fluctuation.
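A minimal simulation sketch of this perturbation model in Python follows; the values of $p_0$ and $w$ are hypothetical, chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

def mean_variant_success(p0, w, n=1_000_000):
    # Monte Carlo estimate of p_v = E[P_v] under P_v = clip(p0 + W, 0, 1),
    # with W ~ Uniform([-w, w]).
    W = rng.uniform(-w, w, size=n)
    return float(np.clip(p0 + W, 0.0, 1.0).mean())

print(mean_variant_success(0.05, 0.2))   # ~0.078: clipping at 0 lifts p_v above p0
print(mean_variant_success(0.50, 0.2))   # ~0.500: no clipping, p_v equals p0
print(mean_variant_success(0.95, 0.2))   # ~0.922: clipping at 1 pulls p_v below p0

Clipping makes the perturbation asymmetric near the endpoints, which is what raises $p_v$ above $p_0$ for hard prompts and drives the guarantees below.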

Pass@k Computation

  • Repeater agent: Samples $k$ IID solutions for the original prompt. The success metric is:

\mathrm{Pass@k}_{\text{Repeater}} = 1 - (1 - p_0)^k

  • Variator agent: Samples one solution for each of $k$ independent variants, with respective success probabilities $p_j$:

\mathrm{Pass@k}_{\text{Variator}} = 1 - \prod_{j=1}^{k} (1 - p_j)

Averaging over the randomization, if $p_v = E[P_v]$, then:

E[\mathrm{Pass@k}_{\text{Variator}}] = 1 - (1 - p_v)^k
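These closed forms make the comparison easy to compute. A small Python sketch (the $p_0$ and $p_v$ values are hypothetical, chosen to illustrate the hard-prompt regime under spread $w = 0.2$):

def pass_at_k_repeater(p0, k):
    # 1 - (1 - p0)^k
    return 1.0 - (1.0 - p0) ** k

def expected_pass_at_k_variator(pv, k):
    # 1 - (1 - pv)^k, with pv = E[P_v]
    return 1.0 - (1.0 - pv) ** k

# Hypothetical hard prompt: p0 = 0.05; clipping lifts pv to ~0.078.
for k in (1, 5, 10, 20):
    print(k, round(pass_at_k_repeater(0.05, k), 3),
          round(expected_pass_at_k_variator(0.078, k), 3))
# k = 10 gives ~0.401 (Repeater) vs ~0.556 (Variator) under these assumed values.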

Theorem: Performance Bound

For variant generation under symmetric spread $w$, for all $p_0 \in [0, 1]$ and $k \geq 1$:

  • Performance guarantee:

\mathrm{Pass@k}_{\text{Variator}} \geq 1 - \left(1 - \tfrac{w}{4}\right)^k

  • Regret guarantee (relative to Repeater agent):

\mathrm{Pass@k}_{\text{Variator}} \geq \mathrm{Pass@k}_{\text{Repeater}} - \left(\tfrac{w}{4}\right)^k

These formal results are derived by explicit computation of $p_v$ as the mean of the clipped uniform perturbation, together with Jensen's inequality and minimization over $p_0$; the full derivation is presented in Appendix A of the referenced paper.
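The bounds can be sanity-checked numerically against the clipped-uniform model. A minimal sketch, under assumed values of $w$ and $k$ (not taken from the paper):

import numpy as np

rng = np.random.default_rng(1)

def pv_of(p0, w, n=1_000_000):
    # p_v = E[clip(p0 + W, 0, 1)] with W ~ Uniform([-w, w])
    return float(np.clip(p0 + rng.uniform(-w, w, n), 0.0, 1.0).mean())

w, k = 0.2, 10
for p0 in np.linspace(0.0, 1.0, 11):
    pv = pv_of(p0, w)
    variator = 1 - (1 - pv) ** k
    repeater = 1 - (1 - p0) ** k
    assert variator >= 1 - (1 - w / 4) ** k - 5e-3      # performance bound
    assert variator >= repeater - (w / 4) ** k - 5e-3   # regret bound

Both bounds are tight at the endpoints: at $p_0 = 0$ clipping alone gives $p_v = w/4$, and at $p_0 = 1$ it gives $p_v = 1 - w/4$.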

3. Variant Generation Algorithm

Variant generation for the Variator agent is governed by structured, task-agnostic prompting with rigorous equivalence constraints. The LLM solving the variants is also used to generate them, mitigating cross-model bias.

  • Coding tasks: The agent instructs the LLM to create alternate challenge descriptions with novel backstories, variable renaming, notation changes, and reordered exposition, strictly preserving I/O signature and examples. Prompts are wrapped in <challenge>…</challenge> and assigned distinct <title>…</title> labels.
  • Cybersecurity (CTF) tasks: Additional constraints require preservation of network interface, file names, protocol, and solution compatibility, with each variant featuring a unique thematic or design element.

Pseudocode

function GenerateVariants(original_prompt, k):
  variants = []
  used_titles = set()
  while len(variants) < k:
    prompt_j = build_variant_prompt(original_prompt, used_titles)
    response = LLM.generate(prompt_j)
    title_j, challenge_j = parse(response)
    if title_j not in used_titles:    # keep only variants with unique titles
      variants.append(challenge_j)
      used_titles.add(title_j)
    # on a duplicate title, loop again to retry with a fresh generation
  return variants
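The paper's exact generation prompt is not reproduced here; the following is a hypothetical build_variant_prompt for coding tasks, sketching how the constraints and <title>/<challenge> wrapping described above could be encoded (all wording is illustrative):

def build_variant_prompt(original_prompt, used_titles):
    # Hypothetical instruction template; the paper's actual wording differs.
    avoid = ", ".join(sorted(used_titles)) or "none"
    return (
        "Rewrite the programming challenge below with a new backstory, "
        "renamed variables, changed notation, and reordered exposition, "
        "while preserving the input/output signature and all examples exactly.\n"
        f"Do not reuse any of these titles: {avoid}.\n"
        "Answer with <title>...</title> followed by <challenge>...</challenge>.\n\n"
        f"<challenge>{original_prompt}</challenge>"
    )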

Hyperparameters (e.g., $k$) are selected to match the desired Pass@k regime, with temperature maximized ($=1$) for both prompt and solution generation to ensure the greatest possible diversity.

4. Implementation and Experimental Protocol

Experiments were conducted with frontier LLMs—Claude 3.7 Sonnet (using AWS Bedrock) and OpenAI o3-mini (via Azure/OpenAI API)—using coding and cybersecurity challenge datasets.

  • Prompt and solution sampling: Both employ temperature $=1$ for diversity, with model-specific configuration (e.g., Claude’s “extended thinking” mode with 4,000 thinking tokens and 10,000 total tokens; o3-mini “medium” reasoning effort).
  • Sampling protocol: For each challenge, 25 prompt variants were generated, each sampled for 6 candidate solutions (total 150 samples per challenge).
  • Datasets: Experiments utilized 60 competition-level APPS coding problems (each with $\geq 60$ unit tests). For private APPS data, a single challenge variant was designated “original,” and the rest served as variants, eliminating memorization bias.

Performance statistics were computed with closed-form formulas on empirically measured $p_0$ and $p_v$ (150 solutions for $p_0$, 6 per variant for $p_v$).
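A sketch of this plug-in computation from raw success counts (the data layout is assumed, not taken from the paper):

import numpy as np

def pass_at_k_from_counts(variant_successes, samples_per_variant, k):
    # variant_successes[j] = number of correct solutions among
    # samples_per_variant draws for variant j; plug the mean per-variant
    # success rate p_v into 1 - (1 - p_v)^k.
    p_hat = np.asarray(variant_successes) / samples_per_variant
    return 1 - (1 - p_hat.mean()) ** k

# Hypothetical challenge: 25 variants x 6 samples = 150 solutions, as in the protocol.
counts = np.random.default_rng(2).binomial(6, 0.3, size=25)
print(pass_at_k_from_counts(counts, 6, k=10))

The Repeater statistic is analogous, with $p_0$ estimated from the 150 solutions to the original prompt and plugged into $1 - (1 - p_0)^k$.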

Results Table (mean Pass@k over 60 problems):

Model               Agent     Pass@1  Pass@5  Pass@10  Pass@15  Pass@20
Claude 3.7 Sonnet   Repeater  29.8%   40.4%   44.1%    46.1%    47.5%
Claude 3.7 Sonnet   Variator  26.5%   40.3%   44.6%    46.7%    48.0%
OpenAI o3-mini      Repeater  57.1%   70.4%   73.6%    75.0%    75.7%
OpenAI o3-mini      Variator  55.3%   70.9%   74.5%    75.9%    76.6%

A consistent absolute gain ($>1\%$) for large $k$ is observed in both public and private datasets, with statistical significance confirmed by Monte Carlo testing ($p$-values $< 5 \times 10^{-4}$).

5. Empirical Analysis of Inconsistency and Domain Extension

Detailed variance analysis included 30 variants per challenge, each sampled 50 times. Significant prompt sensitivity persisted for both Claude 3.7 and o3-mini, across coding and cybersecurity (CTF) domains. Empirical histograms illustrated extensive spread in per-variant success rates, with $p < 10^{-4}$ for hypothesis tests against a null of no inconsistency.
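One way to formalize such a test: under the null of no inconsistency, every variant shares the pooled success rate, so all variant-to-variant spread must come from binomial sampling noise. A Monte Carlo sketch (the counts below are invented for illustration):

import numpy as np

def inconsistency_pvalue(successes, n_per_variant, n_sim=100_000, seed=3):
    # Compare the observed variance of per-variant success counts against
    # the variance produced by binomial noise alone at the pooled rate.
    successes = np.asarray(successes)
    rng = np.random.default_rng(seed)
    pooled = successes.mean() / n_per_variant
    observed = successes.var()
    null = rng.binomial(n_per_variant, pooled,
                        size=(n_sim, successes.size)).var(axis=1)
    return float((null >= observed).mean())

# Hypothetical: 30 variants, 50 samples each, with widely spread success counts.
print(inconsistency_pvalue([2, 45, 10, 38, 5, 30] * 5, 50))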

Two novel CTF challenges (one targeting online keystore enumeration, the other pickle-based remote code execution) were validated by human experts, verifying challenge equivalence (excluding the 6% of coding variants found to be non-equivalent). Despite advanced reasoning and chain-of-thought output, both models exhibited nontrivial prompt-induced fluctuation, indicating that even sophisticated LLM reasoning is not immune.

A plausible implication is that the Variator agent will remain relevant for future LLM generations and advanced reasoning architectures, as the underlying inconsistency effect appears robust to model design.

6. Generality, Limitations, and Future Directions

The theoretical framework for the Variator agent relies solely on symmetric perturbation assumptions and independence of variant success rates, without specific dependence on model internals. As long as LLMs exhibit variant-induced response fluctuations, the Pass@k advantage conferred by diversified prompt sampling will persist and may scale favorably with larger $k$.

Potential extensions include automated equivalence checking for prompt variants and investigation of training paradigms that enhance model robustness via exposure to variant-rich datasets. The persistent Pass@k enhancement observed in advanced models suggests that the technique is domain-general and “future-proof” for multi-candidate evaluation.

7. Summary and Significance

The Variator agent transforms inherently brittle LLM behavior into a strength for any setting governed by Pass@k metrics. By systematically generating $k$ diverse, equivalent prompt variants and sampling solutions accordingly, it raises the probability that at least one will successfully solve the underlying task. The approach is supported by formal theoretical guarantees, broad empirical validation across coding and cybersecurity, and demonstration of persistence in frontier reasoning models. These results position the Variator agent as an effective orchestration strategy for a range of LLM applications where multi-candidate success rates are critical (Dalal et al., 19 May 2025).
