Large Language Monkeys: Scaling & Safety

Updated 12 July 2025
  • Large Language Monkeys is a framework for analyzing large language model inference in which repeated, stochastic sampling is used to enhance task coverage.
  • The approach leverages temperature-based random sampling to amplify performance, with coverage improvements validated through empirical pass@k metrics.
  • It also investigates safety vulnerabilities, showing how repeated, low-level perturbations can bypass protections, prompting enhanced alignment strategies.

Large Language Monkeys refer to the capabilities and behaviors of LLMs when inference-time compute is scaled along a new axis: repeated, independent sampling of candidate solutions. This notion encompasses both performance amplification, achieved by running many attempts during inference rather than relying on a single pass, and the robustness or vulnerability of safety alignment mechanisms against repeated or randomly perturbed attempts. The term is used in the technical literature both as an analogy to random “typing monkeys” and as a framework for the study of inference-compute scaling laws, task coverage, and the dynamics of (multi)modal LLM safety.

1. Inference Compute Scaling and Repeated Sampling

Inference compute scaling departs from the traditional model-scaling paradigm, in which model size, dataset breadth, and training computation (FLOPs) are increased, by instead focusing on runtime computational expansion after the model is trained. In this regime, the model is allowed to generate $k$ independent solutions for a single prompt, rather than making a single, deterministic prediction. This process is typically implemented by repeating temperature-based stochastic sampling for each prompt, ensuring a diversity of outputs.

The foundational concept is that if an LLM assigns nonzero probability to a correct answer, then generating more samples probabilistically increases the likelihood of observing a correct solution among the candidates. In effect, even weaker or less capable models can be amplified through brute-force inference compute, compensating for modeling limitations by leveraging runtime exploration (Brown et al., 31 Jul 2024).
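
A minimal sketch of this repeated-sampling loop is given below. The `generate_candidate` and `verify` callables are hypothetical stand-ins for a temperature-sampled model call and a task-specific verifier (e.g., a unit-test harness); they are not APIs from the papers discussed here.

```python
from typing import Callable, List

def repeated_sampling(prompt: str,
                      generate_candidate: Callable[[str, float], str],
                      verify: Callable[[str], bool],
                      k: int = 100,
                      temperature: float = 0.8) -> dict:
    """Draw k independent, temperature-sampled solutions and verify each one.

    A prompt counts as "covered" if at least one of its k samples passes the
    verifier; averaging this indicator over a dataset gives the coverage c(k).
    """
    candidates: List[str] = [generate_candidate(prompt, temperature) for _ in range(k)]
    passed = [verify(c) for c in candidates]
    return {
        "num_samples": k,
        "num_correct": sum(passed),
        "solved": any(passed),  # the per-prompt coverage indicator
    }
```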

2. Scaling Laws and Performance Metrics

A central finding is that as the number of inference attempts $k$ increases, “coverage” (the fraction of problems solved by at least one sample) improves smoothly even as $k$ is scaled across several orders of magnitude. Empirical measurements demonstrate that coverage as a function of the number of samples, $c = c(k)$, often follows an exponentiated power law:

$$\log(c) \approx a \cdot k^{-b} \quad\Longrightarrow\quad c \approx \exp\!\big(a \cdot k^{-b}\big),$$

where $a$ and $b$ are fit parameters. This relationship suggests that performance on “pass@k” metrics enjoys smooth, log-linear improvement in domains with reliable verification, such as code synthesis or formal theorem proving (Brown et al., 31 Jul 2024).
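
Fitting this form to measured coverage is a routine nonlinear least-squares problem. The sketch below assumes coverage has already been measured at several sample budgets; the data arrays are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def coverage_model(k, a, b):
    # Exponentiated power law: c(k) = exp(a * k^(-b)); with a < 0, c -> 1 as k grows.
    return np.exp(a * np.power(k, -b))

# Illustrative measurements of coverage at increasing sample budgets.
k_vals = np.array([1, 4, 16, 64, 256, 1024], dtype=float)
coverage = np.array([0.14, 0.27, 0.42, 0.56, 0.68, 0.78])

(a_hat, b_hat), _ = curve_fit(coverage_model, k_vals, coverage, p0=(-2.0, 0.3))
print(f"fitted a = {a_hat:.3f}, b = {b_hat:.3f}")
print("extrapolated coverage at k = 10000:", coverage_model(1e4, a_hat, b_hat))
```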

The paper "How Do Large Language Monkeys Get Their Power (Laws)?" (Schaeffer et al., 24 Feb 2025) formalizes the underlying mathematics: for each individual problem, if the model’s probability of success in a single attempt is pp, the probability of at least one success in kk attempts is 1(1p)k1 - (1-p)^k, which decays exponentially in kk for fixed pp. However, in practice, when averaged across a dataset with highly variable per-problem pp, the aggregate error –log(passD@k)\text{–log}(pass_D@k) empirically decreases as a polynomial (akba\cdot k^{-b}), not exponentially, in kk. This apparent contradiction is resolved by recognizing that the distribution of per-problem pp values is heavy-tailed: a small number of tasks with very low pp come to dominate aggregate scaling.

3. Domain-Specific Applications and Verification

Domains such as program synthesis (e.g., unit-tested code generation) and formal mathematical proof assistants offer built-in, automatic verifiers. In these contexts, each independently sampled solution can be precisely and reliably judged for correctness. The result is that scaling up the number of inference attempts leads to a straightforward and significant increase in task coverage and practical success rates (Brown et al., 31 Jul 2024).

Table: Single vs. Multi-Sample Coverage on SWE-bench Lite

| Sample Budget (samples per issue) | DeepSeek-Coder-V2-Instruct Coverage | State-of-the-art Single-Sample Result |
|---|---|---|
| 1 | 15.9% | 43% |
| 250 | 56% | 43% |

In the SWE-bench Lite case, expanding to 250 samples per issue increases problem coverage from 15.9% (single sample) to 56%, surpassing the previous state-of-the-art achieved with a single sample from a more powerful model.

In contrast, domains lacking reliable verifiers (e.g., grade-school math or complex word problems) present a challenge. Although the raw coverage can approach high values with enough samples, prevailing answer selection strategies (majority voting, reward model scoring) plateau well below the upper bound defined by raw coverage. In these situations, rare but correct solutions are easily drowned out in the sample pool, highlighting a need for improved verification or selection mechanisms.
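
The standard selection baseline in these unverified settings is majority voting over final answers. A minimal sketch is shown below; `extract_answer` is a hypothetical, domain-specific helper (e.g., parsing the final number from a chain-of-thought solution) rather than an API from the papers.

```python
from collections import Counter
from typing import Callable, List, Optional

def majority_vote(samples: List[str],
                  extract_answer: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the most common extracted answer among k sampled solutions.

    This illustrates the failure mode described above: if only a few of the
    k samples are correct, the correct answer rarely wins the vote, so final
    accuracy plateaus well below raw coverage.
    """
    answers = [extract_answer(s) for s in samples]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```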

4. Theoretical Insights: Distributional Perspective and Power-Law Scaling

A core theoretical contribution (Schaeffer et al., 24 Feb 2025) is the reconciliation between per-problem exponential scaling and aggregate power-law scaling. For a single problem, the multi-attempt success rate grows as $1 - (1-p)^k$, converging to 1 exponentially fast in $k$; yet, averaged across a set of problems where the per-instance $p$ is drawn from a heavy-tailed distribution, the negative logarithm of the average success rate obeys a power law:

$$-\log(\mathrm{pass}_D@k) \sim C \cdot \Gamma(b) \cdot k^{-b},$$

where $C$ and $b$ are determined by the lower-tail behavior of $f(p)$, the distribution of single-attempt success rates. This framework explains why aggregate scaling is not as rapid as naive per-problem scaling would suggest: a small subset of exceptionally difficult problems (very low $p$) dominates the overall error as $k$ increases. When $f(p)$ lacks a sufficiently heavy lower tail, as sometimes occurs in curated evaluation sets, the aggregate scaling can deviate from a strict power law.
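
One way to see where this form comes from is a short calculation (a sketch, assuming $f(p) \approx C \cdot p^{b-1}$ as $p \to 0$, using $(1-p)^k \approx e^{-kp}$ for the small $p$ that dominate at large $k$, and taking $\mathrm{pass}_D@k$ close to 1 so that $-\log x \approx 1 - x$):

$$-\log(\mathrm{pass}_D@k) \approx 1 - \mathrm{pass}_D@k = \mathbb{E}_{p \sim f}\!\left[(1-p)^k\right] \approx \int_0^{\infty} C \, p^{\,b-1} e^{-kp} \, dp = C \cdot \Gamma(b) \cdot k^{-b}.$$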

A practical consequence is that the power-law exponent $b$ can be efficiently estimated by fitting a parametric distribution to the observed single-attempt pass rates, with order-of-magnitude reductions in compute or data requirements over naive log-log linear fits.
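
A minimal sketch of such an estimator is given below, assuming, as one plausible parametric choice (not necessarily the one used in the paper), that single-attempt success rates follow a Beta distribution; its first shape parameter controls the lower tail and therefore plays the role of the exponent $b$.

```python
import numpy as np
from scipy import stats

def estimate_power_law_exponent(pass_rates: np.ndarray) -> float:
    """Estimate b from per-problem single-attempt pass rates.

    If f(p) ~ C * p^(b-1) near p = 0, then -log(pass_D@k) scales as k^(-b).
    A Beta(alpha, beta) density behaves as p^(alpha-1) near zero, so the
    fitted alpha serves as the estimate of b.
    """
    alpha, _, _, _ = stats.beta.fit(pass_rates, floc=0.0, fscale=1.0)
    return alpha

# Illustrative usage with synthetic per-problem pass rates.
rng = np.random.default_rng(1)
pass_rates = rng.beta(0.3, 3.0, size=2_000)
print("estimated b:", round(estimate_power_law_exponent(pass_rates), 3))
```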

5. Safety Alignment and Stochastic Monkeys

The resilience of LLM safety alignment to bypass attempts is examined under the “stochastic monkey” paradigm, in which attackers (low-resource, unsophisticated users) make repeated or randomly perturbed attempts rather than constructing sophisticated adversarial prompts (Vega et al., 5 Nov 2024). The paper formalizes the attack as follows: for each harmful input $x$, $k = 25$ random augmentations are generated, and the attacker succeeds if at least one attempt bypasses safety filters according to a safety judge.

Two primary random augmentation strategies are investigated:

  • String Insertion Augmentations: Inserting random character sequences at various prompt locations.
  • Character-Level Augmentations: Editing, inserting, or deleting individual characters in the prompt.

It is observed that character-level changes can improve jailbreak success for harmful content by 11–21% in numerous safety-aligned models, often by subtly altering tokenization in ways not addressed during safety alignment training. The effectiveness of repeated, random perturbation as a bypass highlights an axis of vulnerability: models may be brittle to “stochastic monkeys” who simply try many random variations, not just sophisticated adversaries.
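
The two augmentation families admit a very compact implementation. The sketch below is a hedged illustration for robustness testing; the edit rate, insertion length, and alphabet are assumptions rather than the paper's exact configuration, and scoring attempts against a safety judge is out of scope here.

```python
import random
import string

def char_level_augment(prompt: str, edit_rate: float = 0.05) -> str:
    """Randomly edit, insert, or delete individual characters in the prompt."""
    out = []
    for ch in prompt:
        if random.random() < edit_rate:
            op = random.choice(["edit", "insert", "delete"])
            if op == "edit":
                out.append(random.choice(string.ascii_letters))
            elif op == "insert":
                out.extend([ch, random.choice(string.ascii_letters)])
            # "delete": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)

def string_insertion_augment(prompt: str, n_chars: int = 8) -> str:
    """Insert a short random character sequence at a random position."""
    pos = random.randrange(len(prompt) + 1)
    filler = "".join(random.choices(string.ascii_letters + string.digits, k=n_chars))
    return prompt[:pos] + filler + prompt[pos:]

# k independent augmented attempts of a single input, mirroring the k = 25 setting above.
attempts = [char_level_augment("example prompt used for robustness testing") for _ in range(25)]
```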

Defensive measures (circuit breakers, adversarial training) can mitigate these attacks, but are sometimes circumvented when augmentations are less intense. The relative impact of defensive strategies, model size, quantization, and decoding temperature is ranked, with fine-tuning-based defenses exerting the strongest influence, followed by model size, quantization, and then decoding configuration.

6. Practical Implications and Compute-Performance Tradeoffs

Applying inference compute scaling can yield cost-effective gains: for a fixed inference FLOPs budget, running many samples from a less powerful model may outperform a handful of generations from a more advanced, expensive model (Brown et al., 31 Jul 2024). This result is particularly pronounced for tasks with automatic verifiers—where success rates rise dramatically with more samples—even as baseline single-sample accuracy remains low.
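
As a rough, back-of-the-envelope illustration of this tradeoff (the 2N-FLOPs-per-token decoding approximation and the model sizes below are assumptions for the sketch, not figures from the paper):

```python
def decode_flops(n_params: float, tokens: int, samples: int = 1) -> float:
    # Rough decoding cost: ~2 * N FLOPs per generated token for an N-parameter model.
    return 2.0 * n_params * tokens * samples

# Budget: one 1,000-token sample from a 70B-parameter model.
budget = decode_flops(n_params=70e9, tokens=1_000)

# How many 1,000-token samples from a 7B-parameter model fit in the same budget?
k_equiv = budget / decode_flops(n_params=7e9, tokens=1_000)
print(f"One 70B sample costs roughly the same as {k_equiv:.0f} samples from a 7B model.")
```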

However, in tasks where no off-the-shelf verifiers exist, gains are limited by the answer-selection bottleneck. Majority voting and reward models fail to scale with sample size beyond a few hundred candidates due to the rarity and dispersion of correct solutions.

From the safety perspective, findings indicate that LLM guardrails must be robust against simple, brute-force random modifications, not solely sophisticated attacks (Vega et al., 5 Nov 2024). Developers may need to employ input pre-processing, robust typo correction, or further restrict model exposure modes to minimize the risk from stochastic, low-sophistication jailbreaks.

7. Future Directions

Key open problems and directions for further investigation include:

  • Developing robust selection/verification for tasks without automatic verifiers, to close the gap between raw coverage and answer accuracy.
  • Analyzing tokenization and noise vulnerabilities, given the potency of character-level perturbations in defeating safety alignment.
  • Refining scaling-law characterization through deeper study of the distributional properties of task difficulty and their effect on aggregate performance predictions.
  • Integrating hybrid defenses that combine typo correction, input regularization, and adversarially hardened alignment training.
  • Extending analyses to multimodal LLMs, where power-law scaling and stochastic sampling vulnerabilities may generalize to images and audio.

A plausible implication is that understanding and modeling the underlying task difficulty distribution—and equipping both verification and alignment apparatuses to handle rare, difficult, or adversarial cases—will become increasingly central to harnessing “Large Language Monkeys” effectively in both practical and safety-critical settings.
