Estimating near-verbatim extraction risk in language models with decoding-constrained beam search

Published 26 Mar 2026 in cs.CL and cs.LG | (2603.24917v1)

Abstract: Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the probability of generating a target suffix given a prefix under a decoding scheme -- addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.

Summary

  • The paper presents DCBS as a deterministic algorithm for estimating near-verbatim extraction risk in language models.
  • It integrates beam search with edit-distance constraints to efficiently capture risk beyond strict verbatim matches.
  • Experimental results reveal that near-verbatim risk estimates significantly exceed traditional verbatim estimates, indicating heightened privacy and copyright concerns.

Introduction and Motivation

The assessment of memorization and data extraction risks in LLMs has become an essential area of study, particularly given the increasing deployment of open-weight and production-grade models in domains where privacy and copyright matter. Prior approaches have focused primarily on verbatim extraction: determining whether a model can reproduce training data exactly, often under deterministic greedy decoding. However, these approaches can underestimate both the extent and the risk profile of memorization. Notably, near-verbatim extraction (where a few substitutions, insertions, or deletions separate the model output from the training target) can be just as problematic for privacy or copyright, yet is missed by standard methods.

This paper systematically addresses the measurement of near-verbatim extraction risk. It presents a family of algorithms, collectively termed decoding-constrained beam search (DCBS), that yield efficient and deterministic lower bounds on the probability with which a model, given a prefix, can generate a continuation within a bounded edit distance (e.g., Levenshtein or Hamming) from the original training suffix. The approach uses beam search with per-decoding-step constraints derived from the target decoding policy (e.g., top-$k$) and integrates edit-distance-based pruning to improve both efficiency and the tightness of risk estimates.

Formalization of Extraction Risk: Verbatim vs. Near-Verbatim

Standard verbatim extraction computes the probability $p_z$ of generating the exact target suffix $z$ under a deterministic or stochastic decoding policy $\phi$:

$$p_z \coloneqq \Pr_{\theta,\phi}(z \mid x)$$

where $x$ is the context prefix, $\theta$ denotes the LLM parameters, and $\phi$ is a next-token sampling strategy (e.g., greedy, top-$k$).
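
As a minimal sketch of this quantity, assuming top-$k$ decoding at temperature 1 and per-step logits already computed by feeding the prefix plus the preceding target tokens to the model, the verbatim probability is a product of renormalized per-step probabilities. The function names and the `step_logits` input are hypothetical and chosen only for illustration.

```python
import math

def top_k_step_prob(logits, token, k):
    """Probability of `token` after keeping the k largest logits and renormalizing."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    if token not in top:
        return 0.0  # top-k decoding can never emit a token outside the top k
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    denom = sum(math.exp(logits[i] - m) for i in top)
    return math.exp(logits[token] - m) / denom

def verbatim_prob(step_logits, suffix, k):
    """Product of per-step top-k probabilities along the target suffix.

    step_logits[t] is assumed to hold the logits at step t, conditioned on
    the prefix and on suffix[:t].
    """
    p = 1.0
    for logits, tok in zip(step_logits, suffix):
        p *= top_k_step_prob(logits, tok, k)
        if p == 0.0:
            break  # one impossible token makes the whole suffix impossible
    return p
```

A single target token outside the top $k$ at any step drives the verbatim probability to exactly zero, which is one way a sequence can look unextractable under a strictly verbatim criterion.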

This formulation does not capture the risk associated with high-probability variants of $z$ that are only a small number of substitutions, insertions, or deletions away from the target: the “near-verbatim” risk domain. To address this, the paper generalizes the risk metric to the total probability mass assigned to an $\varepsilon$-ball about $z$ for some token-level distance function $d$ (e.g., Levenshtein):

$$p_{z,\varepsilon} \coloneqq \Pr_{\theta,\phi}\big(B_\varepsilon(z) \mid x\big) = \sum_{z' \in B_\varepsilon(z)} \Pr_{\theta,\phi}(z' \mid x)$$

where $B_\varepsilon(z) = \{z' : d(z', z) \le \varepsilon\}$ is the set of continuations within edit distance $\varepsilon$ of $z$. This re-scopes extraction measurement to align with practical privacy and legal risk, since non-verbatim copies can carry similar liability.
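
To make the near-verbatim mass concrete, the sketch below computes token-level Levenshtein distance with the standard Wagner-Fischer dynamic program and sums the probabilities of candidate continuations that fall within a distance budget. The `candidates` list of (sequence, probability) pairs is a hypothetical input; the code illustrates the definition and is not the paper's estimation procedure.

```python
def levenshtein(a, b):
    """Token-level Levenshtein distance via the Wagner-Fischer dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def near_verbatim_mass(candidates, target, eps):
    """Sum the probabilities of distinct candidates within distance eps of target."""
    return sum(p for seq, p in candidates if levenshtein(seq, target) <= eps)

# Toy example with made-up probabilities.
target = (1, 2, 3)
candidates = [((1, 2, 3), 0.2), ((1, 9, 3), 0.1), ((7, 8, 9), 0.05)]
verbatim_only = near_verbatim_mass(candidates, target, eps=0)
within_one_edit = near_verbatim_mass(candidates, target, eps=1)
```

Setting `eps=0` recovers the verbatim case, so the near-verbatim mass can never be smaller than the verbatim probability.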

Computational Challenges and Probabilistic Extraction

A major obstacle is the combinatorial growth of the set of near-verbatim suffixes. Enumerative approaches are intractable even for moderate distance budgets and sequence lengths because the candidate space grows exponentially. Monte Carlo (MC) sampling is statistically unbiased but becomes computationally prohibitive: near-verbatim matches can be rare enough that reliable estimation requires on the order of 100,000 samples per sequence, especially for low-threshold risk detection.

Figure 1: Monte Carlo estimation of near-verbatim extraction probability demonstrates that DCBS yields reliable lower bounds orders of magnitude faster.

To mitigate this, the paper introduces an algorithmic framework that enables efficient, deterministic lower bound estimation: decoding-constrained beam search.
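
The sampling cost can be illustrated with standard binomial arithmetic (not the paper's own analysis): if a single sample lands in the near-verbatim set with probability $p$, the sketch below gives the number of i.i.d. samples needed to observe at least one hit with a chosen confidence, and the relative standard error of the resulting hit-rate estimate. The function names and example values are illustrative assumptions.

```python
import math

def samples_for_one_hit(p, confidence=0.95):
    """Smallest n such that P(at least one hit in n i.i.d. samples) >= confidence.

    Assumes 0 < p < 1 and 0 < confidence < 1.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

def relative_std_error(p, n):
    """Relative standard error of the Monte Carlo hit-rate estimate of p from n samples."""
    return math.sqrt((1.0 - p) / (n * p))

# Illustrative values only: a per-sample hit probability of 0.1%.
n_needed = samples_for_one_hit(1e-3)
rel_err = relative_std_error(1e-3, 100_000)
```

Since the relative error shrinks only as $1/\sqrt{np}$, small probabilities need sample counts that grow roughly like $1/p$ just to keep the error fixed, which is why a deterministic bound is attractive.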

Decoding-Constrained Beam Search (DCBS): Algorithmic Framework

The key observation is that high-probability continuations, especially those resulting from memorization, tend to be concentrated in a small region near the training target in the decoding tree. DCBS adapts beam search by restricting expansions at each step to the tokens permitted by the decoding scheme of interest (e.g., the top-$k$ tokens under top-$k$ decoding). The algorithm also applies distance-based viability pruning, retaining only candidate continuations that can still finish within the allowed edit distance of the target. Key properties include:

  • Deterministic Lower Bound: All final retained completions within the allowed edit distance contribute to a provable, deterministic lower bound on the near-verbatim probability mass. DCBS typically achieves bounds that capture much of the MC estimate at a small fraction of the sampling cost.
  • Incremental Pruning: For Hamming, a token mismatch counter per beam element suffices; for Levenshtein, a banded Wagner-Fischer dynamic program tracks the minimal attainable edit distance incrementally, enabling safe early elimination of hopeless candidates.
  • Scalability: With practical beam widths and standard top-$k$ values, DCBS runs at a computational cost comparable to ~20 MC samples per sequence, several orders of magnitude cheaper than unbiased sampling.

Figure 2: Example of probabilistic extraction for Llama 1 13B on The Great Gatsby under top-$k$ decoding; several near-verbatim continuations collectively account for substantially more risk than the verbatim path alone.
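
The following is a minimal sketch of the idea for the Hamming case only, under stated assumptions: top-$k$ decoding at temperature 1, continuations of the same length as the target, and a hypothetical `next_logits(tokens)` callable standing in for the model. It is not the paper's implementation, which would batch model calls and also handle Levenshtein distance.

```python
import heapq
import math

def top_k_probs(logits, k):
    """Renormalized probabilities over the k highest-logit tokens, as {token: prob}."""
    top = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])
    m = max(logits[i] for i in top)
    weights = {i: math.exp(logits[i] - m) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def dcbs_hamming(next_logits, target, k, eps, beam_width):
    """Lower bound on the top-k mass within Hamming distance eps of target.

    next_logits(tokens) -> list of logits is a hypothetical model interface
    that returns next-token logits given the continuation generated so far.
    """
    beams = [((), 0.0, 0)]  # (tokens so far, log-probability, mismatches so far)
    for target_tok in target:
        expanded = []
        for toks, logp, miss in beams:
            for tok, p in top_k_probs(next_logits(toks), k).items():
                new_miss = miss + (tok != target_tok)
                # Viability pruning: mismatches never decrease, so a
                # candidate over budget can never re-enter the ball.
                if new_miss <= eps:
                    expanded.append((toks + (tok,), logp + math.log(p), new_miss))
        beams = heapq.nlargest(beam_width, expanded, key=lambda b: b[1])
    # Surviving beams are distinct sequences inside the ball, so summing
    # their probabilities can only undercount the true mass.
    return sum(math.exp(logp) for _, logp, _ in beams)

# Toy usage: a context-free "model" over a 3-token vocabulary.
toy_model = lambda toks: [2.0, 1.0, 0.0]
bound = dcbs_hamming(toy_model, target=(0, 0, 1), k=2, eps=1, beam_width=4)
```

The result is a lower bound because a finite beam can drop sequences that lie inside the ball; widening the beam can only tighten it. Each step costs at most `beam_width` model calls in this sketch.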

Experimental Results

The paper conducts extensive experiments with open LLMs: OLMo 2 (7B, 13B, 32B), Llama 2 (7B, 13B, 70B), and Pythia families, across Wikipedia, Books3 books, and Enron emails. Key findings are as follows:

Extraction Rate Comparison

DCBS reveals that the proportion of training sequences extractable with nontrivial near-verbatim risk is substantially higher than what verbatim methods measure. For OLMo 2 32B on held-in Wikipedia, verbatim probabilistic extraction finds 1.42% of sequences extractable, while DCBS with near-verbatim matching finds 2.57%, nearly double.

Figure 3: Extraction rates increase with both model scale and edit-tolerance; DCBS reveals significantly more risk compared to verbatim or greedy checks.
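
An extraction rate of this kind can be computed as the fraction of sequences whose extraction probability meets a threshold. In the sketch below, the threshold and the per-sequence probabilities are made-up illustrative values, not numbers from the paper.

```python
def extraction_rate(probs, threshold):
    """Fraction of sequences whose extraction probability is at least threshold."""
    if not probs:
        return 0.0
    return sum(p >= threshold for p in probs) / len(probs)

# Hypothetical per-sequence probabilities for the same four sequences.
verbatim = [0.0, 0.002, 0.5, 0.0]
near_verbatim = [0.0015, 0.01, 0.6, 0.0]
threshold = 0.001  # assumed cutoff for counting a sequence as extractable

verbatim_rate = extraction_rate(verbatim, threshold)
near_verbatim_rate = extraction_rate(near_verbatim, threshold)
```

For reference, the reported move from 1.42% to 2.57% corresponds to a ratio of about 1.81.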

Per-Sequence Risk and Mass Gain

A large fraction of sequences have their extraction “unlocked” only in the near-verbatim regime, i.e., they have zero or sub-threshold verbatim risk but high near-verbatim probability. For Llama 2 70B, per-sequence risk can rise from zero under verbatim matching to a substantial probability under near-verbatim matching.

Figure 4: For Llama 2 on The Great Gatsby, many sequences have near-verbatim mass well above their verbatim mass; points above the diagonal show a substantial risk increase.

Empirical risk-gain distributions (CCDF) show that for Llama 2 70B, 12.7% of all Gatsby sequences exhibit a mass gain beyond the threshold the paper reports, a sharp increase unobservable to classic verbatim analysis.

Figure 5: The CCDF of per-sequence near-verbatim mass gain for Llama 2 confirms systematically increased risk with scale.
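
A CCDF of per-sequence mass gain can be sketched as follows. Here the gain is assumed to be the difference between near-verbatim and verbatim mass (the paper's exact definition may differ), and all numbers are made up for illustration.

```python
def mass_gains(verbatim, near_verbatim):
    """Per-sequence gain, assumed here to be near-verbatim mass minus verbatim mass."""
    return [n - v for v, n in zip(verbatim, near_verbatim)]

def ccdf(values, t):
    """Empirical complementary CDF: fraction of values that are at least t."""
    return sum(v >= t for v in values) / len(values)

# Hypothetical per-sequence masses for four sequences.
verbatim = [0.0, 0.1, 0.2, 0.0]
near_verbatim = [0.3, 0.1, 0.5, 0.05]

gains = mass_gains(verbatim, near_verbatim)
share_with_large_gain = ccdf(gains, 0.25)
```

Evaluating `ccdf` over a grid of thresholds traces out the kind of curve the figure describes.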

Risk Structure and Scaling

The verbatim share of total extraction mass decreases with model scale, especially on structured book text. As parameter count grows, models spread mass over a wider variety of near-matches, so newly extractable cases tend to be ones that only DCBS surfaces.

Moreover, inspecting extraction risk across distances reveals that mass can be distributed heterogeneously: for OLMo 2 on Wikipedia, the median verbatim share among extractable sequences is often under 5%, so near-verbatim variants account for most of the extraction mass.

Negative Controls

Experiments on data not seen during training (Wikipedia pages posted after the OLMo 2 cutoff, or recent books for Llama 2/3) confirm that DCBS does not surface false positives, establishing its conservativeness and reliability as a risk estimator.

Practical and Theoretical Implications

DCBS not only closes a methodological gap by aligning extraction models with real risk (as defined legally and in data privacy), but also offers a practical tool for model assessment and model development workflows. The insights about risk scaling and the structure of memorized output distribution indicate that larger models generalize extraction risk more diffusely, increasing both concern around privacy (private details stored in non-verbatim form) and copyright (substantially similar reproduction). Moreover, the algorithm's efficiency allows its deployment at scale for ongoing model auditing and potentially in the loop during training.

Future Directions

The results open several avenues:

  • Active Risk Auditing: Use DCBS as part of continuous privacy/copyright-risk assessment for in-training and deployed models.
  • Understanding Memorization Mechanisms: Further investigation into why and how near-verbatim risk structures emerge with model scaling and cross-dataset heterogeneity.
  • Extension to Other Decoding Policies: Adapting DCBS to non-top-$k$ decoding, such as nucleus sampling, and incorporating other metrics for “semantic” similarity.

Conclusion

This work precisely quantifies the substantial, previously underestimated near-verbatim extraction risk in LLMs, via an efficient, deterministic, and scalable algorithmic framework. DCBS reveals that classical verbatim approaches significantly underrepresent both the prevalence and the magnitude of memorization risk, particularly as models and datasets scale. The methods introduced provide both a robust practical risk diagnostic and a theoretical lens on the evolution of memorization behavior in generative models.

Refer to (2603.24917) for further implementation and experimental details.
