
Min-K% Probability Analysis

Updated 28 January 2026
  • Min-K% Probability Analysis is a framework that quantifies the lowest k% outcomes using order statistics to assess risk and detect anomalies.
  • It utilizes tight concentration inequalities and quantile estimation techniques to provide high-confidence performance guarantees.
  • The framework applies to diverse settings including language models, discrete distributions, and robust estimation for both model behavior and risk control.

Min-K% Probability Analysis is a statistical framework focused on the analysis and high-confidence quantile estimation of the minimum of a collection of random variables, typically using order statistics such as the minimum of i.i.d. samples or the minima of token-level log-probabilities in autoregressive models. Its central concept is to quantify the most extreme (lowest) fraction—namely, the worst-performing k%—of a sequence of probabilistic outcomes for robust risk and anomaly assessment, pre-training data detection, and reliable high-probability guarantees.

1. Mathematical Formulation and Core Principles

The Min-K% criterion evaluates the behavior of the lowest k% order statistics. Given a sequence $\{Z_1, \dots, Z_T\}$ (e.g., token-level log-likelihoods), the Min-K% score is the mean of the lowest $m$ values, where $m = \lceil kT/100 \rceil$ and $Z_{(1)} \le \dots \le Z_{(T)}$ denote the sorted order statistics. For log-probabilities in LLMs:

\text{Min-K\%}_k(x) = \frac{1}{m} \sum_{j=1}^{m} \ell_{(j)}

where $\ell_{(j)}$ are the sorted log-probabilities. This technique generalizes to other settings, including coverage of quantiles and the minimum across i.i.d. random variables. In probabilistic estimation, "Min-K% Probability Analysis" refers to the explicit control and quantification of the smallest (worst) outcomes, and their deviation probabilities, via information-theoretic bounds.
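As a concrete sketch of the definition above (the helper name `min_k_score` is illustrative, not from any cited paper; the input is assumed to be a plain list of per-token log-likelihoods):

```python
import math

def min_k_score(log_probs, k=20.0):
    """Mean of the lowest k% token log-probabilities (illustrative helper).

    log_probs: per-token log-likelihoods log P(x_t | x_<t) for one sequence.
    k: percentage of tokens kept from the bottom of the sorted list.
    """
    T = len(log_probs)
    m = max(1, math.ceil(k / 100.0 * T))   # number of worst tokens kept
    worst = sorted(log_probs)[:m]          # order statistics l_(1) <= ... <= l_(m)
    return sum(worst) / m

# A sequence containing one very unlikely token scores far lower than a
# uniformly well-predicted one, which is what the detector exploits.
smooth   = [-0.1] * 10
outlier  = [-0.1] * 9 + [-8.0]
print(min_k_score(smooth, k=20), min_k_score(outlier, k=20))
```

With k = 20 and T = 10 tokens, m = 2, so only the two worst tokens enter the average.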

2. Concentration Inequalities for the Minimum

Tight finite-sample bounds on the minimum of i.i.d. random variables, particularly binomial or general discrete/continuous distributions, underpin rigorous Min-K% analysis. Let $X_1, \dots, X_m$ be independent $\mathrm{Bin}(n,p)$ random variables, and let $M = \min_i X_i$. Explicit nonasymptotic high-probability bounds can be derived as follows (Zhu et al., 25 Feb 2025):

  • Lower bound: For threshold $k$, with $t = k/n$,

\Pr[M \geq k] \geq 1 - m\,e^{-n\,D(t\|p)}

  • Upper bound:

\Pr[M < k] \leq 1 - \left(1 - e^{-n\,D(t\|p) - C_n}\right)^m

where $D(t\|p)$ is the binary KL divergence and $C_n = 4\log(n+1) + \left[\log\frac{p}{1-p}\right]_+$.

For quantile-based selection, identify the $k$ corresponding to the $(1-q)$-quantile of $\mathrm{Bin}(n,p)$ via $D(k/n \,\|\, p) \approx \frac{\log(1/q)}{n}$. This framework generalizes: for any collection of i.i.d. variables, the Min-K% behavior is tightly controlled by Sanov-type and Chernoff-style exponential bounds in terms of KL divergence.
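The lower bound can be checked numerically. A minimal sketch with illustrative helper names and a naive Bernoulli-sum binomial sampler (the bound applies to the lower tail, i.e., $k/n < p$):

```python
import math
import random

def kl_bern(t, p):
    """Binary KL divergence D(t || p) in nats."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(t, p) + term(1.0 - t, 1.0 - p)

def min_lower_bound(n, p, k, m):
    """Pr[min_i X_i >= k] >= 1 - m*exp(-n*D(k/n || p)), valid for k/n < p."""
    return 1.0 - m * math.exp(-n * kl_bern(k / n, p))

def binom(n, p):
    """Naive Bin(n, p) sampler as a sum of Bernoulli draws."""
    return sum(random.random() < p for _ in range(n))

# Monte Carlo check: empirical Pr[M >= k] should dominate the bound.
random.seed(0)
n, p, k, m, trials = 100, 0.5, 35, 5, 2000
hits = sum(min(binom(n, p) for _ in range(m)) >= k for _ in range(trials))
print(hits / trials, ">=", min_lower_bound(n, p, k, m))
```

Here $D(0.35\|0.5) \approx 0.0457$, so the bound $1 - 5e^{-4.57} \approx 0.948$ is comfortably met by the empirical frequency.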

3. Min-K% in Statistical Estimation: High-Confidence Quantiles and Missing Mass

In distribution estimation, Min-K% analysis allows for robust large-deviation quantile control. For the missing mass $U_n$ (the total probability of symbols not observed in a sample from a discrete distribution), explicit quantile bounds follow from variance-sensitive large-deviation inequalities (Berend et al., 2012):

m_k = \mathbb{E}\, U_n + \sqrt{\frac{1}{n} \ln\frac{1}{\delta}}

ensuring $P(U_n \geq m_k) \leq \delta$, thus controlling the upper Min-K% quantiles. This principle extends to other order statistics, quantiles, and risk measures for high-probability guarantees.
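A minimal sketch of the quantile bound, assuming a fully known discrete distribution so that $\mathbb{E}\,U_n$ can be computed exactly (helper names are illustrative):

```python
import math
import random

def expected_missing_mass(probs, n):
    """E[U_n] = sum_x p_x (1 - p_x)^n for a known discrete distribution."""
    return sum(p * (1.0 - p) ** n for p in probs)

def missing_mass_quantile_bound(probs, n, delta):
    """m_k = E[U_n] + sqrt(ln(1/delta)/n), so that P(U_n >= m_k) <= delta."""
    return expected_missing_mass(probs, n) + math.sqrt(math.log(1.0 / delta) / n)

random.seed(1)
K, n, delta = 50, 200, 0.05
probs = [1.0 / K] * K                      # uniform law over K symbols
bound = missing_mass_quantile_bound(probs, n, delta)

def sample_missing_mass():
    seen = {random.randrange(K) for _ in range(n)}   # n i.i.d. uniform draws
    return sum(probs[x] for x in range(K) if x not in seen)

# Empirical exceedance frequency of the bound should stay below delta.
exceed = sum(sample_missing_mass() >= bound for _ in range(2000)) / 2000
print(exceed, "<=", delta)
```

For this uniform example $\mathbb{E}\,U_n \approx 0.018$ while the deviation term adds about $0.12$, so exceedances are very rare, as the inequality requires.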

4. Min-K% for Model Behavior: Memorization and Outlier Detection

The Min-K% statistic has been adopted for pre-training data detection in LLMs. For each position $t$, using the model's conditional distribution $P(\cdot \mid x_{<t})$, Min-K% selects the lowest $k\%$ of log-likelihoods across a sequence. Empirically, these capture the model's weakest predictions, which are highly indicative of whether a sample was seen during training (Zhang et al., 2024).

  • Min-K%++ Extension: Advances beyond averaging the lowest log-probabilities by normalizing each token log-probability by the mean and standard deviation of log-probabilities under the conditional next-token distribution, i.e.,

s_t = \frac{\log P(x_t \mid x_{<t}) - \mu_{x_{<t}}}{\sigma_{x_{<t}}}

This discrete curvature score sharply identifies local maxima ("memorized" points). The Min-K%++ method shows substantial AUROC improvements in training- vs. non-training data detection benchmarks.
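A sketch of the normalized score, assuming access to the full next-token distributions (as one has with an open-weight model); the helper name and interface are illustrative, not a library API:

```python
import math

def min_k_pp_scores(dists, tokens):
    """Per-token Min-K%++ scores from full next-token distributions.

    dists[t] is the probability vector P(. | x_<t); tokens[t] is the
    observed token id x_t.
    """
    scores = []
    for dist, tok in zip(dists, tokens):
        logps = [math.log(p) for p in dist]
        mu = sum(p * lp for p, lp in zip(dist, logps))              # E_z[log P(z|ctx)]
        var = sum(p * (lp - mu) ** 2 for p, lp in zip(dist, logps)) # Var_z[log P(z|ctx)]
        scores.append((logps[tok] - mu) / math.sqrt(var) if var > 0 else 0.0)
    return scores

# A token at the mode of its distribution scores positive; a tail token
# scores negative, flagging a locally "surprising" prediction.
dist = [0.7, 0.2, 0.1]
hi, lo = min_k_pp_scores([dist, dist], [0, 2])
print(hi > 0 > lo)
```

The final Min-K%++ statistic then averages the lowest k% of these $s_t$ values, exactly as in plain Min-K%.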

Method      Statistic                          Application
Min-K%      Mean of the lowest-k% log-probs    Pre-training data detection
Min-K%++    Curvature-normalized minima        Memorization/mode detection

These Min-K% analytics are robust to context variability and reveal sharp boundaries between memorized and non-memorized sequences.

5. Universal Lower Bounds and Sharpness in Continuous Laws

In continuous probability contexts, universal Min-K% lower bounds quantify how likely it is that, conditional on an extreme event (sum exceeding a threshold), the minimum of a set takes small values. For i.i.d. $(X, Y)$ with continuous density $f$ and median $\operatorname{med}(X)$,

\sup_{z > 0} \mathbb{P}\bigl(\min(X,Y) \leq z \mid X+Y \geq 2z\bigr) \geq \frac{1}{24 + 8\log_2\left(\operatorname{med}(X)\,\|f\|_{L^\infty}\right)}

This logarithmic dependence in the denominator is unavoidable: matching constructions show the bound is optimal up to constants (Steinerberger, 2018).
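A Monte Carlo sanity check of the bound for the $\mathrm{Exp}(1)$ law, where $\operatorname{med}(X) = \ln 2$ and $\|f\|_{L^\infty} = 1$ (the grid of $z$ values and the sample size are arbitrary choices for this sketch):

```python
import math
import random

def bound_constant(median_x, f_sup):
    """RHS of the universal bound: 1 / (24 + 8*log2(med(X)*||f||_inf))."""
    return 1.0 / (24.0 + 8.0 * math.log2(median_x * f_sup))

random.seed(2)
N = 50_000
xs = [random.expovariate(1.0) for _ in range(N)]
ys = [random.expovariate(1.0) for _ in range(N)]

def cond_prob(z):
    """Monte Carlo estimate of P(min(X,Y) <= z | X + Y >= 2z)."""
    num = den = 0
    for x, y in zip(xs, ys):
        if x + y >= 2 * z:
            den += 1
            num += min(x, y) <= z
    return num / den if den else 0.0

# Take the best z over a coarse grid as a stand-in for the supremum.
best = max(cond_prob(z) for z in (0.25, 0.5, 1.0, 2.0))
lb = bound_constant(math.log(2.0), 1.0)   # Exp(1): med = ln 2, ||f||_inf = 1
print(best, ">=", lb)
```

For $\mathrm{Exp}(1)$ the right-hand side is roughly $0.05$, and the conditional probability at moderate $z$ already exceeds it by an order of magnitude, consistent with the bound being a universal worst-case guarantee rather than a tight estimate for this law.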

6. High-Probability Minimax Rates in Discrete Estimation

In large-deviation regimes, Min-K% quantile control is essential in establishing high-probability minimax lower bounds. For estimating a discrete distribution $p$ of support size $K$ from $n$ samples, the minimax lower bound for the KL risk at confidence level $\delta$ is

\Omega\left(\frac{\max\{K,\ \ln K \ln(1/\delta)\}}{n}\right)

No estimator can beat this rate for the Min-K% quantile of the KL loss (Hoeven et al., 23 Jul 2025). Efficient algorithms such as OTB (Online-to-Batch) with suffix averaging achieve matching upper bounds up to additional $\ln\ln K$ factors. The penalty in controlling rare Min-K% probability events accounts for the additional sample complexity compared to expected-risk settings.
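To see when the high-confidence term dominates, the rate can be evaluated directly (a toy illustration with arbitrarily chosen $K$, $n$, $\delta$):

```python
import math

def kl_quantile_rate(K, n, delta):
    """Order of the high-probability minimax KL rate: max(K, ln K * ln(1/delta)) / n."""
    return max(K, math.log(K) * math.log(1.0 / delta)) / n

# At moderate confidence the expected-risk term K/n dominates; at very
# high confidence the ln K * ln(1/delta) term takes over.
K, n = 100, 10_000
for delta in (1e-1, 1e-20):
    print(delta, kl_quantile_rate(K, n, delta))
```

With $K = 100$, the crossover happens once $\ln K \ln(1/\delta) > K$, i.e., once $\delta$ is pushed below roughly $e^{-K/\ln K}$.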

Setting                          Min-K% Rate                                 Reference
Expected KL risk                 \Theta(K/n)                                 (Hoeven et al., 23 Jul 2025)
High-probability KL quantile     \max\{K,\ \ln K \ln(1/\delta)\}/n           (Hoeven et al., 23 Jul 2025)

7. Connections and Open Problems

Min-K% probability analysis bridges order statistics, large deviations, concentration of measure, and robust estimation. Its methodology elucidates phenomena in both discrete (model quantiles, missing mass, high-confidence tail behavior) and continuous (scale-invariant conditional minima) probabilistic systems. Open questions remain regarding optimal Min-K% bounds for sums of independent random variables, weightings in scale-dependent inequalities, and tightness classes for particular distribution shapes (Steinerberger, 2018).

The Min-K% paradigm thus delivers a general toolkit for quantifying tail events, calibrating risk, and detecting outlier or memorized behavior in high-dimensional statistical and machine learning settings, supported by sharp probabilistic inequalities and algorithmic adaptivity.
