Min-K% Probability Analysis
- Min-K% Probability Analysis is a framework that quantifies the lowest k% outcomes using order statistics to assess risk and detect anomalies.
- It utilizes tight concentration inequalities and quantile estimation techniques to provide high-confidence performance guarantees.
- The framework applies to diverse settings including language models, discrete distributions, and robust estimation for both model behavior and risk control.
Min-K% Probability Analysis is a statistical framework focused on the analysis and high-confidence quantile estimation of the minimum of a collection of random variables, typically using order statistics such as the minimum of i.i.d. samples or the minima of token-level log-probabilities in autoregressive models. Its central concept is to quantify the most extreme (lowest) fraction—namely, the worst-performing k%—of a sequence of probabilistic outcomes for robust risk and anomaly assessment, pre-training data detection, and reliable high-probability guarantees.
1. Mathematical Formulation and Core Principles
The Min-K% criterion evaluates the behavior of the lowest k% order statistics. Given a sequence of $T$ values (e.g., token-level log-likelihoods $\ell_t = \log p(x_t \mid x_{<t})$), the Min-K% score is the mean of the lowest $m$ values, where $m = \lceil kT/100 \rceil$ and the order statistics $\ell_{(1)} \le \ell_{(2)} \le \cdots \le \ell_{(T)}$ are sorted in increasing order. For log-probabilities in LLMs:
$$\operatorname{Min\text{-}K\%}(x) = \frac{1}{m} \sum_{i=1}^{m} \ell_{(i)},$$
where $\ell_{(1)}, \dots, \ell_{(m)}$ are the $m$ smallest sorted log-probabilities. This technique generalizes to other settings, including coverage of quantiles and the minimum across i.i.d. random variables. In probabilistic estimation, "Min-K% Probability Analysis" refers to the explicit control and quantification of the smallest (worst) outcomes, and their deviation probabilities, via information-theoretic bounds.
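As a minimal sketch, the Min-K% score can be computed directly from a list of token log-probabilities (the function name and toy values below are illustrative, not from any cited implementation):

```python
import math

def min_k_score(log_probs, k=20.0):
    """Mean of the lowest k% token log-probabilities (the Min-K% statistic)."""
    m = max(1, math.ceil(len(log_probs) * k / 100.0))
    lowest = sorted(log_probs)[:m]  # ascending: most surprising tokens first
    return sum(lowest) / m

# One very unlikely token dominates the lowest 20% of a 5-token sequence.
toy_log_probs = [-0.1, -0.2, -5.0, -0.3, -0.15]
print(min_k_score(toy_log_probs, k=20.0))  # m = 1, so this is just -5.0
```

A lower (more negative) score indicates weaker model predictions on the sequence's hardest tokens.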
2. Concentration Inequalities for the Minimum
Tight finite-sample bounds on the minimum of i.i.d. random variables, particularly binomial or general discrete/continuous distributions, underpin rigorous Min-K% analysis. Let $X_1, \dots, X_n$ be independent $\mathrm{Bin}(m, p)$, and $Y = \min_{1 \le i \le n} X_i$. Explicit nonasymptotic high-probability bounds can be derived as follows (Zhu et al., 25 Feb 2025):
- Lower bound: For threshold $k$ with $k/m < p$,
$$\mathbb{P}(Y \le k) \;\ge\; 1 - \Big(1 - \tfrac{1}{m+1}\, e^{-m\,\mathrm{kl}(k/m,\,p)}\Big)^{n}.$$
- Upper bound:
$$\mathbb{P}(Y \le k) \;\le\; n\, e^{-m\,\mathrm{kl}(k/m,\,p)},$$
where $\mathrm{kl}(a, b) = a \log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}$ is the binary KL divergence and the upper bound follows from a union bound over the $n$ variables.
For quantile-based selection, identify the threshold $k_\delta$ corresponding to the $\delta$-quantile of $Y$ via $\mathrm{kl}(k_\delta/m,\, p) \approx \frac{1}{m}\log\frac{n}{\delta}$, i.e., by inverting the exponential bound. This framework generalizes: for any collection of i.i.d. variables, the Min-K% behavior is tightly controlled by Sanov-type and Chernoff-style exponential bounds in terms of KL divergence.
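These exponential tail bounds are easy to sanity-check numerically. The sketch below compares a Monte Carlo estimate of $\mathbb{P}(\min_i X_i \le k)$ for binomial samples against the union-bound-plus-Chernoff upper bound $n\,e^{-m\,\mathrm{kl}(k/m,\,p)}$ (a standard form assumed here; the cited work's exact constants may differ):

```python
import math
import random

def kl_bern(a, b):
    """Binary KL divergence kl(a || b) in nats."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def min_tail_upper_bound(n, m, p, k):
    """Chernoff + union bound on P(min_i X_i <= k) for X_i ~ Bin(m, p), k/m < p."""
    return min(1.0, n * math.exp(-m * kl_bern(k / m, p)))

def mc_min_tail(n, m, p, k, trials=3000, seed=0):
    """Monte Carlo estimate of P(min_i X_i <= k)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest = min(sum(rng.random() < p for _ in range(m)) for _ in range(n))
        hits += smallest <= k
    return hits / trials

n, m, p, k = 20, 50, 0.5, 15
print(mc_min_tail(n, m, p, k), "<=", min_tail_upper_bound(n, m, p, k))
```

The empirical tail probability should sit comfortably below the analytic bound; tightening $k$ toward $mp$ makes the bound vacuous, as expected for Chernoff-style inequalities.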
3. Min-K% in Statistical Estimation: High-Confidence Quantiles and Missing Mass
In distribution estimation, Min-K% analysis allows for robust large-deviation quantile control. For the missing mass $M_0$ (the remaining probability of symbols not observed in a sample of size $n$ from a discrete distribution), explicit quantile bounds follow from variance-sensitive large-deviation inequalities (Berend et al., 2012):
$$\mathbb{P}\big(M_0 \ge \mathbb{E}[M_0] + \varepsilon\big) \;\le\; e^{-n\varepsilon^{2}},$$
ensuring that $M_0 \le \mathbb{E}[M_0] + \sqrt{\log(1/\delta)/n}$ with probability at least $1 - \delta$, thus controlling the upper Min-K% quantiles. This principle extends to other order statistics, quantiles, and risk measures for high-probability guarantees.
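A short simulation illustrates this concentration; for a uniform distribution over $d$ symbols, $\mathbb{E}[M_0] = (1 - 1/d)^n$, and the sampled values of $M_0$ cluster tightly around it (the setup below is a hypothetical toy example):

```python
import random

def missing_mass(probs, sample):
    """M0: total probability of symbols never observed in the sample."""
    seen = set(sample)
    return sum(p for sym, p in enumerate(probs) if sym not in seen)

def simulate_missing_mass(probs, n, trials=2000, seed=1):
    """Draw `trials` independent samples of size n; return their missing masses."""
    rng = random.Random(seed)
    symbols = list(range(len(probs)))
    return [missing_mass(probs, rng.choices(symbols, weights=probs, k=n))
            for _ in range(trials)]

# Uniform over 100 symbols, n = 100 draws: E[M0] = 0.99**100 ~= 0.366.
probs = [0.01] * 100
masses = simulate_missing_mass(probs, n=100)
mean_m0 = sum(masses) / len(masses)
print(round(mean_m0, 3), round(max(masses), 3))
```

Even the largest observed $M_0$ stays within a few multiples of $\sqrt{1/n}$ of the mean, consistent with the sub-Gaussian upper-deviation bound.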
4. Min-K% for Model Behavior: Memorization and Outlier Detection
The Min-K% statistic has been adopted for pre-training data detection in LLMs. For each position $t$, using the model conditional distribution $p(x_t \mid x_{<t})$, Min-K% selects the lowest-$k\%$ log-likelihoods across a sequence. Empirically, these measure the model's weakest predictions, which are highly indicative of whether a sample was seen during training (Zhang et al., 2024).
- Min-K%++ Extension: Advances beyond averaging the lowest log-probabilities by normalizing each token log-prob against the context's conditional mean and variance, i.e.,
$$\operatorname{Min\text{-}K\%{+}{+}}(x_t) = \frac{\log p(x_t \mid x_{<t}) - \mu_{x_{<t}}}{\sigma_{x_{<t}}},$$
where $\mu_{x_{<t}} = \mathbb{E}_{z \sim p(\cdot \mid x_{<t})}\big[\log p(z \mid x_{<t})\big]$ and $\sigma_{x_{<t}}$ is the corresponding standard deviation. This normalized score acts as a discrete curvature criterion and sharply identifies local maxima of the model's distribution ("memorized" points). The Min-K%++ method shows substantial AUROC improvements on training- vs. non-training data detection benchmarks.
| Method | Statistic | Application |
|---|---|---|
| Min-K% | Mean of the worst-k% log-probs | Pre-training data detection |
| Min-K%++ | Curvature-normalized minima | Memorization detection |
These Min-K% analytics are robust to context variability and reveal sharp boundaries between memorized and non-memorized sequences.
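The Min-K%++ normalization can be sketched as follows, assuming access to the model's full next-token distributions at each position (the input format and function names here are illustrative, not an official implementation):

```python
import math

def min_k_pp_token_scores(next_token_dists, token_ids):
    """Per-token Min-K%++ scores: the realized log-prob, standardized by the
    mean and std of log-probs under the model's own next-token distribution."""
    scores = []
    for dist, tok in zip(next_token_dists, token_ids):
        logs = [math.log(p) for p in dist]
        mu = sum(p * lp for p, lp in zip(dist, logs))              # E_z[log p(z)]
        var = sum(p * (lp - mu) ** 2 for p, lp in zip(dist, logs))
        scores.append((logs[tok] - mu) / math.sqrt(var) if var > 0 else 0.0)
    return scores

def min_k_pp(next_token_dists, token_ids, k=20.0):
    """Mean of the lowest k% normalized token scores."""
    s = sorted(min_k_pp_token_scores(next_token_dists, token_ids))
    m = max(1, math.ceil(len(s) * k / 100.0))
    return sum(s[:m]) / m

# A likely token scores above the conditional mean; an unlikely one below it.
print(min_k_pp_token_scores([[0.7, 0.2, 0.1]], [0])[0] > 0)   # True
print(min_k_pp_token_scores([[0.7, 0.2, 0.1]], [2])[0] < 0)   # True
```

The standardization makes scores comparable across contexts of very different entropy, which is what drives the robustness to context variability noted above.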
5. Universal Lower Bounds and Sharpness in Continuous Laws
In continuous probability contexts, universal Min-K% lower bounds quantify how likely it is that, conditional on an extreme event (the sum exceeding a threshold), the minimum of a set takes small values. For i.i.d. $X_1, \dots, X_n$ with continuous density and median $m$,
$$\mathbb{P}\Big(\min_{1 \le i \le n} X_i \le m \;\Big|\; \sum_{i=1}^{n} X_i \ge t\Big) \;\ge\; \frac{c}{\log n}$$
for a universal constant $c > 0$. This logarithmic denominator is unavoidable: sharpened constructions prove the bound is optimal up to constants (Steinerberger, 2018).
6. High-Probability Minimax Rates in Discrete Estimation
In large-deviation regimes, Min-K% quantile control is essential in establishing high-probability minimax lower bounds. For estimating a discrete distribution $p$ of support size $d$ from $n$ i.i.d. samples, the minimax lower bound for the KL risk at confidence level $\delta$ is
$$\inf_{\hat p}\, \sup_{p}\; Q_{1-\delta}\big[\mathrm{KL}(p \,\|\, \hat p)\big] \;=\; \Omega\!\left(\frac{d + \log(1/\delta)}{n}\right),$$
where $Q_{1-\delta}$ denotes the $(1-\delta)$-quantile of the loss. No estimator can beat this rate for the Min-K% quantile of the KL loss (Hoeven et al., 23 Jul 2025). Efficient algorithms such as OTB (Online-to-Batch) with suffix averaging achieve matching upper bounds up to additional logarithmic factors. The $\log(1/\delta)$ penalty in controlling rare Min-K% probability events accounts for the additional sample complexity compared to expected-risk settings.
| Setting | Min-K% Rate | Reference |
|---|---|---|
| Expected KL risk | $\Theta(d/n)$ | (Hoeven et al., 23 Jul 2025) |
| High-prob Min-K% KL quantile | $\Theta\big((d + \log(1/\delta))/n\big)$ | (Hoeven et al., 23 Jul 2025) |
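To make the gap between expected risk and high-probability quantiles concrete, the sketch below simulates the KL loss of a simple add-$\beta$ smoothed estimator (a toy stand-in, not the OTB algorithm of the cited work) and compares the mean loss with a high quantile:

```python
import math
import random

def kl_div(p, q):
    """KL(p || q) in nats; assumes q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def add_beta_estimate(counts, n, beta=0.5):
    """Add-beta (Krichevsky-Trofimov-style) smoothed frequency estimate."""
    d = len(counts)
    return [(c + beta) / (n + beta * d) for c in counts]

def kl_risk_samples(p, n, trials=2000, seed=2):
    """Sorted samples of KL(p || p_hat) across independent datasets of size n."""
    rng = random.Random(seed)
    d = len(p)
    out = []
    for _ in range(trials):
        counts = [0] * d
        for sym in rng.choices(range(d), weights=p, k=n):
            counts[sym] += 1
        out.append(kl_div(p, add_beta_estimate(counts, n)))
    return sorted(out)

# Uniform distribution on d = 20 symbols, n = 200 samples per dataset.
risks = kl_risk_samples([1 / 20] * 20, n=200)
mean_risk = sum(risks) / len(risks)
q99 = risks[int(0.99 * len(risks))]
print(round(mean_risk, 4), round(q99, 4))  # the 99% quantile exceeds the mean
```

The spread between `mean_risk` and `q99` is exactly the phenomenon the $\log(1/\delta)$ term accounts for: controlling rare bad datasets is strictly harder than controlling the average one.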
7. Connections and Open Problems
Min-K% probability analysis bridges order statistics, large deviations, concentration of measure, and robust estimation. Its methodology elucidates phenomena in both discrete (model quantiles, missing mass, high-confidence tail behavior) and continuous (scale-invariant conditional minima) probabilistic systems. Open questions remain regarding optimal Min-K% bounds for sums of independent random variables, weightings in scale-dependent inequalities, and tightness classes for particular distribution shapes (Steinerberger, 2018).
The Min-K% paradigm thus delivers a general toolkit for quantifying tail events, calibrating risk, and detecting outlier or memorized behavior in high-dimensional statistical and machine learning settings, supported by sharp probabilistic inequalities and algorithmic adaptivity.