Min-K% Probability Analysis
- Min-K% Probability Analysis is a framework that quantifies the lowest k% outcomes using order statistics to assess risk and detect anomalies.
- It utilizes tight concentration inequalities and quantile estimation techniques to provide high-confidence performance guarantees.
- The framework applies to diverse settings including language models, discrete distributions, and robust estimation for both model behavior and risk control.
Min-K% Probability Analysis is a statistical framework for the analysis and high-confidence quantile estimation of the minimum of a collection of random variables, typically via order statistics such as the minimum of i.i.d. samples or the minima of token-level log-probabilities in autoregressive models. Its central idea is to quantify the most extreme (lowest) fraction of a sequence of probabilistic outcomes, namely the worst-performing k%, for robust risk and anomaly assessment, pre-training data detection, and reliable high-probability guarantees.
1. Mathematical Formulation and Core Principles
The Min-K% criterion evaluates the behavior of the lowest k% order statistics. Given a sequence of $T$ values (e.g., token-level log-likelihoods), the Min-K% score is the mean of the lowest $m$ values, where $m = \lceil (k/100)\, T \rceil$ and the order statistics are sorted in increasing order. For the log-probabilities of a token sequence $x = (x_1, \dots, x_T)$ under an autoregressive model:

$$\mathrm{MinK\%}(x) = \frac{1}{m} \sum_{i=1}^{m} \ell_{(i)}, \qquad \ell_{(1)} \le \ell_{(2)} \le \cdots \le \ell_{(T)},$$

where $\ell_t = \log p(x_t \mid x_{<t})$ and the $\ell_{(i)}$ are the sorted log-probabilities. This technique generalizes to other settings, including coverage of quantiles and the minimum across i.i.d. random variables. In probabilistic estimation, "Min-K% Probability Analysis" refers to the explicit control and quantification of the smallest (worst) outcomes, and their deviation probabilities, via information-theoretic bounds.
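In code, the Min-K% score reduces to sorting the token log-probabilities and averaging the lowest $m = \lceil (k/100)\,T \rceil$ of them. A minimal self-contained sketch (function name and the toy values are illustrative, not from any released implementation):

```python
import math

def min_k_score(token_logprobs, k=20):
    """Mean of the lowest k% token log-probabilities (the Min-K% statistic)."""
    m = max(1, math.ceil(len(token_logprobs) * k / 100))  # m = ceil((k/100) * T)
    lowest = sorted(token_logprobs)[:m]                   # lowest-m order statistics
    return sum(lowest) / m

# Toy sequence of token log-probabilities (hypothetical values).
lp = [-0.1, -2.3, -0.5, -4.0, -0.2, -1.1, -0.05, -3.2, -0.4, -0.3]
score = min_k_score(lp, k=20)  # m = 2: averages the two lowest, -4.0 and -3.2
```

Lower (more negative) scores indicate sequences on which the model's weakest predictions are especially poor.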
2. Concentration Inequalities for the Minimum
Tight finite-sample bounds on the minimum of i.i.d. random variables, particularly binomial or general discrete/continuous distributions, underpin rigorous Min-K% analysis. Let $X_1, \dots, X_n$ be independent $\mathrm{Bin}(N, p)$, and $X_{\min} = \min_{1 \le i \le n} X_i$. Explicit nonasymptotic high-probability bounds can be derived as follows (Zhu et al., 25 Feb 2025):
- Lower bound: For a threshold $q < p$ with $n\, e^{-N\,\mathrm{kl}(q\|p)} \le \delta$, a Chernoff bound combined with a union bound over the $n$ variables gives, with probability at least $1 - \delta$,
$$X_{\min} \ge Nq.$$
- Upper bound: By independence, $\Pr(X_{\min} > Nq) = \big(1 - \Pr(X_1 \le Nq)\big)^n$, so with probability at least $1 - \delta$,
$$X_{\min} \le Nq' \quad \text{whenever} \quad \Big(1 - \tfrac{1}{N+1}\, e^{-N\,\mathrm{kl}(q'\|p)}\Big)^n \le \delta,$$
where $\mathrm{kl}(q\|p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}$ is the binary KL divergence and $q, q' \le p$.
For quantile-based selection, identify the threshold $q_\delta$ corresponding to the $\delta$-quantile of $X_{\min}$ via $\Pr(X_{\min} \le N q_\delta) = \delta$, e.g., by inverting the exponential bounds above. This framework generalizes: for any collection of i.i.d. variables, the Min-K% behavior is tightly controlled by Sanov-type and Chernoff-style exponential bounds in terms of KL divergence.
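As a quick numerical sanity check, the Chernoff-plus-union-bound tail estimate for the minimum of i.i.d. binomials can be compared against Monte Carlo simulation. A minimal sketch (function names and parameter values are illustrative assumptions, not taken from the cited paper):

```python
import math
import random

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def min_tail_union_bound(n, N, p, q):
    """Chernoff + union bound on P(min_i X_i <= N*q) for X_i iid Bin(N, p), q < p."""
    return min(1.0, n * math.exp(-N * kl_bernoulli(q, p)))

random.seed(0)
n, N, p, q = 20, 200, 0.5, 0.4
trials = 2000
hits = 0
for _ in range(trials):
    # Minimum of n iid Binomial(N, p) draws.
    xmin = min(sum(random.random() < p for _ in range(N)) for _ in range(n))
    hits += xmin <= N * q
empirical = hits / trials
bound = min_tail_union_bound(n, N, p, q)
```

The bound dominates the empirical tail probability, as expected; its exponential dependence on $N\,\mathrm{kl}(q\|p)$ is what makes the quantile inversion in the text practical.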
3. Min-K% in Statistical Estimation: High-Confidence Quantiles and Missing Mass
In distribution estimation, Min-K% analysis allows for robust large-deviation quantile control. For the missing mass $M_0$ (the total probability of symbols not observed in a sample of size $n$ from a discrete distribution), explicit quantile bounds follow from variance-sensitive large-deviation inequalities (Berend et al., 2012): for $\varepsilon > 0$,
$$\Pr\big(M_0 \ge \mathbb{E}[M_0] + \varepsilon\big) \le e^{-c\, n\, \varepsilon^2}$$
for an absolute constant $c > 0$. Setting $\varepsilon = \sqrt{\log(1/\delta)/(c\,n)}$ ensures $\Pr(M_0 \ge \mathbb{E}[M_0] + \varepsilon) \le \delta$, thus controlling the upper Min-K% quantiles. This principle extends to other order statistics, quantiles, and risk measures for high-probability guarantees.
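A small simulation illustrates this quantile control: for a uniform source, the empirical $(1-\delta)$-quantile of the missing mass stays below the mean-plus-deviation bound. This is a sketch assuming the constant $c = 1$; all names and parameter values are illustrative:

```python
import math
import random

def missing_mass(sample, probs):
    """Total probability of symbols never observed in the sample."""
    seen = set(sample)
    return sum(p for s, p in enumerate(probs) if s not in seen)

random.seed(1)
d, n, delta = 50, 100, 0.05
probs = [1.0 / d] * d                      # uniform source distribution
reps = 1000
masses = sorted(
    missing_mass(random.choices(range(d), k=n), probs)
    for _ in range(reps)
)
mean_mass = sum(masses) / reps             # ~ E[M0] = (1 - 1/d)^n
q95 = masses[int((1 - delta) * reps)]      # empirical (1-delta)-quantile
bound = mean_mass + math.sqrt(math.log(1 / delta) / n)  # c = 1 assumed
```

The bound is loose here (sub-Gaussian constants are not tuned), but it correctly sits above the observed upper quantile.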
4. Min-K% for Model Behavior: Memorization and Outlier Detection
The Min-K% statistic has been adopted for pre-training data detection in LLMs. For each position $t$, using the model conditional distribution $p(\cdot \mid x_{<t})$, Min-K% selects the lowest-$k\%$ log-likelihoods $\log p(x_t \mid x_{<t})$ across a sequence. Empirically, these measure the model's weakest predictions, which are highly indicative of whether a sample was seen during training (Zhang et al., 2024).
- Min-K%++ Extension: Advances beyond averaging the lowest log-probabilities by normalizing each token log-probability against the context's conditional mean and variance, i.e.,
$$\mathrm{score}(x_t) = \frac{\log p(x_t \mid x_{<t}) - \mu_{x_{<t}}}{\sigma_{x_{<t}}},$$
where $\mu_{x_{<t}} = \mathbb{E}_{z \sim p(\cdot \mid x_{<t})}\big[\log p(z \mid x_{<t})\big]$ and $\sigma_{x_{<t}}$ is the corresponding standard deviation. This discrete curvature score sharply identifies local maxima ("memorized" points). The Min-K%++ method shows substantial AUROC improvements in training- vs. non-training data detection benchmarks.
| Method | Statistic | Application |
|---|---|---|
| Min-K% | Worst-$k\%$ log-probs | Non-training data identification |
| Min-K%++ | Curvature-normalized min | Mode/memorization detection |
These Min-K% analytics are robust to context variability and reveal sharp boundaries between memorized and non-memorized sequences.
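The Min-K%++ normalization can be sketched directly from a matrix of next-token logits. The helper below is an illustrative implementation (not the authors' released code): it z-scores each observed token's log-probability against the mean and standard deviation of log-probabilities under the model's own predictive distribution, then averages the lowest k% of the normalized scores:

```python
import math

def min_k_pp(logits_rows, token_ids, k=20):
    """Min-K%++ sketch: z-score each observed token's log-prob against the
    mean/std of log-probs under the model's next-token distribution, then
    average the lowest k% of the normalized scores."""
    scores = []
    for logits, tok in zip(logits_rows, token_ids):
        mx = max(logits)
        exps = [math.exp(v - mx) for v in logits]
        log_z = mx + math.log(sum(exps))
        logp = [v - log_z for v in logits]                 # log-softmax
        p = [math.exp(lp) for lp in logp]
        mu = sum(pi * lpi for pi, lpi in zip(p, logp))     # E_z[log p(z|ctx)]
        var = sum(pi * (lpi - mu) ** 2 for pi, lpi in zip(p, logp))
        sigma = math.sqrt(max(var, 1e-12))                 # guard degenerate rows
        scores.append((logp[tok] - mu) / sigma)
    m = max(1, math.ceil(len(scores) * k / 100))
    return sum(sorted(scores)[:m]) / m
```

For a sharply peaked conditional distribution, the modal token receives a positive score while low-probability tokens receive strongly negative scores, which is what makes the statistic sensitive to local maxima of the model distribution.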
5. Universal Lower Bounds and Sharpness in Continuous Laws
In continuous probability contexts, universal Min-K% lower bounds quantify how likely it is that, conditional on an extreme event (the sum exceeding a threshold), the minimum of a set takes small values. For i.i.d. $X_1, \dots, X_n$ with continuous density and median $m$, a bound of the form
$$\Pr\Big(\min_{1 \le i \le n} X_i \le m \;\Big|\; X_1 + \cdots + X_n \ge s\Big) \ge \frac{c}{\log n}$$
holds for a universal constant $c > 0$, uniformly over the threshold $s$. This logarithmic denominator is unavoidable: sharpened constructions prove the bound is optimal up to constants (Steinerberger, 2018).
6. High-Probability Minimax Rates in Discrete Estimation
In large-deviation regimes, Min-K% quantile control is essential in establishing high-probability minimax lower bounds. For estimating a discrete distribution of support size $d$ from $n$ samples, the minimax lower bound for the KL risk at confidence level $1 - \delta$ is of order
$$\Omega\!\left(\frac{d + \log(1/\delta)}{n}\right).$$
No estimator can beat this rate for the Min-K% quantile of the KL loss (Hoeven et al., 23 Jul 2025). Efficient algorithms such as OTB (Online-to-Batch) with suffix averaging achieve matching upper bounds up to additional logarithmic factors. The additive $\log(1/\delta)/n$ penalty in controlling rare Min-K% probability events accounts for the additional sample complexity compared to expected-risk settings.
| Setting | Min-K% Rate | Reference |
|---|---|---|
| Expected KL risk | $\Theta(d/n)$ | (Hoeven et al., 23 Jul 2025) |
| High-prob Min-K% KL quantile | $\Theta\big((d + \log(1/\delta))/n\big)$ | (Hoeven et al., 23 Jul 2025) |
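The gap between expected-risk and high-probability rates can be observed in simulation. The sketch below (the add-1 estimator, the uniform truth, and the slack factor 3 are illustrative assumptions, not choices from the cited paper) estimates the $(1-\delta)$-quantile of the KL loss and checks it against the $(d + \log(1/\delta))/n$ scale:

```python
import math
import random

def kl_uniform_vs(est, d):
    """KL divergence KL(uniform_d || est)."""
    return sum((1.0 / d) * math.log((1.0 / d) / est[s]) for s in range(d))

random.seed(2)
d, n, delta = 10, 500, 0.05
reps = 400
losses = []
for _ in range(reps):
    counts = [0] * d
    for _ in range(n):
        counts[random.randrange(d)] += 1
    est = [(c + 1) / (n + d) for c in counts]  # add-1 (Laplace) estimator
    losses.append(kl_uniform_vs(est, d))
losses.sort()
quantile = losses[int((1 - delta) * reps)]     # empirical (1-delta)-quantile
rate = (d + math.log(1 / delta)) / n           # (d + log(1/delta)) / n scale
```

The observed upper quantile of the KL loss visibly exceeds its mean, reflecting the extra $\log(1/\delta)/n$ term paid for rare-event control.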
7. Connections and Open Problems
Min-K% probability analysis bridges order statistics, large deviations, concentration of measure, and robust estimation. Its methodology elucidates phenomena in both discrete (model quantiles, missing mass, high-confidence tail behavior) and continuous (scale-invariant conditional minima) probabilistic systems. Open questions remain regarding optimal Min-K% bounds for sums of independent random variables, weightings in scale-dependent inequalities, and tightness classes for particular distribution shapes (Steinerberger, 2018).
The Min-K% paradigm thus delivers a general toolkit for quantifying tail events, calibrating risk, and detecting outlier or memorized behavior in high-dimensional statistical and machine learning settings, supported by sharp probabilistic inequalities and algorithmic adaptivity.