Anytime-Valid Certified Robustness
- Anytime-valid certified robustness is a sequential testing paradigm that uses nonnegative martingales and e-processes to maintain uniform guarantees under optional stopping and adaptive sampling.
- It integrates methodologies across image classification, ASR, and RLVR deployments by converting noisy evaluations into dynamic confidence sequences and certified radii.
- This framework trades slight finite-sample tightness for compute-efficient, anytime-valid certification, ensuring robust, deployment-time risk control and improved practical safety guarantees.
Anytime-valid certified robustness denotes a family of certification procedures in which robustness guarantees remain valid under optional stopping, adaptive sampling, and continuous monitoring. Across the settings considered in recent work, the common structure is a nonnegative martingale or supermartingale (typically an e-process or test martingale) combined with Ville’s inequality to obtain time-uniform control. In randomized smoothing for image classifiers, this yields confidence sequences for class probabilities and certified radii that can be updated after every noisy evaluation (Cullen et al., 26 Jun 2026). In automatic speech recognition (ASR), the same statistical machinery is used to certify token existence and adversarial exclusion, then to certify a sentence through a rank-based tournament over filtered candidates, producing a certified transcript and robustness radius under additive Gaussian noise (Cullen et al., 26 Jun 2026). In deployment-time control for RLVR-trained LLMs, an e-process per release threshold yields anytime-pathwise certificates that the verifier-measured failure rate among released outputs remains below a contractual risk budget, up to a slack that vanishes as the number of released outputs grows (Khosravi et al., 18 May 2026).
1. Definition and problem setting
Certified robustness, in the classifier setting, means that for a classifier $f$ and norm $\|\cdot\|$, a sample $x$ is certified robust at radius $r$ if $f(x') = f(x)$ for all $x'$ in the ball $\{x' : \|x' - x\| < r\}$ (Cullen et al., 26 Jun 2026). Randomized smoothing instantiates this by defining a smoothed classifier
$$g(x) = \arg\max_{c}\; \mathbb{P}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \varepsilon) = c\big]$$
and then mapping lower and upper probability bounds to an $\ell_2$ radius through Gaussian smoothing formulas (Cullen et al., 26 Jun 2026).
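As a minimal sketch of the probability-to-radius mapping, the snippet below uses the standard Gaussian smoothing formula from the randomized smoothing literature, $R = \tfrac{\sigma}{2}\,(\Phi^{-1}(p_{\text{lower}}) - \Phi^{-1}(p_{\text{upper}}))$. The function name `certified_l2_radius` and its argument names are illustrative assumptions, not identifiers taken from the cited papers.

```python
from statistics import NormalDist


def certified_l2_radius(p_lower: float, p_upper: float, sigma: float) -> float:
    """Map probability bounds to an l2 radius under Gaussian smoothing.

    p_lower: lower bound on the top-class probability.
    p_upper: upper bound on the runner-up class probability.
    sigma:   standard deviation of the smoothing noise.
    Returns 0.0 (abstain) when the bounds do not separate the two classes.
    """
    if not (0.0 < p_upper < p_lower < 1.0):
        return 0.0
    inv_cdf = NormalDist().inv_cdf  # standard normal quantile function
    return 0.5 * sigma * (inv_cdf(p_lower) - inv_cdf(p_upper))


radius = certified_l2_radius(p_lower=0.9, p_upper=0.1, sigma=0.5)
```

Tighter probability bounds translate directly into a larger radius, which is why the quality of the confidence sequence matters for certification.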
The same notion is generalized in sequence and deployment settings, but the certified object changes. For ASR, the final output is a certified transcript together with a robustness radius such that, with a prescribed coverage probability, the transcript is invariant to any perturbation whose $\ell_2$ norm is less than that radius (Cullen et al., 26 Jun 2026). For RLVR deployment, the certified object is not a per-input perturbation radius but a per-deployment guarantee on selective risk: at every time, the verifier-measured failure rate among released outputs is bounded by the contractual risk budget plus a slack that shrinks with the number of acted rounds (Khosravi et al., 18 May 2026).
A central motivation is that fixed-horizon procedures are poorly matched to adaptive deployments. Classical randomized smoothing requires large, precommitted sample sizes and suffers from the peeking problem if one stops early; offline conformal-risk methods depend on exchangeability; online-conformal methods control long-run averages rather than pathwise risk; and standard ASR evaluation is difficult in deployment because oracle transcripts are unavailable (Cullen et al., 26 Jun 2026, Khosravi et al., 18 May 2026, Cullen et al., 26 Jun 2026). Anytime-valid procedures replace fixed-horizon guarantees with uniform-in-time guarantees that tolerate monitoring and data-dependent stopping.
2. Statistical foundations: e-values, test martingales, and confidence sequences
The technical core is a nonnegative e-value or e-process. In the sequential setting, if $e_t$ is an e-value at time $t$, wealth is accumulated as
$$W_t = \prod_{s=1}^{t} e_s, \qquad W_0 = 1.$$
By Ville’s inequality, for any level $\alpha \in (0, 1)$ and under the null,
$$\mathbb{P}\big(\exists\, t \ge 1 : W_t \ge 1/\alpha\big) \le \alpha,$$
so threshold crossing yields valid rejection at any time and therefore optional-stopping validity (Cullen et al., 26 Jun 2026). The same logic appears in the image-certification setting, where an e-process $(E_t)_{t \ge 0}$ is a nonnegative supermartingale under the null with $\mathbb{E}[E_t] \le 1$ for all $t$, and tests based on thresholding are anytime-valid (Cullen et al., 26 Jun 2026). In RLVR deployment, a Ville-type e-process is maintained for each release threshold on a Bonferroni grid, evaluated against the RLVR filtration, yielding pathwise guarantees on the realized stream without exchangeability and without pooling across deployments (Khosravi et al., 18 May 2026).
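The wealth-and-threshold logic above can be illustrated with a small betting test on Bernoulli observations. This is a sketch under stated assumptions: the per-step e-value $1 + \lambda (X_t - p_0)$ with a fixed betting fraction `lam` is a generic textbook choice for the null $\mathbb{P}(X = 1) \le p_0$, not the specific e-value used in the cited papers, and the function name `betting_test` is hypothetical.

```python
def betting_test(samples, p0=0.5, lam=0.5, alpha=0.05):
    """Sequentially test H0: P(X = 1) <= p0 on binary samples.

    Each factor 1 + lam * (x - p0) is nonnegative for 0 <= lam <= 1 / p0
    and has expectation at most 1 under H0, so the running product is a
    nonnegative supermartingale. By Ville's inequality, rejecting when the
    wealth reaches 1 / alpha has error at most alpha at any stopping time.
    Returns (rejected, stopping_time, final_wealth).
    """
    if not 0.0 <= lam <= 1.0 / p0:
        raise ValueError("lam must lie in [0, 1 / p0] to keep wealth nonnegative")
    wealth = 1.0
    for t, x in enumerate(samples, start=1):
        wealth *= 1.0 + lam * (x - p0)
        if wealth >= 1.0 / alpha:
            return True, t, wealth  # stop early; the guarantee still holds
    return False, len(samples), wealth


rejected, stop_time, wealth = betting_test([1] * 50)
```

Because the guarantee is uniform over time, the caller may inspect `wealth` after every sample and stop whenever the threshold is crossed.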
A second common ingredient is inversion. In ASR, likelihood-ratio martingales
9
are inverted to obtain anytime-valid lower and upper confidence sequence bounds for the token-inclusion probabilities (Cullen et al., 26 Jun 2026). In sequential randomized smoothing, the set of nonrejected nulls at time $t$ forms a confidence set
$$C_t = \{p_0 \in [0, 1] : E_t(p_0) < 1/\alpha\},$$
where $E_t(p_0)$ denotes the e-process for the null $p = p_0$ and $\alpha$ is the error level, and the anytime-valid lower confidence bound is $\underline{p}_t = \inf C_t$; mapping $\underline{p}_t$ through the Gaussian smoothing radius formula yields a time-uniform certified radius (Cullen et al., 26 Jun 2026). In RLVR deployment, the inversion is threshold-oriented rather than probability-oriented: the certified set is the set of thresholds whose e-processes have crossed their Bonferroni-adjusted boundaries, and deployment uses the most permissive certified threshold (Khosravi et al., 18 May 2026).
| Setting | Certified object | Sequential mechanism |
|---|---|---|
| Randomized smoothing | lower confidence bound on the target-class probability and certified radius | mixture e-processes over Bernoulli trials |
| ASR | token existence/exclusion, winning sentence, end-to-end radius | Two-Sided Atomic Audit and Rank-Based Tournament |
| RLVR deployment | per-deployment selective risk | e-process per threshold with max-certified-threshold rule |
These constructions share a common interpretation: evidence is accumulated multiplicatively, stopping is triggered by wealth crossing, and the resulting certificate remains valid under peeking. This suggests that anytime-valid certified robustness is best understood not as a single robustness definition but as a sequential certification paradigm whose certified quantity depends on the prediction object and threat model.
3. Randomized smoothing as the canonical continuous-radius instance
The most direct instantiation is sequential randomized smoothing for image classification. For a fixed input $x$ and target class $c$, Bernoulli trials are defined by
$$X_i = \mathbf{1}\{f(x + \varepsilon_i) = c\}, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2 I) \text{ i.i.d.},$$
with success probability $p = \mathbb{P}_{\varepsilon}\big(f(x + \varepsilon) = c\big)$ (Cullen et al., 26 Jun 2026). Classical randomized smoothing estimates $p$ by Monte Carlo and then uses fixed-sample binomial confidence intervals such as Clopper–Pearson. The limitations described in recent work are that classic RS requires fixed horizons, often with a large, precommitted number of noisy evaluations per input, and invalidates coverage under peeking (Cullen et al., 26 Jun 2026).
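The anytime-valid alternative replaces fixed-sample confidence intervals with mixture e-processes. Let $X_i \in \{0, 1\}$ denote the $i$-th Bernoulli trial. For a point null $p = p_0$ and alternative $p = q$, the pointwise likelihood-ratio e-value is
$$e_i(q) = \left(\frac{q}{p_0}\right)^{X_i} \left(\frac{1 - q}{1 - p_0}\right)^{1 - X_i}$$
and the associated wealth process is
$$W_t(q) = \prod_{i=1}^{t} e_i(q) = \left(\frac{q}{p_0}\right)^{S_t} \left(\frac{1 - q}{1 - p_0}\right)^{t - S_t},$$
where $S_t = \sum_{i=1}^{t} X_i$ (Cullen et al., 26 Jun 2026). Since $q$ is unknown, the method integrates over a prior $\pi(q)$, yielding a mixture e-process $E_t(p_0) = \int_0^1 W_t(q)\, \pi(q)\, dq$. With a Beta prior this has closed form through Beta functions, and the paper further uses a mixture of truncated Beta components to focus mass on regions most relevant for certification, such as the robust region (Cullen et al., 26 Jun 2026).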
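To make the closed form concrete, the sketch below evaluates a single untruncated Beta$(a, b)$ mixture of Bernoulli likelihood ratios against a point null, inverts it over a grid of nulls to get a lower confidence bound, and maps that bound to a radius. Several choices here are assumptions for illustration: the Beta(1/2, 1/2) default (the KT prior), the grid resolution, the simplified radius $\sigma\,\Phi^{-1}(\underline{p})$, and all function names. The truncated-Beta mixtures and learned priors described in the text are not reproduced.

```python
from math import lgamma, log
from statistics import NormalDist


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_mixture_wealth(successes: int, trials: int, p0: float,
                       a: float = 0.5, b: float = 0.5) -> float:
    """Log wealth of the Beta(a, b)-mixture likelihood ratio against p = p0."""
    failures = trials - successes
    log_marginal = log_beta(a + successes, b + failures) - log_beta(a, b)
    log_null = successes * log(p0) + failures * log(1.0 - p0)
    return log_marginal - log_null


def lower_confidence_bound(successes: int, trials: int,
                           alpha: float = 0.001, grid_size: int = 2000) -> float:
    """Conservative lower end of the grid nulls not rejected at level alpha."""
    threshold = log(1.0 / alpha)
    for k in range(1, grid_size):
        if log_mixture_wealth(successes, trials, k / grid_size) < threshold:
            # The true endpoint lies between the previous grid point and this
            # one, so report the previous point to stay conservative.
            return (k - 1) / grid_size
    return 0.0  # no grid null survived; fall back to the trivial bound


def certified_radius(p_lower: float, sigma: float) -> float:
    """Simplified l2 radius from a lower bound alone; 0.0 means abstain."""
    return sigma * NormalDist().inv_cdf(p_lower) if p_lower > 0.5 else 0.0


p_lower = lower_confidence_bound(successes=95, trials=100)
radius = certified_radius(p_lower, sigma=0.5)
```

The log wealth is convex in `p0` with its minimum at the empirical frequency, so the nonrejected nulls form an interval and scanning upward from zero finds its lower end. Taking the running maximum of these bounds over time keeps the guarantee, since the mixture is an e-process for every null.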
A notable contribution is the use of a lightweight meta-learner to predict image-specific priors. The meta-learner takes a penultimate-layer embedding, a clean-image softmax summary, and a small Phase I glimpse of noisy evaluations, and outputs mixture weights, Beta parameters, and optionally truncation bounds (Cullen et al., 26 Jun 2026). To guard against prior misspecification, the final e-process includes a fixed small weight on the KT prior; because a convex combination of e-processes is again an e-process, validity is preserved (Cullen et al., 26 Jun 2026).
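The hedging step can be sketched as a convex combination of two Beta-mixture wealths, one from a (possibly misspecified) predicted prior and one from the KT prior Beta(1/2, 1/2). The weight `kt_weight=0.05`, the single-component predicted prior, and the function names are assumptions chosen for illustration; the source states only that the KT weight is fixed and small.

```python
from math import exp, lgamma, log


def log_beta_mixture(s: int, n: int, p0: float, a: float, b: float) -> float:
    """Log wealth of a Beta(a, b) mixture of Bernoulli likelihood ratios vs p0."""
    def log_b(u: float, v: float) -> float:
        return lgamma(u) + lgamma(v) - lgamma(u + v)
    return (log_b(a + s, b + n - s) - log_b(a, b)
            - s * log(p0) - (n - s) * log(1.0 - p0))


def hedged_log_wealth(s: int, n: int, p0: float, a: float, b: float,
                      kt_weight: float = 0.05) -> float:
    """Log of (1 - w) * W_predicted + w * W_KT, computed with log-sum-exp."""
    lw_pred = log(1.0 - kt_weight) + log_beta_mixture(s, n, p0, a, b)
    lw_kt = log(kt_weight) + log_beta_mixture(s, n, p0, 0.5, 0.5)
    m = max(lw_pred, lw_kt)
    return m + log(exp(lw_pred - m) + exp(lw_kt - m))


# A prior concentrated near 0 is badly wrong when 95 of 100 trials succeed,
# yet the hedged wealth never falls below the KT share of the wealth.
hedged = hedged_log_wealth(s=95, n=100, p0=0.5, a=1.0, b=50.0)
```

The cost of the hedge is at most a factor of `1 - kt_weight` when the predicted prior is good, while the floor of `kt_weight` times the KT wealth bounds the damage when it is bad.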
The certification pipeline has two phases. Phase I identifies the target class and features, then discards those samples from the e-process to maintain supermartingale validity. Phase II updates the mixture e-process sequentially, inverts it to obtain an anytime-valid lower confidence bound on the target-class probability, and computes the corresponding certified radius (Cullen et al., 26 Jun 2026). Because the procedure is anytime-valid, it supports precision-based stopping, plateau exit, target-radius stopping, an adversarial/UCB exit for quick rejection, and a bankruptcy exit for compute-saving conservative rejection (Cullen et al., 26 Jun 2026).
Empirically, the paper reports a 20-fold reduction in sample complexity compared to traditional methods while maintaining rigorous statistical guarantees, with viable certifications frequently completing in fewer than 6 samples (Cullen et al., 26 Jun 2026). Relative to a KT prior baseline, meta-learning reduces samples by 7–8 on average and tightens radii by up to about 9 on ImageNet; for non-robust samples, early rejection policies yield up to 0–1 speedups with only minor accuracy/radius penalties (Cullen et al., 26 Jun 2026). The paper also emphasizes an “anytime-valid tax”: anytime-valid intervals can be slightly wider than fixed-horizon Clopper–Pearson intervals for the same sample size (Cullen et al., 26 Jun 2026). It is therefore a misconception that anytime-validity is a free improvement over fixed-sample certification; the reported formulation instead trades some finite-sample tightness for optional-stopping validity and adaptive compute allocation.
4. Hierarchical certification for structured outputs in automatic speech recognition
For ASR, majority-class certification collapses in sequence spaces because probability mass fragments across many outputs. The proposed solution is hierarchical aggregation: certify atomic content first, then certify a sentence assembled from those atomic decisions (Cullen et al., 26 Jun 2026). Under additive white Gaussian noise 4, for a token 5 in the vocabulary 6, the inclusion probability is
7
A token existence certificate rejects the null 8, while an adversarial exclusion certificate rejects 9 (Cullen et al., 26 Jun 2026).
The Two-Sided Atomic Audit is a sequential, two-sided e-value or martingale audit over tokens discovered in an initial sample budget. With betting fraction 0 and token indicator 1, the updates are
2
3
initialized at 4 (Cullen et al., 26 Jun 2026). Tokens crossing 5 become part of the certified vocabulary 6 or the excluded vocabulary 7. Confidence sequences derived from likelihood-ratio martingales then map to token-level radii
8
and
9
with 0 (Cullen et al., 26 Jun 2026).
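The two-sided audit for a single token can be sketched with a pair of betting processes that wager in opposite directions on the token indicator. Everything specific below is an assumption made for illustration rather than the paper's exact update: the null boundary `p_null=0.5`, the factors $1 \pm \lambda (X_t - p_{\text{null}})$, the crossing threshold $1/\alpha$, and the name `two_sided_audit`.

```python
def two_sided_audit(indicators, lam=0.5, alpha=0.05, p_null=0.5):
    """Audit one token from a stream of 0/1 inclusion indicators.

    w_exist bets that the inclusion probability exceeds p_null (evidence
    for token existence); w_excl bets that it is below p_null (evidence
    for adversarial exclusion). Both start at 1 and stay nonnegative as
    long as lam <= 1 / max(p_null, 1 - p_null).
    Returns (decision, stopping_time).
    """
    if not 0.0 < lam <= 1.0 / max(p_null, 1.0 - p_null):
        raise ValueError("lam too large: wealth could become negative")
    w_exist, w_excl = 1.0, 1.0
    for t, x in enumerate(indicators, start=1):
        w_exist *= 1.0 + lam * (x - p_null)
        w_excl *= 1.0 - lam * (x - p_null)
        if w_exist >= 1.0 / alpha:
            return "certified", t
        if w_excl >= 1.0 / alpha:
            return "excluded", t
    return "undecided", len(indicators)


decision, stop_time = two_sided_audit([1] * 30)
```

A token that appears in roughly half of the noisy transcripts drives neither wealth upward, so it stays undecided and consumes the full budget, which matches the intuition that ambiguous tokens are the expensive ones.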
Sentence certification is handled by a Rank-Based Tournament. New noisy transcripts are filtered through 1, and the top-2 most frequent unique filtered sequences form the candidate set. At each step, the candidate with minimum WER to the current filtered sample is declared the winner, and candidate e-values are updated as
3
The average wealth
4
is a nonnegative martingale; stopping when it crosses 5 yields a sentence-level certificate without paying a multiplicity penalty linear in 6 (Cullen et al., 26 Jun 2026). The winner is 7, and the sentence-level radius is
8
The end-to-end radius is 9, with coverage at least 00 (Cullen et al., 26 Jun 2026).
The empirical setting comprises LibriSpeech and Common Voice 17.0 English; Whisper-Large-v3, Whisper-Small, HuBERT-Large, and Wav2Vec2-Large; AWGN at 01 dB; and budgets 02, 03, 04, 05 up to 06, with 07, 08, and 09 (Cullen et al., 26 Jun 2026). Reported outcomes include certification recall at 10 dB SNR ranging from 11–12 versus approximately 13–14 for baselines, and a Whisper-Large-v3 WER reduction from 15 to 16, approximately a 17–18 relative reduction (Cullen et al., 26 Jun 2026). Certified radius is reported to correlate inversely with WER and to remain informative when baselines’ recall collapses to zero; anytime-stopping yields up to 19 compute savings versus fixed-budget on autoregressive models and is 20–21 faster than ROVER for CTC models (Cullen et al., 26 Jun 2026).
The ASR formulation also clarifies an important distinction. The robustness claim is with respect to an AWGN smoothing model and is mapped to $\ell_2$ perturbations; the paper explicitly notes that real-world perturbations may be non-Gaussian or psychoacoustic, and robustness may not fully reflect human perception (Cullen et al., 26 Jun 2026). Another limitation is multiplicity: atomic decisions are thresholded per token, giving local guarantees but not family-wise error control across the vocabulary (Cullen et al., 26 Jun 2026).
5. Deployment-time selective-risk certification for RLVR-trained LLMs
A second major line of work extends anytime-valid certified robustness from perturbation robustness to deployment-time risk control. In this setting, a specialist LLM is fine-tuned with reinforcement learning from verifiable rewards on operator-local data, a deterministic verifier returns a binary signal 24 at each round, and a gate releases or abstains (Khosravi et al., 18 May 2026). The operator requires that, on this deployment stream and at every time, the verifier-measured failure rate among released outputs does not exceed a contractual budget 25, up to a small slack that vanishes with the number of released outputs (Khosravi et al., 18 May 2026).
The central statistic is the gated excess-risk increment
26
where 27 is the release rule induced by threshold 28 on a predictable calibrated score 29 (Khosravi et al., 18 May 2026). Under the unsafe null at threshold 30, the conditional mean of 31 is nonnegative. On epoch 32, the e-process is
33
with predictable betting fraction 34 clipped to 35 (Khosravi et al., 18 May 2026). Proposition-level validity states that under the unsafe null and predictability, 36 is a nonnegative supermartingale, so Ville’s inequality gives
37
CSA maintains such an e-process for each threshold in a finite grid 38, with Bonferroni allocations such as 39 (Khosravi et al., 18 May 2026). Under monotonicity, the safe set is downward-closed in 40, so deployment uses the most permissive certified threshold:
41
abstaining if the certified set is empty (Khosravi et al., 18 May 2026). The selective risk is
42
The main theoretical guarantee is an anytime-pathwise selective-risk bound, uniformly for all horizons 43:
44
in the exact-stability regime, with a more detailed bound that includes logarithmic factors and an epoch-wise drift pad in the general case (Khosravi et al., 18 May 2026). Additional results include rate-optimal certification time,
45
and a matching lower bound of 46, as well as a horizon-independent action-gap bound under stability (Khosravi et al., 18 May 2026).
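The per-threshold e-processes and the most-permissive-certified-threshold rule can be sketched as follows. This is a simplified illustration under several assumptions that the source does not confirm: the gate releases when a calibrated risk score is at most the threshold, the verifier outcome is observed on every round, the increment is `failed - budget` on released rounds, the betting fraction `lam` is constant, and the Bonferroni split is uniform. Epochs, drift pads, and predictable betting fractions are omitted, and `max_certified_threshold` is a hypothetical name.

```python
def max_certified_threshold(stream, thresholds, budget=0.1, delta=0.05, lam=0.5):
    """Return the largest threshold certified as safe, or None to abstain.

    stream:     iterable of (risk_score, failed) pairs, failed being a bool.
    thresholds: finite grid; a round is released at tau when risk_score <= tau.
    budget:     target failure rate among released outputs.
    delta:      total error level, split evenly across the grid.
    """
    if not 0.0 < lam <= 1.0 / (1.0 - budget):
        raise ValueError("lam too large: wealth could become negative")
    level = delta / len(thresholds)  # Bonferroni allocation per threshold
    wealth = {tau: 1.0 for tau in thresholds}
    certified = set()
    for score, failed in stream:
        for tau in thresholds:
            if score <= tau:  # this threshold's gate would release the output
                excess = float(failed) - budget
                # Under the unsafe null the excess has nonnegative mean, so
                # this factor has expectation at most 1.
                wealth[tau] *= 1.0 - lam * excess
                if wealth[tau] >= 1.0 / level:
                    certified.add(tau)
    return max(certified) if certified else None


stream = [(0.1, False), (0.6, True)] * 100
tau_hat = max_certified_threshold(stream, thresholds=[0.2, 0.5, 0.8])
```

In this toy stream, outputs with risk score 0.6 always fail, so the threshold 0.8 never accumulates evidence of safety, while 0.2 and 0.5 release only the safe outputs and are both certified; the rule then deploys 0.5 as the more permissive of the two.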
The empirical evidence spans eight specialist benchmarks comprising 47 streams, sixteen adversarial distribution-shift cells comprising 48 streams, and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families comprising 49 rounds (Khosravi et al., 18 May 2026). The reported result is that CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell; on MedQA, HEAD-QA, ARC-Challenge, CaseHOLD, and MATH, CSA achieves 50 with action rates at least 51 whenever the budget admits acting, and under live HEAD-QA with Med42-8B and 52, CSA attains 53 with 54 (Khosravi et al., 18 May 2026).
This deployment perspective broadens the meaning of certified robustness. The paper explicitly distinguishes it from adversarial robustness certificates: the guarantee is distribution-free selective risk over the realized adaptive stream, induced by test supermartingales and evaluated against the RLVR filtration, rather than worst-case per-input invariance (Khosravi et al., 18 May 2026). A common misconception is therefore that “certified robustness” always refers to perturbation balls around single inputs; in the deployment setting, the certified object is the pathwise safety of the release policy.
6. Assumptions, limitations, and broader implications
Anytime-valid certified robustness depends critically on explicit modeling assumptions. In ASR, the threat model is additive white Gaussian noise with independent samples, and the probability-to-radius conversion uses the Cohen-style smoothing formula with Gaussian noise; the resulting robustness is mapped to $\ell_2$ perturbations (Cullen et al., 26 Jun 2026). In sequential randomized smoothing for images, validity is with respect to the Bernoulli sampling model induced by noisy evaluations of a fixed target class, and the continuous radius is derived from a lower confidence bound on the target-class probability (Cullen et al., 26 Jun 2026). In CSA for RLVR deployment, guarantees rely on predictable updates, monotone risk via isotonic calibration, nested gates, and bounded within-epoch frontier drift for no-false-certification (Khosravi et al., 18 May 2026).
Several limitations recur across domains. The ASR framework notes that unseen tokens in the discovery phase cannot be certified, that larger 57 increases recall at extra cost, that aggressive betting fractions can increase variance, and that nouns, verbs, and proper nouns have lower raw accuracy and certification margins (Cullen et al., 26 Jun 2026). The image-certification framework notes the anytime-valid tax, the need for an offline dataset to train priors, and the possibility that bankruptcy exits can conservatively reject borderline robust cases (Cullen et al., 26 Jun 2026). The RLVR deployment framework notes that guarantees are only with respect to the deterministic verifier 58, so harms beyond the verifier’s scope must be addressed by other safeguards (Khosravi et al., 18 May 2026).
The papers also identify important methodological contrasts. Classical randomized smoothing is fixed-horizon and majority-class based; sequence outputs require hierarchical aggregation rather than majority voting (Cullen et al., 26 Jun 2026, Cullen et al., 26 Jun 2026). Conformal sets provide coverage under exchangeability, but standard conformal prediction is not inherently anytime-valid and does not directly certify adversarial invariance radii; offline conformal-risk wrappers use fixed calibration sets, and online conformal methods control long-run averages rather than providing simultaneous control over all horizons (Cullen et al., 26 Jun 2026, Khosravi et al., 18 May 2026). Interval Bound Propagation and formal verification work well for image classifiers and regression network outputs but do not directly scale to high-dimensional discrete sequences, are not anytime-valid by default, and require architectural constraints (Cullen et al., 26 Jun 2026).
The broader implication, stated most explicitly in the ASR work, is that betting-based tests, e-values, and confidence sequences provide uniform validity over time and immunity to peeking, enabling practical, compute-aware certification pipelines for complex, high-dimensional outputs and delivering actionable trust scores without ground-truth during deployment (Cullen et al., 26 Jun 2026). The same paper proposes direct generalization to NLP, vision, and multimodal settings by redefining the atomic units—subwords or words, patches or objects, modality-specific units—and replacing WER with task-appropriate structural losses while preserving optional-stopping validity via martingales and e-values (Cullen et al., 26 Jun 2026). This suggests a unifying research program: anytime-valid certified robustness is a sequential statistical interface between model outputs and operational decisions, with perturbation certificates, sequence certificates, and deployment-risk certificates as domain-specific realizations of the same time-uniform testing principle.