Antidistillation Fingerprinting (ADFP)
- Antidistillation Fingerprinting (ADFP) is a method that embeds detectable signatures in student models trained on teacher outputs via gradient-driven logit perturbations.
- It leverages a proxy model to align fingerprint signals with expected student updates, thereby enhancing detection accuracy over heuristic watermarking techniques.
- Empirical evaluations on benchmarks show ADFP achieves superior statistical performance and minimal loss of model utility compared to traditional methods.
Antidistillation Fingerprinting (ADFP) is a statistically grounded approach for robustly detecting whether a machine learning model—or LLM in particular—has been trained, wholly or in part, on outputs sampled from a specified “teacher” model. In contrast to earlier heuristic logit perturbation or token-biasing schemes, ADFP constructs fingerprints by aligning the fingerprinting signal with the expected learning dynamics of a distillation-trained student model. This paradigm is motivated by the need for verifiable attribution in the context of LLMs and is operationalized via gradient-based manipulation of generation distributions to maximize post-distillation detectability without compromising model utility (Xu et al., 3 Feb 2026).
1. Theoretical Objective and Methodological Distinction
The central objective of Antidistillation Fingerprinting is to generate training data from a teacher model that, when ingested by a student model via distillation, results in a persistent statistical signature detectable via black-box queries. Formally, let $\mathcal{D}$ be the data distribution, $\theta_t$ the teacher, $\theta_s$ a student fine-tuned on teacher outputs, and $S(x, k) \subset V$ the key-dependent "green list" of tokens for prompt $x$ under owner secret $k$, with $V$ the vocabulary. The detection statistic is the per-context green-token probability:

$$\mathrm{GTP}(x; \theta_s) = \sum_{t \in S(x,k)} p(t \mid x; \theta_s),$$

where $p(t \mid x; \theta_s)$ is the probability of token $t$ under $\theta_s$.
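As a concrete sketch of the detection statistic, the snippet below derives a pseudorandom green list from a context window and secret key, then sums student probability mass over it. The helper names (`green_list`, `green_token_probability`) and the SHA-256 seeding are illustrative assumptions; the keyed-hash construction follows the red-green watermark convention cited below, not a published ADFP implementation.

```python
import hashlib

import numpy as np

def green_list(context_window, key, vocab_size, gamma=0.25):
    """Derive a pseudorandom 'green' token subset S(x, k) from the
    trailing context window and the owner's secret key (hypothetical scheme)."""
    seed_bytes = hashlib.sha256((key + "|" + context_window).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(seed_bytes[:8], "little"))
    n_green = int(gamma * vocab_size)
    return set(rng.choice(vocab_size, size=n_green, replace=False).tolist())

def green_token_probability(student_probs, green):
    """GTP(x): total student probability mass on green tokens for one context."""
    return float(sum(student_probs[t] for t in green))
```

With a uniform student distribution, GTP equals the green fraction γ, which is exactly the null-hypothesis baseline used by the detection test in Section 4.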
Formalizing the fingerprinting objective, ADFP seeks logit perturbations $\Delta$ such that, after sampling with these perturbations and subsequent fine-tuning of a student on the resulting data, the expected detectability in $\theta_s$ is maximized:

$$\max_{\Delta} \;\; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathrm{GTP}\big(x; \theta_s(\Delta)\big) \right], \qquad x_{l+1} \sim \mathrm{Softmax}\!\big((z_t + \lambda \Delta)/\tau\big),$$

where $z_t$ is the teacher's logit vector, $\lambda$ the perturbation strength, and $\tau$ the sampling temperature.
Distinctly, prior watermark or fingerprint schemes (e.g., red-green list watermarks [Kirchenbauer et al. 2023]) apply fixed logit boosts to “green” tokens, ignoring the effect on the student’s parameter updates. ADFP instead applies gradient-based logit perturbations derived to maximize green-list detectability post-distillation, explicitly aligning the perturbation direction with anticipated parameter shifts (Xu et al., 3 Feb 2026).
2. Gradient-Based Fingerprint Perturbation Derivation
Building on the Antidistillation Sampling (ADS) framework [Savani et al. 2025], ADFP employs a proxy model $\theta_p$ that approximates the student's update trajectory during fine-tuning. Given context $x_{1:l}$, define $q = \mathrm{softmax}(z(\cdot \mid x_{1:l}; \theta_p))$ and $L = \sum_{t \in S} q_t$. For green list $S(x, k)$, the instantaneous fingerprint objective is the green mass $L$ itself.

To maximize this quantity with respect to student updates, the optimal per-token perturbation aligns with the dot product of gradients:

$$\Delta_t \;\propto\; \left\langle \nabla_\theta \log p(t \mid x_{1:l}; \theta_p),\; \nabla_\theta L(\theta_p) \right\rangle.$$

Under an isotropic approximation of the intermediate Jacobian, this admits the closed form:

$$\Delta^{\mathrm{ADS}}_t = q_t \left( \mathbf{1}_{t \in S} - L \right).$$
Thus, tokens in the green list with high conditional probability under the proxy are preferentially boosted, while tokens outside the list are slightly suppressed. This construction explicitly targets the tokens whose boosted teacher logits most rapidly amplify the persistent signature after fine-tuning.
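In code, the closed-form perturbation is a one-liner over the proxy distribution. A useful sanity property, which the sketch below preserves, is that $\Delta^{\mathrm{ADS}}$ sums to zero over the vocabulary (it is the softmax gradient of the green mass), so it redistributes logit mass rather than adding it. This is a minimal numpy sketch, not a reference implementation:

```python
import numpy as np

def adfp_perturbation(proxy_logits, green_ids):
    """Closed-form Δ^ADS_t = q_t (1_{t∈S} − L): boost high-probability
    green tokens under the proxy, mildly suppress all others."""
    q = np.exp(proxy_logits - proxy_logits.max())
    q /= q.sum()                       # proxy distribution q = softmax(z_p)
    indicator = np.zeros_like(q)
    indicator[list(green_ids)] = 1.0   # 1_{t ∈ S}
    L = q[list(green_ids)].sum()       # green mass under the proxy
    return q * (indicator - L)         # zero-sum perturbation vector
```

Green entries are nonnegative (scaled by $1 - L \ge 0$) and non-green entries nonpositive, matching the boost/suppress behavior described above.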
3. ADFP Sampling Algorithm
The operational ADFP sampling method proceeds as follows:
```text
Input: Teacher θₜ, Proxy θₚ, context x₁:ₗ, key k, penalty λ, temp τ, window w
1. Compute green-list S ← H(x_{l−w+1:l}, k)
2. Query proxy: q ← softmax(z(·|x₁:ₗ; θₚ))
3. Compute L ← ∑_{t∈S} q_t; for all t ∈ V: Δ^{ADS}_t ← q_t (1_{t∈S} − L)
4. Query teacher logits: zₜ ← z(·|x₁:ₗ; θₜ)
5. Perturb & sample: ẑ ← zₜ + λ Δ^{ADS}; x_{l+1} ∼ Softmax(ẑ/τ)
6. Return x_{l+1}
```
Hyperparameters include the penalty λ (regulating fingerprint strength), temperature τ, and window size w for green-list computation (Xu et al., 3 Feb 2026).
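The sampling loop above can be instantiated in a few lines of numpy. In this sketch the green list $S$ is passed in precomputed, standing in for the keyed hash $H$; the function name and defaults for `lam` and `tau` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adfp_sample_step(teacher_logits, proxy_logits, green_ids, lam=4.0, tau=1.0, rng=None):
    """One ADFP decoding step: perturb teacher logits toward
    post-distillation detectability, then sample the next token."""
    rng = rng or np.random.default_rng()
    # Step 2-3: proxy distribution and closed-form perturbation Δ^ADS
    q = np.exp(proxy_logits - proxy_logits.max())
    q /= q.sum()
    L = q[list(green_ids)].sum()
    delta = q * (np.isin(np.arange(q.size), list(green_ids)) - L)
    # Step 5: perturb teacher logits and sample x_{l+1} ~ Softmax(ẑ/τ)
    z_hat = teacher_logits + lam * delta
    p = np.exp((z_hat - z_hat.max()) / tau)
    p /= p.sum()
    return int(rng.choice(p.size, p=p))
```

Because $\Delta^{\mathrm{ADS}}$ is zero-sum and scaled by proxy probabilities, moderate values of λ nudge sampling toward detectable tokens without drastically distorting the teacher's distribution.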
4. Detection Protocol and Statistical Test
Given a (possibly closed-weight) student model $\theta_s$, detection is conducted by evaluating the mean green-token probability over an evaluation set $D_{\mathrm{eval}}$:

$$\widehat{\mathrm{GTP}} = \frac{1}{|D_{\mathrm{eval}}|} \sum_{x \in D_{\mathrm{eval}}} \sum_{t \in S(x,k)} p(t \mid x; \theta_s).$$

Under the null hypothesis that the student has not absorbed the fingerprint (i.e., green-list membership is random), the mean is $\gamma = |S|/|V|$; deviations can be quantified with Hoeffding's inequality:

$$\Pr\left[\widehat{\mathrm{GTP}} \ge \gamma + \epsilon\right] \le \exp\!\left(-2\,|D_{\mathrm{eval}}|\,\epsilon^2\right),$$

where $\widehat{\mathrm{GTP}}$ is the observed GTP. Thresholding yields a statistically controlled false-positive rate for the declaration "model trained on fingerprinted data."
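The Hoeffding bound turns directly into a conservative one-sided p-value. A minimal sketch, assuming the green fraction γ and evaluation-set size are known:

```python
import math

def adfp_p_value(observed_gtp, gamma, n_eval):
    """One-sided Hoeffding p-value for the null E[GTP] = γ:
    p = exp(−2 n ε²) with ε = max(0, observed − γ)."""
    eps = max(0.0, observed_gtp - gamma)
    return math.exp(-2.0 * n_eval * eps * eps)
```

Observed GTP at or below γ yields p = 1 (no evidence), while even a modest excess over γ drives the p-value down exponentially in the evaluation-set size, which is why the test remains usable through black-box queries alone.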
5. Empirical Performance and Pareto Analysis
Experimental results on the GSM8K and OASST1 benchmarks validate ADFP's efficacy and utility trade-off. Compared to red-green list baselines, ADFP consistently yields lower $p$-values (stronger detection) at matched or better answer accuracy and negative log-likelihood (i.e., minimal utility loss). For instance, on GSM8K with a closed-weight student (Distil-Qwen2.5-3B), ADFP attains a markedly smaller detection $p$-value than the baseline at matched teacher accuracy. At a fixed false-positive rate, ADFP's true-positive rate exceeds the baseline's, and area-under-ROC-curve (AUC) improvements of $0.10$–$0.15$ are observed consistently.
A selection of resulting metrics:
| Setting | Teacher Acc | Baseline p-value | ADFP p-value |
|---|---|---|---|
| λ/δ = 7 (GSM8K, closed) | 40% | | |
| λ/δ = 14 | 28% | | |
These results substantiate the claimed Pareto improvement: fingerprint detectability is strengthened without an equivalent sacrifice in generative quality.
6. Limitations, Critical Assumptions, and Extensions
- Proxy-student mismatch: The accuracy of ADFP depends on the fidelity of the proxy in approximating the actual student. Divergence in architecture or learning rate can diminish detectability. Ensemble proxies or post-hoc adaptive refinement are possible mitigations.
- Domain shift: If the semantic effect of the fingerprint (green-list token probabilities) changes across domains seeded by the teacher and student, detectability may deteriorate. Context-dependent fingerprinting or higher-level functional fingerprints are possible extensions.
- Partial fingerprinting: ADFP remains robust when only a fraction of the training data is fingerprinted; below a certain coverage threshold, detection rates decrease. Adaptive thresholds or mixed-key stratagems could extend coverage.
- Adversarial “unlearning”: A determined student could apply adversarial fine-tuning to attenuate the fingerprint. This poses an open challenge in defending persistent attribution schemes.
- One-step update approximation: The present methodology employs a first-order surrogate for learning dynamics; considering multi-step or higher-order gradients may further enhance the method’s precision or resilience.
This suggests that ADFP occupies a robust niche for reliable black-box fingerprinting of distillation-extracted models under realistic, non-collusive threat models—with both utility and statistical power outperforming prior token-perturbation approaches (Xu et al., 3 Feb 2026).
7. Connections to Prior Fingerprinting and Conferrable Adversarial Examples
Earlier fingerprinting work in neural networks for model-extraction detection, notably Conferrable-Example Fingerprinting (CEF) (Lukas et al., 2019), operates on the generation of conferrable adversarial examples. These targeted adversarial instances are "conferrable" in that they transfer from the source model to its surrogates but not to reference models independently trained on ground truth. The fingerprint is constructed by optimizing a composite objective so that only surrogates—typically produced by distillation or retraining on teacher labels—exhibit the misclassification on these inputs. CEF achieves perfect separability under a wide array of distillation and model-modification attacks, substantiating the "antidistillation" moniker through resilience to knowledge-distillation-style model-extraction attacks. ADFP extends this paradigm to the sequence-modeling and LLM context with sampling-aligned, training-dynamic-aware fingerprints, preserving detection efficacy even as models and domains grow in scale and heterogeneity (Lukas et al., 2019, Xu et al., 3 Feb 2026).