
Antidistillation Fingerprinting (ADFP)

Updated 5 February 2026
  • Antidistillation Fingerprinting (ADFP) is a method that embeds detectable signatures in student models trained on teacher outputs via gradient-driven logit perturbations.
  • It leverages a proxy model to align fingerprint signals with expected student updates, thereby enhancing detection accuracy over heuristic watermarking techniques.
  • Empirical evaluations on benchmarks show ADFP achieves superior statistical performance and minimal loss of model utility compared to traditional methods.

Antidistillation Fingerprinting (ADFP) is a statistically grounded approach for robustly detecting whether a machine learning model—or LLM in particular—has been trained, wholly or in part, on outputs sampled from a specified “teacher” model. In contrast to earlier heuristic logit perturbation or token-biasing schemes, ADFP constructs fingerprints by aligning the fingerprinting signal with the expected learning dynamics of a distillation-trained student model. This paradigm is motivated by the need for verifiable attribution in the context of LLMs and is operationalized via gradient-based manipulation of generation distributions to maximize post-distillation detectability without compromising model utility (Xu et al., 3 Feb 2026).

1. Theoretical Objective and Methodological Distinction

The central objective of Antidistillation Fingerprinting is to generate training data from a teacher model that, when ingested by a student model via distillation, results in a persistent statistical signature detectable via black-box queries. Formally, let 𝒟 be the data distribution, θ_t the teacher, θ_s a student fine-tuned on teacher outputs, and H(x,k) ⊆ V the key-dependent “green list” of tokens for prompt x under owner secret k, with V the vocabulary. The detection statistic is the per-context green-token probability:

D(\theta; x, k) = \sum_{t \in H(x,k)} q_\theta(t \mid x)

where q_θ(t|x) is the probability of token t under θ.
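As a concrete sketch, D(θ; x, k) can be computed from a model’s next-token distribution given a keyed green list. The hash-based green-list construction below is purely illustrative — the paper does not specify this exact scheme — and `gamma` denotes the assumed green-list fraction of the vocabulary:

```python
import hashlib
import numpy as np

def green_list(context: str, key: str, vocab_size: int, gamma: float = 0.25) -> np.ndarray:
    """Key-dependent green list H(x, k): a pseudorandom subset covering a
    fraction gamma of the vocabulary, seeded by hashing the secret key
    together with the context. (Illustrative construction only.)"""
    seed = int.from_bytes(hashlib.sha256((key + context).encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    ids = rng.permutation(vocab_size)[: int(gamma * vocab_size)]
    mask = np.zeros(vocab_size, dtype=bool)
    mask[ids] = True
    return mask

def detection_statistic(probs: np.ndarray, context: str, key: str) -> float:
    """D(theta; x, k): total next-token probability mass on the green list."""
    mask = green_list(context, key, probs.shape[0])
    return float(probs[mask].sum())
```

Under the null hypothesis (no fingerprint absorbed), this statistic concentrates around gamma, since green-list membership is pseudorandom with respect to the model’s distribution.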

Distilling the fingerprinting objective, ADFP seeks logit perturbations Δ such that, after sampling with these perturbations and subsequent fine-tuning of a student on the resulting data, the expected detectability in θ_s is maximized:

\mathbb{E}_{x \sim \mathcal{D}} \; \mathbb{E}_{t \sim \mathrm{Softmax}\left( (z_t(x) + \Delta(x)) / \tau \right)} \left[ D(\theta_s(\Delta); x, k) \right]

where z_t(x) is the teacher’s logit vector and τ is the sampling temperature.

In contrast, prior watermark and fingerprint schemes (e.g., red-green list watermarks [Kirchenbauer et al. 2023]) apply fixed logit boosts to “green” tokens, ignoring the effect on the student’s parameter updates. ADFP instead applies gradient-based logit perturbations derived to maximize green-list detectability post-distillation, explicitly aligning the perturbation direction with anticipated parameter shifts (Xu et al., 3 Feb 2026).

2. Gradient-Based Fingerprint Perturbation Derivation

Building on the Antidistillation Sampling (ADS) framework [Savani et al. 2025], ADFP employs a proxy model θ_p that approximates the student’s update trajectory during fine-tuning. Given context x, define z = z(·|x; θ_p) and q = softmax(z). For green list S = H(x,k), the instantaneous fingerprint loss is L(x) = ∑_{t∈S} q_t.

To maximize this quantity with respect to student updates, the optimal per-token perturbation Δ_t aligns with the dot product of gradients:

\Delta_t \propto \left\langle \nabla_{\theta_p} \log q_t, \; \nabla_{\theta_p} L \right\rangle

Under an isotropic approximation of the intermediate Jacobian, this admits the closed form:

\Delta^{\mathrm{ADS}}_t = q_t \left( \mathbf{1}_{t \in S} - L \right)

Thus, tokens in the green list with high conditional probability under the proxy are preferentially boosted, while tokens outside the list are slightly suppressed. This construction explicitly targets the tokens whose increased teacher logits most rapidly amplify the persistent signature after fine-tuning.
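The closed form is cheap to evaluate from the proxy’s logits alone. A minimal numpy sketch, assuming a boolean green-list mask over the vocabulary:

```python
import numpy as np

def adfp_perturbation(proxy_logits: np.ndarray, green_mask: np.ndarray) -> np.ndarray:
    """Closed-form perturbation Delta^ADS_t = q_t * (1[t in S] - L), where q is
    the proxy's softmax and L is the total green-token probability. Sketch of
    the expression above under the isotropic-Jacobian approximation."""
    z = proxy_logits - proxy_logits.max()        # shift for numerical stability
    q = np.exp(z) / np.exp(z).sum()              # proxy distribution q
    L = q[green_mask].sum()                      # fingerprint loss L(x)
    return q * (green_mask.astype(float) - L)    # boost green, suppress the rest
```

Note that Δ^ADS sums to zero over the vocabulary (∑_t q_t(1_{t∈S} − L) = L − L = 0), so the perturbation redistributes mass toward the green list rather than inflating the overall logit scale.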

3. ADFP Sampling Algorithm

The operational ADFP sampling method proceeds as follows:

Input: Teacher θₜ, Proxy θₚ, context x₁:ₗ, key k, penalty λ, temp τ, window w
1. Compute green-list S ← H(x_{l−w+1:l}, k)
2. Query proxy: q ← softmax(z(·|x₁:ₗ;θₚ))
3. Compute L ← ∑_{t∈S} q_t; for all t in V: Δ^{ADS}_t ← q_t (1_{t∈S} − L)
4. Query teacher logits: zₜ ← z(·|x₁:ₗ;θₜ)
5. Perturb & sample:  ẑ ← zₜ + λ Δ^{ADS}; x_{l+1} ∼ Softmax(ẑ/τ)
6. Return x_{l+1}

Hyperparameters include the penalty λ (regulating fingerprint strength), temperature τ, and window size w for green-list computation (Xu et al., 3 Feb 2026).
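Steps 2–5 of the algorithm above can be sketched in a few lines. The function name and the numpy-based interface are illustrative, not the paper’s reference implementation:

```python
import numpy as np

def adfp_sample_step(teacher_logits, proxy_logits, green_mask,
                     lam=7.0, tau=1.0, rng=None):
    """One ADFP generation step: perturb the teacher's logits with
    lambda * Delta^ADS computed from the proxy, then sample the next
    token at temperature tau."""
    rng = rng or np.random.default_rng()
    zp = proxy_logits - proxy_logits.max()
    q = np.exp(zp) / np.exp(zp).sum()              # step 2: proxy softmax
    L = q[green_mask].sum()                        # step 3: fingerprint loss
    delta = q * (green_mask.astype(float) - L)     # step 3: Delta^ADS
    z_hat = teacher_logits + lam * delta           # step 5: perturb teacher logits
    p = np.exp((z_hat - z_hat.max()) / tau)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))            # step 5: sample next token
```

Looping this step per generated token, with the green list recomputed from the trailing window of w tokens (step 1), yields the fingerprinted training corpus.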

4. Detection Protocol and Statistical Test

Given a (possibly closed-weight) student model θ_s, detection is conducted by evaluating the mean green-token probability over an evaluation set 𝒳 = {x_i}_{i=1}^n:

\mathrm{GTP}(\mathcal{X}, \theta_s, k) = \frac{1}{n} \sum_{i=1}^{n} \Pr_{t \sim \theta_s}\left[ t \in H(x_i, k) \right]

Under the null hypothesis that the student has not absorbed the fingerprint (i.e., green-list membership is random), the mean equals the green-list fraction γ of the vocabulary; upward deviations can be quantified with Hoeffding’s inequality:

p = \exp\left( -2n \left( g_{\mathrm{obs}} - \gamma \right)^2 \right)

where g_obs is the observed GTP. Thresholding p yields a statistically controlled false-positive rate for the declaration “model trained on fingerprinted data.”
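The decision rule is simple enough to state directly in code. A sketch, noting that the one-sided Hoeffding bound applies when g_obs ≥ γ (otherwise there is no evidence of a fingerprint and we return p = 1):

```python
import math

def fingerprint_pvalue(green_probs, gamma=0.25):
    """One-sided Hoeffding bound on the mean green-token probability:
    p = exp(-2 n (g_obs - gamma)^2) for g_obs >= gamma.
    green_probs: per-prompt Pr[t in H(x_i, k)] measured on the suspect model."""
    n = len(green_probs)
    g_obs = sum(green_probs) / n
    if g_obs <= gamma:          # no excess green mass: no evidence of fingerprint
        return 1.0
    return math.exp(-2 * n * (g_obs - gamma) ** 2)
```

Thresholding the returned p at, e.g., 10⁻³ then bounds the false-positive rate at that level for a non-fingerprinted model.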

5. Empirical Performance and Pareto Analysis

Experimental results on the GSM8K and OASST1 benchmarks validate ADFP’s efficacy and utility trade-off. Compared to red-green list baselines, ADFP consistently yields lower p-values (stronger detection) at the same or higher answer accuracy or negative log-likelihood (minimizing utility loss). For instance, on GSM8K with a closed-weight student (Distil-Qwen2.5-3B) at teacher accuracy 40%, the baseline achieves p = 2×10⁻² while ADFP yields p = 3×10⁻³. Further, at a 0% false positive rate, ADFP achieves a true positive rate of 55% vs the baseline’s 24%, and area under the ROC curve (AUC) improvements of 0.10–0.15 are observed consistently.

A selection of resulting metrics:

Setting                    Teacher Acc   Baseline p-value   ADFP p-value
λ/δ = 7 (GSM8K closed)     40%           2×10⁻²             3×10⁻³
λ/δ = 14                   28%           5×10⁻³             1×10⁻⁴

These results substantiate the claimed Pareto improvement: fingerprint detectability is strengthened without an equivalent sacrifice in generative quality.

6. Limitations, Critical Assumptions, and Extensions

  • Proxy-student mismatch: The accuracy of ADFP depends on the fidelity of the proxy θ_p in approximating the actual student. Divergence in architecture or learning rate can diminish detectability. Ensemble proxies or post-hoc adaptive refinement are possible mitigations.
  • Domain shift: If the semantic effect of the fingerprint (green-list token probabilities) changes across domains seeded by the teacher and student, detectability may deteriorate. Context-dependent fingerprinting or higher-level functional fingerprints are possible extensions.
  • Partial fingerprinting: ADFP remains robust when as little as approximately 20% of the training data is fingerprinted; below this, detection rates decrease. Adaptive thresholds or mixed-key strategies could extend coverage.
  • Adversarial “unlearning”: A determined student could apply adversarial fine-tuning to attenuate the fingerprint. This poses an open challenge in defending persistent attribution schemes.
  • One-step update approximation: The present methodology employs a first-order surrogate for learning dynamics; considering multi-step or higher-order gradients may further enhance the method’s precision or resilience.

This suggests that ADFP occupies a robust niche for reliable black-box fingerprinting of distillation-extracted models under realistic, non-collusive threat models—with both utility and statistical power outperforming prior token-perturbation approaches (Xu et al., 3 Feb 2026).

7. Connections to Prior Fingerprinting and Conferrable Adversarial Examples

Earlier fingerprinting work in neural networks for model extraction detection, notably Conferrable-Example Fingerprinting (CEF) (Lukas et al., 2019), operates on the generation of conferrable adversarial examples. These targeted adversarial instances are “conferrable” in that they transfer from the source to surrogates but not to references independently trained on ground truth. The fingerprint is constructed by optimizing a composite objective so that only surrogates—typically produced by distillation or retraining on teacher labels—exhibit the misclassification on these inputs. CEF achieves perfect separability (ROC AUC = 1.0) under a wide array of distillation and model modification attacks and substantiates the “antidistillation” moniker by demonstrating resilience to knowledge-distillation-style model-extraction attacks. ADFP extends this paradigm into the sequence modeling and LLM context with sampling-aligned, training-dynamic-aware fingerprints, preserving detection efficacy even as models and domains grow in scale and heterogeneity (Lukas et al., 2019, Xu et al., 3 Feb 2026).

