
Antidistillation Fingerprinting (ADFP)

Updated 5 February 2026
  • Antidistillation Fingerprinting (ADFP) is a method that embeds detectable signatures in student models trained on teacher outputs via gradient-driven logit perturbations.
  • It leverages a proxy model to align fingerprint signals with expected student updates, thereby enhancing detection accuracy over heuristic watermarking techniques.
  • Empirical evaluations on benchmarks show ADFP achieves superior statistical performance and minimal loss of model utility compared to traditional methods.

Antidistillation Fingerprinting (ADFP) is a statistically grounded approach for robustly detecting whether a machine learning model—or LLM in particular—has been trained, wholly or in part, on outputs sampled from a specified “teacher” model. In contrast to earlier heuristic logit perturbation or token-biasing schemes, ADFP constructs fingerprints by aligning the fingerprinting signal with the expected learning dynamics of a distillation-trained student model. This paradigm is motivated by the need for verifiable attribution in the context of LLMs and is operationalized via gradient-based manipulation of generation distributions to maximize post-distillation detectability without compromising model utility (Xu et al., 3 Feb 2026).

1. Theoretical Objective and Methodological Distinction

The central objective of Antidistillation Fingerprinting is to generate training data from a teacher model that, when ingested by a student model via distillation, results in a persistent statistical signature detectable via black-box queries. Formally, let 𝒟 be the data distribution, θ_t the teacher, θ_s a student fine-tuned on teacher outputs, and H(x,k) ⊆ V the key-dependent “green list” of tokens for prompt x under owner secret k, with V the vocabulary. The detection statistic is the per-context green-token probability:

D(\theta; x, k) = \sum_{t \in H(x,k)} q_\theta(t \mid x)

where q_θ(t|x) is the probability of token t under θ.
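As a concrete sketch, D(θ; x, k) can be computed from a model’s next-token distribution given a keyed green list. The hash-based green-list construction below is purely illustrative — the paper does not specify this exact scheme — and `gamma` denotes the assumed green-list fraction of the vocabulary:

```python
import hashlib
import numpy as np

def green_list(context: str, key: str, vocab_size: int, gamma: float = 0.25) -> np.ndarray:
    """Key-dependent green list H(x, k): a pseudorandom subset covering a
    fraction gamma of the vocabulary, seeded by hashing the secret key
    together with the context. (Illustrative construction only.)"""
    seed = int.from_bytes(hashlib.sha256((key + context).encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    ids = rng.permutation(vocab_size)[: int(gamma * vocab_size)]
    mask = np.zeros(vocab_size, dtype=bool)
    mask[ids] = True
    return mask

def detection_statistic(probs: np.ndarray, context: str, key: str) -> float:
    """D(theta; x, k): total next-token probability mass on the green list."""
    mask = green_list(context, key, probs.shape[0])
    return float(probs[mask].sum())
```

Under the null hypothesis (no fingerprint absorbed), this statistic concentrates around gamma, since green-list membership is pseudorandom with respect to the model’s distribution.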

Distilling the fingerprinting objective, ADFP seeks logit perturbations Δ such that, after sampling with these perturbations and subsequent fine-tuning of a student on the resulting data, the expected detectability in θ_s is maximized:

\mathbb{E}_{x \sim \mathcal{D}} \; \mathbb{E}_{t \sim \mathrm{Softmax}\left( (z_t(x) + \Delta(x)) / \tau \right)} \left[ D(\theta_s(\Delta); x, k) \right]

where z_t(x) is the teacher’s logit vector and τ is the sampling temperature.

In contrast, prior watermark and fingerprint schemes (e.g., red-green list watermarks [Kirchenbauer et al. 2023]) apply fixed logit boosts to “green” tokens, ignoring the effect on the student’s parameter updates. ADFP instead applies gradient-based logit perturbations derived to maximize green-list detectability post-distillation, explicitly aligning the perturbation direction with anticipated parameter shifts (Xu et al., 3 Feb 2026).

2. Gradient-Based Fingerprint Perturbation Derivation

Building on the Antidistillation Sampling (ADS) framework [Savani et al. 2025], ADFP employs a proxy model θ_p that approximates the student’s update trajectory during fine-tuning. Given context x, define z = z(·|x; θ_p) and q = softmax(z). For green list S = H(x,k), the instantaneous fingerprint loss is L(x) = ∑_{t∈S} q_t.

To maximize this quantity with respect to student updates, the optimal per-token perturbation Δ_t aligns with the dot product of gradients:

\Delta_t \propto \left\langle \nabla_{\theta_p} \log q_t, \; \nabla_{\theta_p} L \right\rangle

Under an isotropic approximation of the intermediate Jacobian, this admits the closed form:

\Delta^{\mathrm{ADS}}_t = q_t \left( \mathbf{1}_{t \in S} - L \right)

Thus, tokens in the green list with high conditional probability under the proxy are preferentially boosted, while tokens outside the list are slightly suppressed. This construction explicitly targets the tokens whose increased teacher logits most rapidly amplify the persistent signature after fine-tuning.
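The closed form is cheap to evaluate from the proxy’s logits alone. A minimal numpy sketch, assuming a boolean green-list mask over the vocabulary:

```python
import numpy as np

def adfp_perturbation(proxy_logits: np.ndarray, green_mask: np.ndarray) -> np.ndarray:
    """Closed-form perturbation Delta^ADS_t = q_t * (1[t in S] - L), where q is
    the proxy's softmax and L is the total green-token probability. Sketch of
    the expression above under the isotropic-Jacobian approximation."""
    z = proxy_logits - proxy_logits.max()        # shift for numerical stability
    q = np.exp(z) / np.exp(z).sum()              # proxy distribution q
    L = q[green_mask].sum()                      # fingerprint loss L(x)
    return q * (green_mask.astype(float) - L)    # boost green, suppress the rest
```

Note that Δ^ADS sums to zero over the vocabulary (∑_t q_t(1_{t∈S} − L) = L − L = 0), so the perturbation redistributes mass toward the green list rather than inflating the overall logit scale.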

3. ADFP Sampling Algorithm

The operational ADFP sampling method proceeds as follows:

Input: Teacher θₜ, Proxy θₚ, context x₁:ₗ, key k, penalty λ, temp τ, window w
1. Compute green-list S ← H(x_{l−w+1:l}, k)
2. Query proxy: q ← softmax(z(·|x₁:ₗ;θₚ))
3. Compute L ← ∑_{t∈S} q_t; for all t in V: Δ^{ADS}_t ← q_t (1_{t∈S} − L)
4. Query teacher logits: zₜ ← z(·|x₁:ₗ;θₜ)
5. Perturb & sample:  ẑ ← zₜ + λ Δ^{ADS}; x_{l+1} ∼ Softmax(ẑ/τ)
6. Return x_{l+1}

Hyperparameters include the penalty λ (regulating fingerprint strength), temperature τ, and window size w for green-list computation (Xu et al., 3 Feb 2026).
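Steps 2–5 of the algorithm above can be sketched in a few lines. The function name and the numpy-based interface are illustrative, not the paper’s reference implementation:

```python
import numpy as np

def adfp_sample_step(teacher_logits, proxy_logits, green_mask,
                     lam=7.0, tau=1.0, rng=None):
    """One ADFP generation step: perturb the teacher's logits with
    lambda * Delta^ADS computed from the proxy, then sample the next
    token at temperature tau."""
    rng = rng or np.random.default_rng()
    zp = proxy_logits - proxy_logits.max()
    q = np.exp(zp) / np.exp(zp).sum()              # step 2: proxy softmax
    L = q[green_mask].sum()                        # step 3: fingerprint loss
    delta = q * (green_mask.astype(float) - L)     # step 3: Delta^ADS
    z_hat = teacher_logits + lam * delta           # step 5: perturb teacher logits
    p = np.exp((z_hat - z_hat.max()) / tau)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))            # step 5: sample next token
```

Looping this step per generated token, with the green list recomputed from the trailing window of w tokens (step 1), yields the fingerprinted training corpus.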

4. Detection Protocol and Statistical Test

Given a (possibly closed-weight) student model θ_s, detection is conducted by evaluating the mean green-token probability over an evaluation set 𝒳 = {x_i}_{i=1}^n:

\mathrm{GTP}(\mathcal{X}, \theta_s, k) = \frac{1}{n} \sum_{i=1}^{n} \Pr_{t \sim \theta_s}\left[ t \in H(x_i, k) \right]

Under the null hypothesis that the student has not absorbed the fingerprint (i.e., green-list membership is random), the mean equals the green-list fraction γ of the vocabulary; upward deviations can be quantified with Hoeffding’s inequality:

p = \exp\left( -2n \left( g_{\mathrm{obs}} - \gamma \right)^2 \right)

where g_obs is the observed GTP. Thresholding p yields a statistically controlled false-positive rate for the declaration “model trained on fingerprinted data.”
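The decision rule is simple enough to state directly in code. A sketch, noting that the one-sided Hoeffding bound applies when g_obs ≥ γ (otherwise there is no evidence of a fingerprint and we return p = 1):

```python
import math

def fingerprint_pvalue(green_probs, gamma=0.25):
    """One-sided Hoeffding bound on the mean green-token probability:
    p = exp(-2 n (g_obs - gamma)^2) for g_obs >= gamma.
    green_probs: per-prompt Pr[t in H(x_i, k)] measured on the suspect model."""
    n = len(green_probs)
    g_obs = sum(green_probs) / n
    if g_obs <= gamma:          # no excess green mass: no evidence of fingerprint
        return 1.0
    return math.exp(-2 * n * (g_obs - gamma) ** 2)
```

Thresholding the returned p at, e.g., 10⁻³ then bounds the false-positive rate at that level for a non-fingerprinted model.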

5. Empirical Performance and Pareto Analysis

Experimental results on the GSM8K and OASST1 benchmarks validate ADFP’s efficacy and utility trade-off. Compared to red-green list baselines, ADFP consistently yields lower p-values (stronger detection) at the same or higher answer accuracy or negative log-likelihood (minimizing utility loss). For instance, on GSM8K with a closed-weight student (Distil-Qwen2.5-3B) at teacher accuracy 40%, the baseline achieves p = 2×10⁻² while ADFP yields p = 3×10⁻³. Further, at a 0% false positive rate, ADFP achieves a true positive rate of 55% vs the baseline’s 24%, and area under the ROC curve (AUC) improvements of 0.10–0.15 are observed consistently.

A selection of resulting metrics:

Setting                    Teacher Acc   Baseline p-value   ADFP p-value
λ/δ = 7 (GSM8K closed)     40%           2×10⁻²             3×10⁻³
λ/δ = 14                   28%           5×10⁻³             1×10⁻⁴

These results substantiate the claimed Pareto improvement: fingerprint detectability is strengthened without an equivalent sacrifice in generative quality.

6. Limitations, Critical Assumptions, and Extensions

  • Proxy-student mismatch: The accuracy of ADFP depends on the fidelity of the proxy θ_p in approximating the actual student. Divergence in architecture or learning rate can diminish detectability. Ensemble proxies or post-hoc adaptive refinement are possible mitigations.
  • Domain shift: If the semantic effect of the fingerprint (green-list token probabilities) changes across domains seeded by the teacher and student, detectability may deteriorate. Context-dependent fingerprinting or higher-level functional fingerprints are possible extensions.
  • Partial fingerprinting: ADFP remains robust when as little as approximately 20% of the training data is fingerprinted; below this, detection rates decrease. Adaptive thresholds or mixed-key strategies could extend coverage.
  • Adversarial “unlearning”: A determined student could apply adversarial fine-tuning to attenuate the fingerprint. This poses an open challenge in defending persistent attribution schemes.
  • One-step update approximation: The present methodology employs a first-order surrogate for learning dynamics; considering multi-step or higher-order gradients may further enhance the method’s precision or resilience.

This suggests that ADFP occupies a robust niche for reliable black-box fingerprinting of distillation-extracted models under realistic, non-collusive threat models—with both utility and statistical power outperforming prior token-perturbation approaches (Xu et al., 3 Feb 2026).

7. Connections to Prior Fingerprinting and Conferrable Adversarial Examples

Earlier fingerprinting work in neural networks for model extraction detection, notably Conferrable-Example Fingerprinting (CEF) (Lukas et al., 2019), operates on the generation of conferrable adversarial examples. These targeted adversarial instances are “conferrable” in that they transfer from the source to surrogates but not to references independently trained on ground truth. The fingerprint is constructed by optimizing a composite objective so that only surrogates—typically produced by distillation or retraining on teacher labels—exhibit the misclassification on these inputs. CEF achieves perfect separability (ROC AUC = 1.0) under a wide array of distillation and model modification attacks and substantiates the “antidistillation” moniker by demonstrating resilience to knowledge-distillation-style model-extraction attacks. ADFP extends this paradigm into the sequence modeling and LLM context with sampling-aligned, training-dynamic-aware fingerprints, preserving detection efficacy even as models and domains grow in scale and heterogeneity (Lukas et al., 2019, Xu et al., 3 Feb 2026).

