
Refinement Provenance Inference (RPI)

Updated 12 January 2026
  • Refinement Provenance Inference (RPI) is the task of auditing fine-tuned LLMs to determine, per instance, whether a training prompt was used in its raw form or after LLM-based refinement.
  • Detection leverages teacher-forced token statistics such as normalized NLL, top-$k$ inclusion, and confidence margin, which expose systematic distributional shifts.
  • The RePro framework employs contrastive representation learning and a two-stage inference process to achieve robust, transferable provenance detection.

Refinement Provenance Inference (RPI) is the task of auditing fine-tuned LLMs at the instance level to determine whether a given prompt–response pair used for supervised fine-tuning was based on an original ("raw") user prompt or a version refined by an external LLM-based operator. In modern instruction tuning pipelines, prompt refinement—where raw prompts are rewritten for clarity and adherence to canonical instruction-following style by high-capacity LLMs (such as GPT-4o or Llama-3.3)—is widespread. RPI is central to dataset governance and dispute resolution, as it enables parties to verify the provenance of data used to train a released model, especially in the presence of mixed corpora where both raw and refined prompts are interleaved with unknown mixture ratios (Yin et al., 5 Jan 2026).

1. Formalization of the RPI Problem

RPI is defined with respect to a set of semantic instances, each comprising a raw prompt $x_i^{\rm raw}$ and a fixed output $y_i$. An external deterministic LLM refiner $R(\cdot)$ rewrites raw prompts to produce $x_i^{\rm ref} = R(x_i^{\rm raw})$. During supervised fine-tuning, each instance is randomly included either as raw or refined according to a Bernoulli indicator $z_i \sim \mathrm{Bernoulli}(\rho)$, with the effective training example being $x_i^{\rm tr} = x_i^{\rm ref}$ when $z_i = 1$ and $x_i^{\rm raw}$ otherwise. The fine-tuned model (the "victim" $M_a$, initialized from a base $M_0$) thus sees a mixture of prompt forms. The auditor, given $M_a$ and a prompt–response pair $(x_i, y_i)$, seeks to infer the latent binary label $z_i$ indicating whether $x_i$ was raw or refined in training. Auditing is conducted by extracting a feature vector $\phi(M_a; x_i, y_i) \in \mathbb{R}^d$ from teacher-forced next-token distributions, applying a learned scoring function $g$, and predicting $\hat z_i = \mathbf{I}[s_i \ge \tau]$ with $s_i = g(\phi(\cdot))$. RPI performance is evaluated using AUC or TPR at low-FPR thresholds (Yin et al., 5 Jan 2026).
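
This setup admits a minimal Python sketch; `refine`, `extract_features`, and `score` are hypothetical stand-ins for the refiner $R(\cdot)$, the feature map $\phi$, and the learned scorer $g$:

```python
# Minimal sketch of the RPI setup: a Bernoulli mixture over raw/refined
# prompts at training time, and a thresholded score at audit time.
import random

def build_training_prompt(x_raw: str, refine, rho: float = 0.5):
    """Include the refined prompt with probability rho (z_i ~ Bernoulli(rho))."""
    z = 1 if random.random() < rho else 0
    x_tr = refine(x_raw) if z == 1 else x_raw
    return x_tr, z  # z is the latent label the auditor later infers

def audit(model, x, y, extract_features, score, tau: float = 0.0) -> int:
    """Predict z_hat = 1[s >= tau] from teacher-forced features phi(M_a; x, y)."""
    phi = extract_features(model, x, y)  # feature vector in R^d
    s = score(phi)                       # learned scoring function g
    return int(s >= tau)
```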

2. Distributional Shifts Induced by LLM-Based Prompt Refinement

Refinement of prompts produces systematic, detectable shifts in the fine-tuned model's teacher-forced token behavior. This occurs because canonicalized, LLM-refined prompts more closely match the model's instruction-following priors, thereby inducing differentiated gradients during fine-tuning as compared to raw prompts. When the victim model $M_a$ is evaluated on $(x_i, y_i)$ via teacher forcing, the following per-instance statistics capture the distributional changes:

  • Normalized Negative Log-Likelihood (NLL): $\mathrm{NLL}_M(i) = -\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \ell^{(M)}_{i,t}$, where $\ell^{(M)}_{i,t}$ is the log-likelihood at timestep $t$.
  • Top-$k$ Inclusion: $\mathrm{TopK}_M(i) = \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \mathbf{I}[y_{i,t} \in \mathrm{TopK}_t^{(M)}]$, where $\mathrm{TopK}_t^{(M)}$ contains the $k$ highest logits (with $k = 1, 5, 10$ in practice).
  • Confidence Margin: $\mathrm{Gap}_M(i) = \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \bigl(s_{t,(1)}^{(M)} - s_{t,(2)}^{(M)}\bigr)$, the mean logit gap between the top two predictions.

To isolate fine-tuning effects, corresponding "uplift" features are computed as $\Delta S(i) = S_{M_0}(i) - S_{M_a}(i)$ for $S \in \{\mathrm{NLL}, \mathrm{TopK}, \mathrm{Gap}\}$. These statistics, in both their base and uplift forms, are empirically observed to exhibit stable shifts distinguishing raw from refined provenance, even when $x_i^{\rm raw}$ and $x_i^{\rm ref}$ differ only minimally at the surface level (Yin et al., 5 Jan 2026). A code sketch of these computations follows.
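
A sketch of the three statistics under stated assumptions: `logits` is a $(T, V)$ tensor of teacher-forced next-token logits over the $T$ target tokens of $y_i$, and `targets` holds the gold token ids; the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def token_stats(logits: torch.Tensor, targets: torch.Tensor, ks=(1, 5, 10)):
    """logits: (T, V) teacher-forced next-token logits; targets: (T,) gold ids."""
    T = targets.shape[0]
    logp = F.log_softmax(logits, dim=-1)
    # Normalized NLL: mean negative log-likelihood of the gold tokens.
    nll = -logp[torch.arange(T), targets].mean().item()
    # Top-k inclusion: fraction of steps whose gold token is among the k highest logits.
    topk = {}
    for k in ks:
        top_ids = logits.topk(k, dim=-1).indices              # (T, k)
        topk[k] = (top_ids == targets.unsqueeze(-1)).any(-1).float().mean().item()
    # Confidence margin: mean gap between the top-1 and top-2 logits.
    top2 = logits.topk(2, dim=-1).values                      # (T, 2)
    gap = (top2[:, 0] - top2[:, 1]).mean().item()
    return nll, topk, gap

# Uplift features contrast base and fine-tuned models on the same (x, y):
# delta_S(i) = S_{M0}(i) - S_{Ma}(i) for S in {NLL, TopK, Gap}.
```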

3. The RePro Framework for Provenance Inference

RePro is a logit-centric provenance inference framework that aggregates the above teacher-forced features into a single vector and applies a two-stage learning procedure:

Feature Construction:

$$\phi(M; x_i, y_i) = [\mathrm{NLL},\; \mathrm{TopK}_{k=1,5,10},\; \mathrm{Gap},\; \Delta\mathrm{NLL},\; \Delta\mathrm{TopK},\; \Delta\mathrm{Gap}]$$

with per-dimension standardization.
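
A minimal sketch of this standardization step, assuming the per-instance statistics for $n$ audited examples are stacked into an $(n, d)$ matrix:

```python
import numpy as np

def standardize(features: np.ndarray) -> np.ndarray:
    """Z-score each feature dimension of an (n, d) matrix independently."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True) + 1e-8  # guard against zero variance
    return (features - mu) / sigma
```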

Stage 1: Shadow Contrastive Representation Learning

  • A shadow fine-tuning dataset, disjoint from the victim's data, is constructed with the same raw/refined proportion.
  • A shadow model $M_c$ is fine-tuned, and for each instance its feature vector $\phi_i$ is projected via an MLP encoder $h_\psi$ to an embedding $u_i$.
  • A supervised contrastive loss is used (see the sketch after this list): $\mathcal{L}_{\rm SupCon} = \sum_{i \in \mathcal{B}} \Bigl[ -\frac{1}{|\mathcal{P}(i)|} \sum_{p \in \mathcal{P}(i)} \log \frac{\exp(\mathrm{sim}(u_i, u_p)/\tau)}{\sum_{a \neq i} \exp(\mathrm{sim}(u_i, u_a)/\tau)} \Bigr]$, where $\mathcal{P}(i)$ indexes positive examples (same $z$-label) in the batch $\mathcal{B}$.
  • After contrastive pre-training, $h_\psi$ is frozen and a lightweight linear classifier $g$ is fit to these embeddings using cross-entropy loss.
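
A sketch of the supervised contrastive objective in PyTorch, operating on a batch of encoder embeddings $u_i = h_\psi(\phi_i)$ and labels $z_i$; the masking details are one reasonable implementation choice, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def supcon_loss(u: torch.Tensor, z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """u: (B, d) batch of embeddings u_i; z: (B,) binary provenance labels z_i."""
    u = F.normalize(u, dim=-1)                       # cosine similarity via dot product
    sim = (u @ u.T) / tau                            # (B, B) scaled similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=u.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # exclude a = i from the denominator
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (z.unsqueeze(0) == z.unsqueeze(1)) & ~self_mask   # P(i): same-label pairs
    # Average log-probability over positives per anchor, summed over the batch.
    loss = -(log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1))
    return loss.sum()
```

Freezing $h_\psi$ after this pre-training and fitting the linear head $g$ on its embeddings completes Stage 1.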

Stage 2: Victim Model Inference

  • For the target victim $M_a$ and a candidate pair $(x_j, y_j)$, $\phi_j$ is extracted and embedded by $h_\psi$; $g$ then predicts $z_j$ (sketched below).
  • The pipeline generalizes across models and refiners, requiring no access to the victim's training data or to stochastic decoding.
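
A hedged end-to-end sketch of the audit step; `extract_phi` is a hypothetical helper wrapping the Section 2 statistics and their standardization:

```python
import torch

@torch.no_grad()
def infer_provenance(victim, base, x, y, h_psi, g, extract_phi, tau: float = 0.0) -> int:
    phi = extract_phi(victim, base, x, y)  # base + uplift features (Section 2)
    u = h_psi(phi)                         # frozen MLP encoder from Stage 1
    s = g(u)                               # linear provenance score
    return int(s.item() >= tau)            # 1 -> refined, 0 -> raw
```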

This architecture enables robust inference of provenance in unseen victim models and with previously unseen refiners, as it targets distributional shifts that are common to LLM refinement rather than style artifacts (Yin et al., 5 Jan 2026).

4. Empirical Evaluation

RePro is empirically validated on the GSM8K (math word problems) and HumanEval (code generation) benchmarks. For GSM8K, $\sim$8,000 problems are split, with refined prompt variants generated by GPT-4o and Llama-3.3-70B. Fine-tuned victims are trained on a 50/50 mix ($\rho = 0.5$) with 500 LoRA updates (rank 16, learning rate $2 \times 10^{-4}$). At inference, the auditor has no access to the victim's training data.
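
A configuration sketch matching the reported hyperparameters, assuming the Hugging Face peft/transformers stack; the model checkpoint, target modules, LoRA alpha, and batch size are illustrative assumptions rather than values from the paper:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # one victim family
lora = LoraConfig(
    r=16,                                 # rank 16, as reported
    lora_alpha=32,                        # assumed; a common 2x-rank default
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="victim-lora",
    learning_rate=2e-4,                   # as reported
    max_steps=500,                        # 500 LoRA updates
    per_device_train_batch_size=8,        # assumed
)
```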

RePro consistently outperforms three learning-free baselines (raw NLL, uplift $\Delta$NLL, and pairwise likelihood difference) in AUC and TPR@1%FPR. On GSM8K, learning-free methods yield AUC in $[0.52, 0.62]$, while RePro achieves $[0.66, 0.71]$ with TPR@1%FPR in $[0.14, 0.22]$. On HumanEval, RePro yields AUC $0.63$–$0.68$ (vs. $0.50$–$0.60$ for baselines). Ablation studies identify the uplift features as most critical ($\sim$0.04 AUC drop when removed), with ranking and margin cues also contributing. Contrastive representation learning yields a $\sim$0.03–0.05 AUC improvement over linear probing. Detection improves with an increasing refined fraction $\rho$ and with longer fine-tuning (Yin et al., 5 Jan 2026).
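
Both reported metrics can be computed from per-instance auditor scores with scikit-learn's ROC utilities; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_tpr_at_fpr(z_true, scores, target_fpr=0.01):
    """AUC and TPR at the largest operating point with FPR <= target_fpr."""
    auc = roc_auc_score(z_true, scores)
    fpr, tpr, _ = roc_curve(z_true, scores)
    ok = fpr <= target_fpr
    tpr_at = tpr[ok].max() if np.any(ok) else 0.0
    return auc, tpr_at
```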

Empirical results indicate that RePro exploits refiner-agnostic distributional shifts in likelihoods rather than model- or refiner-specific artifacts; transfer across victim model families (Qwen2.5, Llama-3.1, Mistral) and across refiners (GPT-4o, Llama-3.3) incurs only a minimal ($0.02$–$0.03$) AUC loss.

5. Transferability and Generalization

RePro demonstrates robust transferability. Models trained with shadow data refined by one LLM (e.g., GPT-4o) transfer to victims refined by a different LLM (Llama-3.3) with negligible AUC loss. Similar generalization is observed across diverse victim architectures. This transferability arises because the framework targets universal shifts in the model's likelihood landscape induced by the act of instruction-style canonicalization, not the surface artifacts of any particular refiner's style. Thus, RePro is applicable to heterogeneous and black-box fine-tuning pipelines where the details of refinement are unavailable (Yin et al., 5 Jan 2026).

6. Limitations and Practical Considerations

RPI and RePro have several noted limitations:

  • API Accessibility: Inference requires per-token log probabilities and top-$k$ logits under teacher-forced evaluation; these are inaccessible for some black-box models.
  • Reference Outputs: Access to gold reference outputs $y$ is required; fully open-ended auditing (without references) is not addressed.
  • Single-Pass Refinement: The analysis is restricted to single-pass prompt rewriting with outputs held fixed. Pipelines that jointly refine both prompts and responses, or that employ multi-stage or multi-turn refinement, may exhibit different and possibly reduced leakage.
  • Defense Strategies: Practitioners may reduce detectability by interleaving multiple refiners, randomizing refinement styles, or injecting adversarial paraphrases to mask distributional shifts.

Potential directions for extension include adapting RPI to weaker interfaces (e.g., generation-only queries), integrating it within multi-stage curation workflows, and developing obfuscation strategies to mitigate provenance leakage (Yin et al., 5 Jan 2026).

7. Broader Impact and Significance

RPI introduces a principled data-provenance audit mechanism in LLM fine-tuning regimes increasingly reliant on prompt refinement. By formalizing and empirically validating refinement provenance inference, RPI highlights the persistent statistical traces left by canonicalization steps in the training data workflow. The universal “fingerprint” of LLM-based prompt refinement has implications for dataset transparency, copyright and data ownership verification, and the design of robust instruction-tuning protocols. A plausible implication is that future data curation pipelines will need to balance the benefits of prompt refinement against increased provenance detectability, particularly in regulated or adversarial settings (Yin et al., 5 Jan 2026).
