
Approximating Language Model Training Data from Weights (2506.15553v1)

Published 18 Jun 2025 in cs.CL

Abstract: Modern LLMs often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model close to the original model's performance, for models trained with both classification and supervised finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert Llama model's perplexity of 2.0.

Summary

  • The paper introduces SELECT, a gradient-based method to approximate hidden finetuning data by selecting effective examples from a large public corpus.
  • It details a gradient alignment strategy using synthetic checkpoints and JL projection to efficiently mimic the weight change from base to finetuned models.
  • The method underscores the risk of data leakage from released model weights, highlighting implications for privacy and proprietary training data protection.

This paper addresses the problem of approximating the training data used to finetune LLMs, particularly in scenarios where model weights are publicly released but the training data remains private (open-weights, closed-data). The authors propose a method called SELECT (Selection of Effective LLM Examples from Candidate Text) to identify a subset of documents from a large public corpus that can effectively substitute the original, unknown finetuning data.

The core assumption is that an adversary has access to the initial "base" model weights $\theta_0$ before finetuning and the "final" model weights $\theta_f$ after finetuning. The goal is to find a dataset $\mathcal{D}^*$ from a large seed corpus $\mathcal{D}$ such that training the base model $\theta_0$ on $\mathcal{D}^*$ results in a model close to $\theta_f$.

SELECT Method: Implementation Details

The SELECT method is a gradient-based approach that greedily selects datapoints from a seed corpus. The intuition is that the gradient of the loss function with respect to the initial model parameters $\theta_0$, when computed on effective training examples, should align with the direction of the weight change from the base model to the final model, $\theta_f - \theta_0$.

1. Objective Function:

The method aims to find a batch of examples $\mathcal{B}$ from the seed corpus $\mathcal{D}$ that maximizes the sum of dot products between the per-example gradients (computed at $\theta_0$) and the overall model weight difference:

$$\underset{\mathcal{B} \subseteq \mathcal{D}}{\arg\max} \left[ \sum_{x \in \mathcal{B}} \nabla \ell (x; \theta_0) \cdot (\theta_f - \theta_0) \right]$$

Solving this exactly is intractable. The paper notes this objective is submodular, allowing for an efficient greedy approximation.
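
Concretely, the implied greedy scheme starts from $\mathcal{B}_0 = \emptyset$ and repeatedly adds the remaining example whose gradient best aligns with the target direction (a sketch of the standard greedy approximation, not notation taken from the paper):

$$x_{t+1} = \underset{x \in \mathcal{D} \setminus \mathcal{B}_t}{\arg\max}\; \nabla \ell(x; \theta_0) \cdot (\theta_f - \theta_0), \qquad \mathcal{B}_{t+1} = \mathcal{B}_t \cup \{x_{t+1}\}.$$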

2. Autolabeling:

Since the true labels of the finetuning data are unknown, for classification tasks the final model $\theta_f$ is used to assign pseudo-labels to the unlabeled text in the seed corpus:

$$\hat{y}_i = \arg\max \ell(x_i; \theta_f)$$

For supervised finetuning (SFT) tasks, where the input is a prefix and the output is its continuation, the model $\theta_f$ can directly generate continuations, or the problem can be framed as selecting inputs that lead to high-likelihood outputs according to $\theta_f$. The paper focuses on classification for autolabeling but applies the gradient method to SFT as well.
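
As an illustration, here is a minimal autolabeling sketch for the classification case, assuming a HuggingFace-style sequence classifier whose forward returns `.logits` (the function name and interface are hypothetical, not from the paper):

```python
import torch

@torch.no_grad()
def autolabel(model_f, tokenizer, texts, batch_size=32):
    """Assign pseudo-labels to unlabeled seed texts using the finetuned
    classifier theta_f. Sketch assuming a HF-style sequence classifier."""
    labels = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="pt")
        logits = model_f(**enc).logits          # shape: (batch, num_classes)
        labels.extend(logits.argmax(dim=-1).tolist())
    return labels
```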

3. Synthetic Checkpoints:

To improve the robustness of gradient alignment, especially since only $\theta_0$ and $\theta_f$ are available, synthetic intermediate model checkpoints $\hat{\theta}_j$ are created via linear interpolation:

$$\hat{\theta}_j = \left(\frac{j}{P}\right)\theta_0 + \left(1 - \frac{j}{P}\right)\theta_f$$

Here, $P$ is the number of synthetic checkpoints. The selection objective is then modified to average the alignment across these synthetic checkpoints:

$$\underset{\mathcal{B} \subseteq \mathcal{D}}{\arg\max} \left[ \sum_{j=1}^{P} \sum_{x \in \mathcal{B}} \nabla \ell (x; \hat{\theta}_j) \cdot (\theta_f - \hat{\theta}_j) \right]$$
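
A minimal sketch of the interpolation step, assuming model weights are handled as state_dict-style parameter dictionaries (the helper name is hypothetical):

```python
def synthetic_checkpoints(theta_0, theta_f, P):
    """Linearly interpolate P synthetic checkpoints between theta_0 and
    theta_f (both given as dicts of parameter tensors). Follows the paper's
    indexing, where j = P recovers theta_0 and small j sits near theta_f."""
    return [
        {name: (j / P) * theta_0[name] + (1 - j / P) * theta_f[name]
         for name in theta_0}
        for j in range(1, P + 1)
    ]
```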

4. Efficient Gradient Computation:

  • Last-Layer Gradients: To reduce computational cost, only the gradients of the last layer of the model are computed. This is justified by prior work showing these gradients can be sufficiently informative.
  • Batched Per-Example Gradients: torch.func.vmap is used to compute per-example gradients efficiently in a single batched, vectorized pass (see the sketch below).
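
A sketch of batched last-layer per-example gradients using the standard torch.func per-sample-gradients recipe; the head-parameter prefix and helper name are hypothetical and depend on the model class:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap

def per_example_last_layer_grads(model, inputs, labels, head_prefix="score"):
    """Per-example gradients w.r.t. the last layer only. Assumes a classifier
    whose forward returns logits; `head_prefix` names the head parameters
    (e.g. "score" or "classifier", depending on the model)."""
    params = dict(model.named_parameters())
    last = {k: v for k, v in params.items() if k.startswith(head_prefix)}
    frozen = {k: v.detach() for k, v in params.items() if k not in last}
    frozen.update({k: v.detach() for k, v in model.named_buffers()})

    def loss_fn(last_params, x, y):
        # Treat frozen params as constants; differentiate only the head.
        logits = functional_call(model, {**frozen, **last_params},
                                 (x.unsqueeze(0),))
        return F.cross_entropy(logits, y.unsqueeze(0))

    # vmap over the batch dimension; head params are shared (in_dims=None).
    return vmap(grad(loss_fn), in_dims=(None, 0, 0))(last, inputs, labels)
```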

5. Low-Dimensional Gradient Projection:

Storing full high-dimensional gradients for a large seed corpus is memory-intensive. The Johnson-Lindenstrauss (JL) lemma is applied to project gradients into a lower-dimensional space while approximately preserving inner products. Instead of computing $\nabla \ell (x; \hat{\theta}_j) \cdot (\theta_f - \hat{\theta}_j)$, the computation uses projected gradients: $R\,\nabla \ell (x; \hat{\theta}_j) \cdot R\,(\theta_f - \hat{\theta}_j)$, where $R$ is a random projection matrix.
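
A minimal sketch of such a projection with a random Gaussian matrix (the paper's exact construction of $R$ may differ):

```python
import torch

def jl_project(vectors, d, seed=0):
    """Project rows of `vectors` (n, D) down to dimension d with a random
    Gaussian matrix, approximately preserving inner products (JL lemma)."""
    D = vectors.shape[-1]
    R = torch.randn(D, d, generator=torch.Generator().manual_seed(seed)) / d ** 0.5
    return vectors @ R

# Use the same R (same seed) for the gradients and for theta_f - theta_j,
# so that (R g) . (R dtheta) approximates g . dtheta.
```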

Algorithm 1: SELECT Overview

The practical algorithm (Algorithm 1 in the paper) proceeds as follows:

  1. Autolabel (Line 1): Assign pseudo-labels $\hat{Y}$ to the seed dataset $X$ using the final model $\theta_f$.
  2. Synthetic Checkpoints (Line 2): Generate $N$ synthetic model checkpoints $\{\hat{\theta}_i\}$ between $\theta_0$ and $\theta_f$.
  3. Compute Per-Example Gradients (Line 3): Compute gradients $G$ for each example in $(X, \hat{Y})$ using the synthetic checkpoints $\{\hat{\theta}_i\}$. As noted, this is done in practice for the last layer only.
  4. Project Gradients (Line 4): Apply JL projection to get lower-dimensional gradients $\hat{G}$ (dimension $d$).
  5. Greedy Selection (Lines 5-11):
    • Initialize an empty running gradient sum $g_b$ and an empty list of selected indices $\mathcal{I}$.
    • Iteratively select $M$ datapoints:
      • In each step, find the example $i^*$ whose projected gradient $\hat{G}_{i^*}$ maximizes the dot product with the (projected) target direction $(\theta_f - \hat{\theta}_j)$, potentially adjusted by the current sum $g_b$ to encourage diversity or to better represent the batch gradient. The paper's Algorithm 1 states $i^* \leftarrow \arg\max(\hat{G} \cdot \hat{\theta}_t^T)$, where $\hat{\theta}_t$ appears to denote the target direction $(\theta_f - \hat{\theta}_j)$ or an aggregation of it, followed by the running-sum update $\hat{G} \leftarrow \hat{G} + \text{broadcast}(\hat{G}_{i^*})$. In terms of the stated objective $\sum_{x \in \mathcal{B}} \nabla \ell (x; \theta_0) \cdot (\theta_f - \theta_0)$, the greedy step selects the next $x$ that maximizes its contribution to the sum, given the items already selected; the broadcast update modifies the selection criterion after each pick, plausibly to exploit the diminishing-returns property of submodular functions by steering later picks toward the still-unmatched part of the target vector $\theta_f - \theta_0$.

A simplified greedy selection step would be:

```python
import numpy as np

# Simplified greedy selection: pick M examples whose projected gradients
# best align with the projected weight difference (theta_f - theta_0).
# `project` stands for the JL projection described above.
selected_indices = []
remaining_target = project(theta_f - theta_0)  # or a sum over synthetic checkpoints
candidate_grads = project(all_example_gradients_at_theta_0)  # shape: (n, d)

for _ in range(M):  # M is the number of examples to select
    best_score, best_idx = -np.inf, -1
    for i in range(len(candidate_grads)):
        if i in selected_indices:
            continue
        score = np.dot(candidate_grads[i], remaining_target)
        if score > best_score:
            best_score, best_idx = score, i
    selected_indices.append(best_idx)
    # Optionally, update the target for a batch-aware greedy strategy:
    # remaining_target = remaining_target - candidate_grads[best_idx]
```
The paper's update $\hat{G} \leftarrow \hat{G} + \text{broadcast}(\hat{G}_{i^*})$, combined with the selection rule $\arg\max(\hat{G} \cdot \hat{\theta}_t^T)$, implies that each candidate is scored by its own gradient plus a running sum of already-selected gradients, which is a bit unusual. More typically, submodular maximization would compute each candidate's marginal gain. However, the paper explicitly states: "perform greedy data selection by picking instances with the highest running gradient sum."

Experimental Setup and Results

Tasks and Models:

  • Classification: AG News, DBPedia, IMDB using GPT-2 medium (355M parameters).
  • Supervised Finetuning (SFT): MSMARCO web documents using Llama-3.2 (1B and 3B parameters).
  • Seed Data: Wikipedia documents from Natural Questions.

Metrics:

  • Task Performance: Accuracy for classification, Perplexity (PPL) for SFT.
  • Dataset Similarity:
    • Vocabulary Containment (Vocab ↑): Measures token overlap (a rough sketch follows this list).
    • Optimal Transport Distance (OTD ↓): Measures semantic similarity between sentence embedding distributions.
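
One plausible reading of the vocabulary-containment metric, as a rough sketch (the paper's exact definition may differ):

```python
def vocab_containment(selected_texts, reference_texts):
    """Fraction of the reference token types that also occur in the
    selected set. A sketch; the paper may tokenize differently."""
    sel_vocab = {tok for text in selected_texts for tok in text.split()}
    ref_vocab = {tok for text in reference_texts for tok in text.split()}
    return len(ref_vocab & sel_vocab) / len(ref_vocab)
```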

Baselines:

  • Random selection.
  • Top-K: Selects examples with the highest individual gradient alignment, without greedy batch consideration.
  • Top-K (Balanced): Top-K constrained to select an equal number of examples per class.
  • P-Min / P-Max: Selecting based on lowest/highest perplexity using the final model.

Key Findings:

  • Classification (Table 1): SELECT significantly outperforms random selection and other baselines. On AG News, SELECT achieves 80.05% accuracy when selecting 1K data points, compared to 65.62% for random and an 88.32% "expert" benchmark (training on 10K true data points). It also generally yields better vocabulary overlap and OTD.
    • For AG News: Random (65% Acc) -> SELECT (80% Acc)
  • SFT (Table 2): SELECT effectively reduces perplexity. For Llama-3.2-3B on MSMARCO, SELECT achieves PPL of 2.30, compared to 3.31 for random and 2.01 for the expert model.
    • For Llama-3.2-3B: Random (PPL 3.3) -> SELECT (PPL 2.3)
  • Scaling Selection Size: SELECT's advantage over random selection increases as more data points are selected (tested from 100 to 2.5K). Top-K performs poorly, likely because it selects redundant examples.
  • Leakage Analysis: If the true finetuning data is part of the seed corpus, SELECT's performance (both task accuracy and dataset similarity) improves as the proportion of true data in the seed set increases, indicating it can "locate" actual training samples.
  • Ablations:
    • Gradient Projection Dimension: SELECT outperforms random even with a projection dimension as low as 512; the paper uses 4096 for its experiments.
    • Seed Data Distribution: Performance is best when the seed data distribution matches the target task, but general-purpose corpora like Natural Questions (Wikipedia) are surprisingly effective seed datasets across tasks.
    • Optimizer Knowledge: The method is somewhat robust to a mismatch between the optimizer used for the original finetuning and the one assumed during selection. Using SGD (which lacks Adam's second-moment statistics) for selection when the original run used Adam can even improve the signal, while AdamW slightly degrades performance due to weight decay.

Practical Implications and Implementation Considerations

  • Applicability: This method is relevant for scenarios where a base model and a finetuned version are released without the finetuning data (e.g., Llama series, DeepSeek models).
  • Computational Cost:
    • Requires forward passes and gradient computations on the seed dataset. Last-layer gradients and vmap help manage this.
    • JL projection significantly reduces the memory needed to store gradients (from $|\mathcal{D}| \times |\nabla \ell|$ to $|\mathcal{D}| \times d$, where $d \ll |\nabla \ell|$).
    • The number of synthetic checkpoints $P$ adds a multiplicative factor to the gradient computations.
  • Seed Corpus: A large, diverse seed corpus is crucial for finding suitable data. The quality and relevance of this corpus will directly impact the effectiveness of the selected subset.
  • Limitations:
    • The method selects data, it doesn't generate it. Performance is capped by the quality of the best available examples in the seed corpus.
    • Requires knowledge of $\theta_0$, $\theta_f$, and optimizer details (though some robustness to optimizer mismatch is shown).
    • The effectiveness can vary depending on the task and the dissimilarity between the base and finetuned models.

Conclusion and Ethical Considerations

The paper demonstrates that significant information about finetuning data can be inferred solely from model weights if both pre-finetune and post-finetune checkpoints are available. The SELECT method provides a practical approach to approximate this data by selecting relevant examples from a public corpus. This has implications for intellectual property protection, as model creators releasing open weights might inadvertently reveal characteristics of their proprietary datasets. The authors advise caution when releasing open-weights models if the training data is intended to remain private.

This research is a step towards understanding data leakage from model weights and could motivate further work in both data recovery techniques and privacy-preserving model release strategies.
