
Approximating Language Model Training Data from Weights (2506.15553v1)

Published 18 Jun 2025 in cs.CL

Abstract: Modern LLMs often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model close to the original model's performance, for models trained with both classification and supervised finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert Llama model's perplexity of 2.0.

Summary

  • The paper introduces SELECT, a gradient-based method to approximate hidden finetuning data by selecting effective examples from a large public corpus.
  • It details a gradient alignment strategy using synthetic checkpoints and JL projection to efficiently mimic the weight change from base to finetuned models.
  • The method underscores the risk of data leakage from released model weights, highlighting implications for privacy and proprietary training data protection.

This paper addresses the problem of approximating the training data used to finetune LLMs, particularly in scenarios where model weights are publicly released but the training data remains private (open-weights, closed-data). The authors propose a method called SELECT (Selection of Effective LLM Examples from Candidate Text) to identify a subset of documents from a large public corpus that can effectively substitute the original, unknown finetuning data.

The core assumption is that an adversary has access to the initial "base" model weights $\theta_0$ before finetuning and the "final" model weights $\theta_f$ after finetuning. The goal is to find a dataset $\mathcal{D}^*$ from a large seed corpus $\mathcal{D}$ such that training the base model $\theta_0$ on $\mathcal{D}^*$ results in a model close to $\theta_f$.

SELECT Method: Implementation Details

The SELECT method is a gradient-based approach that greedily selects datapoints from a seed corpus. The intuition is that the gradient of the loss function with respect to the initial model parameters $\theta_0$, when computed on effective training examples, should align with the direction of the weight change from the base model to the final model, $\theta_f - \theta_0$.

1. Objective Function:

The method aims to find a batch of examples $\mathcal{B}$ from the seed corpus $\mathcal{D}$ that maximizes the sum of dot products between the per-example gradients (computed at $\theta_0$) and the overall model weight difference:

$$\underset{\mathcal{B} \subseteq \mathcal{D}}{\arg\max} \left[ \sum_{x \in \mathcal{B}} \nabla \ell (x; \theta_0) \cdot (\theta_f - \theta_0) \right]$$

Solving this exactly is intractable. The paper notes this objective is submodular, allowing for an efficient greedy approximation.
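
Concretely, the implied greedy scheme starts from $\mathcal{B}_0 = \emptyset$ and repeatedly adds the remaining example whose gradient best aligns with the target direction (a sketch of the standard greedy approximation, not notation taken from the paper):

$$x_{t+1} = \underset{x \in \mathcal{D} \setminus \mathcal{B}_t}{\arg\max}\; \nabla \ell(x; \theta_0) \cdot (\theta_f - \theta_0), \qquad \mathcal{B}_{t+1} = \mathcal{B}_t \cup \{x_{t+1}\}.$$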

2. Autolabeling:

Since the true labels of the finetuning data are unknown, for classification tasks the final model $\theta_f$ is used to assign pseudo-labels to the unlabeled text in the seed corpus:

$$\hat{y}_i = \arg\max \ell(x_i; \theta_f)$$

For supervised finetuning (SFT) tasks, where the input is a prefix and the output is its continuation, the model $\theta_f$ can directly generate continuations, or the problem can be framed as selecting inputs that lead to high-likelihood outputs according to $\theta_f$. The paper focuses on classification for autolabeling but applies the gradient method to SFT as well.
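
As an illustration, here is a minimal autolabeling sketch for the classification case, assuming a HuggingFace-style sequence classifier whose forward returns `.logits` (the function name and interface are hypothetical, not from the paper):

```python
import torch

@torch.no_grad()
def autolabel(model_f, tokenizer, texts, batch_size=32):
    """Assign pseudo-labels to unlabeled seed texts using the finetuned
    classifier theta_f. Sketch assuming a HF-style sequence classifier."""
    labels = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="pt")
        logits = model_f(**enc).logits          # shape: (batch, num_classes)
        labels.extend(logits.argmax(dim=-1).tolist())
    return labels
```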

3. Synthetic Checkpoints:

To improve the robustness of gradient alignment, especially since only $\theta_0$ and $\theta_f$ are available, synthetic intermediate model checkpoints $\hat{\theta}_j$ are created via linear interpolation:

$$\hat{\theta}_j = \left(\frac{j}{P}\right)\theta_0 + \left(1 - \frac{j}{P}\right)\theta_f$$

Here, $P$ is the number of synthetic checkpoints. The selection objective is then modified to average the alignment across these synthetic checkpoints:

$$\underset{\mathcal{B} \subseteq \mathcal{D}}{\arg\max} \left[ \sum_{j=1}^{P} \sum_{x \in \mathcal{B}} \nabla \ell (x; \hat{\theta}_j) \cdot (\theta_f - \hat{\theta}_j) \right]$$
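
A minimal sketch of the interpolation step, assuming model weights are handled as state_dict-style parameter dictionaries (the helper name is hypothetical):

```python
def synthetic_checkpoints(theta_0, theta_f, P):
    """Linearly interpolate P synthetic checkpoints between theta_0 and
    theta_f (both given as dicts of parameter tensors). Follows the paper's
    indexing, where j = P recovers theta_0 and small j sits near theta_f."""
    return [
        {name: (j / P) * theta_0[name] + (1 - j / P) * theta_f[name]
         for name in theta_0}
        for j in range(1, P + 1)
    ]
```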

4. Efficient Gradient Computation:

  • Last-Layer Gradients: To reduce computational cost, only the gradients of the last layer of the model are computed. This is justified by prior work showing these gradients can be sufficiently informative.
  • Batched Per-Example Gradients: torch.func.vmap is used to compute per-example gradients efficiently in a single batched, vectorized pass (see the sketch below).
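
A sketch of batched last-layer per-example gradients using the standard torch.func per-sample-gradients recipe; the head-parameter prefix and helper name are hypothetical and depend on the model class:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap

def per_example_last_layer_grads(model, inputs, labels, head_prefix="score"):
    """Per-example gradients w.r.t. the last layer only. Assumes a classifier
    whose forward returns logits; `head_prefix` names the head parameters
    (e.g. "score" or "classifier", depending on the model)."""
    params = dict(model.named_parameters())
    last = {k: v for k, v in params.items() if k.startswith(head_prefix)}
    frozen = {k: v.detach() for k, v in params.items() if k not in last}
    frozen.update({k: v.detach() for k, v in model.named_buffers()})

    def loss_fn(last_params, x, y):
        # Treat frozen params as constants; differentiate only the head.
        logits = functional_call(model, {**frozen, **last_params},
                                 (x.unsqueeze(0),))
        return F.cross_entropy(logits, y.unsqueeze(0))

    # vmap over the batch dimension; head params are shared (in_dims=None).
    return vmap(grad(loss_fn), in_dims=(None, 0, 0))(last, inputs, labels)
```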

5. Low-Dimensional Gradient Projection:

Storing full high-dimensional gradients for a large seed corpus is memory-intensive. The Johnson-Lindenstrauss (JL) lemma is applied to project gradients into a lower-dimensional space while approximately preserving inner products. Instead of computing $\nabla \ell (x; \hat{\theta}_j) \cdot (\theta_f - \hat{\theta}_j)$, the computation uses projected gradients: $R\,\nabla \ell (x; \hat{\theta}_j) \cdot R\,(\theta_f - \hat{\theta}_j)$, where $R$ is a random projection matrix.
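
A minimal sketch of such a projection with a random Gaussian matrix (the paper's exact construction of $R$ may differ):

```python
import torch

def jl_project(vectors, d, seed=0):
    """Project rows of `vectors` (n, D) down to dimension d with a random
    Gaussian matrix, approximately preserving inner products (JL lemma)."""
    D = vectors.shape[-1]
    R = torch.randn(D, d, generator=torch.Generator().manual_seed(seed)) / d ** 0.5
    return vectors @ R

# Use the same R (same seed) for the gradients and for theta_f - theta_j,
# so that (R g) . (R dtheta) approximates g . dtheta.
```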

Algorithm 1: SELECT Overview

The practical algorithm (Algorithm 1 in the paper) proceeds as follows:

  1. Autolabel (Line 1): Assign pseudo-labels $\hat{Y}$ to the seed dataset $X$ using the final model $\theta_f$.
  2. Synthetic Checkpoints (Line 2): Generate $N$ synthetic model checkpoints $\{\hat{\theta}_i\}$ between $\theta_0$ and $\theta_f$.
  3. Compute Per-Example Gradients (Line 3): Compute gradients $G$ for each example in $(X, \hat{Y})$ using the synthetic checkpoints $\{\hat{\theta}_i\}$. As noted, this is done in practice for the last layer only.
  4. Project Gradients (Line 4): Apply JL projection to get lower-dimensional gradients $\hat{G}$ (dimension $d$).
  5. Greedy Selection (Lines 5-11):
    • Initialize an empty running gradient sum $g_b$ and an empty list of selected indices $\mathcal{I}$.
    • Iteratively select $M$ datapoints:
      • In each step, find the example $i^*$ whose projected gradient $\hat{G}_{i^*}$ maximizes the dot product with the (projected) target direction $(\theta_f - \hat{\theta}_j)$, potentially adjusted by the current sum $g_b$ to encourage diversity or to better represent the batch gradient. The paper's Algorithm 1 states $i^* \leftarrow \arg\max(\hat{G} \cdot \hat{\theta}_t^T)$, where $\hat{\theta}_t$ appears to denote the target direction $(\theta_f - \hat{\theta}_j)$ or an aggregation of it, followed by the running-sum update $\hat{G} \leftarrow \hat{G} + \text{broadcast}(\hat{G}_{i^*})$. In terms of the stated objective $\sum_{x \in \mathcal{B}} \nabla \ell (x; \theta_0) \cdot (\theta_f - \theta_0)$, the greedy step selects the next $x$ that maximizes its contribution to the sum, given the items already selected; the broadcast update modifies the selection criterion after each pick, plausibly to exploit the diminishing-returns property of submodular functions by steering later picks toward the still-unmatched part of the target vector $\theta_f - \theta_0$.

A simplified greedy selection step would be:

```python
import numpy as np

# Simplified greedy selection: pick M examples whose projected gradients
# best align with the projected weight difference (theta_f - theta_0).
# `project` stands for the JL projection described above.
selected_indices = []
remaining_target = project(theta_f - theta_0)  # or a sum over synthetic checkpoints
candidate_grads = project(all_example_gradients_at_theta_0)  # shape: (n, d)

for _ in range(M):  # M is the number of examples to select
    best_score, best_idx = -np.inf, -1
    for i in range(len(candidate_grads)):
        if i in selected_indices:
            continue
        score = np.dot(candidate_grads[i], remaining_target)
        if score > best_score:
            best_score, best_idx = score, i
    selected_indices.append(best_idx)
    # Optionally, update the target for a batch-aware greedy strategy:
    # remaining_target = remaining_target - candidate_grads[best_idx]
```
The paper's update $\hat{G} \leftarrow \hat{G} + \text{broadcast}(\hat{G}_{i^*})$, combined with the selection rule $\arg\max(\hat{G} \cdot \hat{\theta}_t^T)$, implies that each candidate is scored by its own gradient plus a running sum of already-selected gradients, which is a bit unusual. More typically, submodular maximization would compute each candidate's marginal gain. However, the paper explicitly states: "perform greedy data selection by picking instances with the highest running gradient sum."

Experimental Setup and Results

Tasks and Models:

  • Classification: AG News, DBPedia, IMDB using GPT-2 medium (355M parameters).
  • Supervised Finetuning (SFT): MSMARCO web documents using Llama-3.2 (1B and 3B parameters).
  • Seed Data: Wikipedia documents from Natural Questions.

Metrics:

  • Task Performance: Accuracy for classification, Perplexity (PPL) for SFT.
  • Dataset Similarity:
    • Vocabulary Containment (Vocab ↑): Measures token overlap (a rough sketch follows this list).
    • Optimal Transport Distance (OTD ↓): Measures semantic similarity between sentence embedding distributions.
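
One plausible reading of the vocabulary-containment metric, as a rough sketch (the paper's exact definition may differ):

```python
def vocab_containment(selected_texts, reference_texts):
    """Fraction of the reference token types that also occur in the
    selected set. A sketch; the paper may tokenize differently."""
    sel_vocab = {tok for text in selected_texts for tok in text.split()}
    ref_vocab = {tok for text in reference_texts for tok in text.split()}
    return len(ref_vocab & sel_vocab) / len(ref_vocab)
```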

Baselines:

  • Random selection.
  • Top-K: Selects examples with the highest individual gradient alignment, without greedy batch consideration.
  • Top-K (Balanced): Top-K constrained to select an equal number of examples per class.
  • P-Min / P-Max: Selecting based on lowest/highest perplexity using the final model.

Key Findings:

  • Classification (Table 1): SELECT significantly outperforms random selection and other baselines. On AG News, SELECT achieves 80.05% accuracy when selecting 1K data points, compared to 65.62% for random and an 88.32% "expert" benchmark (training on 10K true data points). It also generally yields better vocabulary overlap and OTD.
    • For AG News: Random (65% Acc) -> SELECT (80% Acc)
  • SFT (Table 2): SELECT effectively reduces perplexity. For Llama-3.2-3B on MSMARCO, SELECT achieves PPL of 2.30, compared to 3.31 for random and 2.01 for the expert model.
    • For Llama-3.2-3B: Random (PPL 3.3) -> SELECT (PPL 2.3)
  • Scaling Selection Size: SELECT's advantage over random selection increases as more data points are selected (tested from 100 to 2.5K). Top-K performs poorly, likely because it selects redundant examples.
  • Leakage Analysis: If the true finetuning data is part of the seed corpus, SELECT's performance (both task accuracy and dataset similarity) improves as the proportion of true data in the seed set increases, indicating it can "locate" actual training samples.
  • Ablations:
    • Gradient Projection Dimension: SELECT outperforms random even with a projection dimension as low as 512; the paper uses 4096 for its experiments.
    • Seed Data Distribution: Performance is best when the seed data distribution matches the target task, but general-purpose corpora like Natural Questions (Wikipedia) are surprisingly effective seed datasets across tasks.
    • Optimizer Knowledge: The method is somewhat robust to a mismatch between the optimizer used for the original finetuning and the one assumed during selection. Using SGD (which lacks Adam's second-moment statistics) for selection when the original run used Adam can even improve the signal, while AdamW slightly degrades performance due to weight decay.

Practical Implications and Implementation Considerations

  • Applicability: This method is relevant for scenarios where a base model and a finetuned version are released without the finetuning data (e.g., Llama series, DeepSeek models).
  • Computational Cost:
    • Requires forward passes and gradient computations on the seed dataset. Last-layer gradients and vmap help manage this.
    • JL projection significantly reduces the memory needed to store gradients (from $|\mathcal{D}| \times |\nabla \ell|$ to $|\mathcal{D}| \times d$, where $d \ll |\nabla \ell|$).
    • The number of synthetic checkpoints $P$ adds a multiplicative factor to the gradient computations.
  • Seed Corpus: A large, diverse seed corpus is crucial for finding suitable data. The quality and relevance of this corpus will directly impact the effectiveness of the selected subset.
  • Limitations:
    • The method selects data, it doesn't generate it. Performance is capped by the quality of the best available examples in the seed corpus.
    • Requires knowledge of $\theta_0$, $\theta_f$, and optimizer details (though some robustness to optimizer mismatch is shown).
    • The effectiveness can vary depending on the task and the dissimilarity between the base and finetuned models.

Conclusion and Ethical Considerations

The paper demonstrates that significant information about finetuning data can be inferred solely from model weights if both pre-finetune and post-finetune checkpoints are available. The SELECT method provides a practical approach to approximate this data by selecting relevant examples from a public corpus. This has implications for intellectual property protection, as model creators releasing open weights might inadvertently reveal characteristics of their proprietary datasets. The authors advise caution when releasing open-weights models if the training data is intended to remain private.

This research is a step towards understanding data leakage from model weights and could motivate further work in both data recovery techniques and privacy-preserving model release strategies.
