This paper analyzes training data memorization in the Reinforcement Learning from Human Feedback (RLHF) alignment process for code completion models. It investigates how memorization can surface and propagate through each of the three phases of RLHF: fine-tuning (FT), reward modeling (RM), and reinforcement learning fine-tuning (RLFT). The authors focus on code completion models because of their popularity and the privacy concerns raised by memorizing user data.
The authors find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized compared to direct fine-tuning. However, examples already memorized during the fine-tuning stage tend to remain memorized after RLHF.
The RLHF process is detailed as consisting of three stages:
- Fine-tuning (FT) a pre-trained LLM on code completion data using self-supervised learning.
- Training a reward model (RM) to approximate human preferences by assigning scalar scores to code completions (a preference-loss sketch follows this list).
- RLFT using the reward model as a scoring function.
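To make the reward modeling stage concrete, here is a minimal sketch of a pairwise (Bradley–Terry-style) preference loss over scalar scores, assuming the reward model scores a preferred and a rejected completion of the same prompt; the function and variable names are illustrative and the paper's actual training objective may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: push the scalar score of the preferred
    completion above the score of the rejected one.

    reward_chosen / reward_rejected: shape (batch,), scalar scores produced
    by the reward model for two completions of the same prompt.
    """
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # completion receives the higher score.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with random scores standing in for reward-model outputs.
if __name__ == "__main__":
    r_chosen, r_rejected = torch.randn(8), torch.randn(8)
    print(pairwise_preference_loss(r_chosen, r_rejected).item())
```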
The paper uses Gemini Nano-1 (1.8B) and T5-Base models and evaluates memorization by measuring whether the model generates the remainder of a training example when prompted with its prefix. The authors quantify memorization with normalized edit distance, counting a completion as memorized if its distance from the true continuation falls below a specified threshold.
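A minimal sketch of this check, assuming normalized edit distance means Levenshtein distance divided by the length of the longer string and using an arbitrary threshold of 0.1; the paper's exact normalization and threshold are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def is_memorized(completion: str, target: str, threshold: float = 0.1) -> bool:
    """Flag a completion as memorized if its normalized edit distance to the
    true continuation of the training example falls below the threshold."""
    if not completion and not target:
        return True
    dist = levenshtein(completion, target)
    return dist / max(len(completion), len(target)) <= threshold
```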
To mitigate false positives, the paper employs a counterfactual definition of memorization.
Definition 1 (Counterfactual memorization):
Let $x = (p, t)$ be an example from a training dataset $D$, where $x$ is split into a prompt $p$ and a target $t$. Let model $1$ be trained on $D$ and model $2$ be trained on $D \setminus \{x\}$. Then, $x$ is counterfactually memorized if model $1$ produces $t$ when prompted with $p$ using greedy decoding and model $2$ does not.
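A minimal sketch of how this definition could be checked in code, treating each model as a greedy-decoding function from prompt to completion and using exact string match as a stand-in for the paper's edit-distance criterion; the type alias and function names are assumptions, not the paper's implementation.

```python
from typing import Callable

# A "model" here is any function mapping a prompt to its greedy-decoded completion.
GreedyDecoder = Callable[[str], str]

def counterfactually_memorized(prompt: str,
                               target: str,
                               model_with_example: GreedyDecoder,
                               model_without_example: GreedyDecoder) -> bool:
    """Definition 1: the example (prompt, target) is counterfactually memorized
    if the model trained on the full dataset reproduces the target under greedy
    decoding, while the model trained without this example does not."""
    produced_with = model_with_example(prompt) == target
    produced_without = model_without_example(prompt) == target
    return produced_with and not produced_without
```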
The authors generate a synthetic dataset (SD) of Python examples using Gemini Ultra, split into SD.Base and SD.Links. SD.Links is designed to measure memorization of Personally Identifiable Information (PII)-like information, while SD.Base measures full example memorization.
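To make the SD.Links measurement concrete, the sketch below estimates how often completions reproduce the PII-like file path embedded in their training examples; the path pattern and data layout are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern for PII-like file paths of the kind SD.Links embeds
# (e.g. /home/<username>/<project>/data.csv); purely illustrative.
PATH_PATTERN = re.compile(r"/home/[\w.-]+(?:/[\w.-]+)+")

def extract_path(example_code: str) -> Optional[str]:
    """Pull the PII-like path out of a synthetic training example, if present."""
    match = PATH_PATTERN.search(example_code)
    return match.group(0) if match else None

def regurgitation_rate(examples: list[str], completions: list[str]) -> float:
    """Fraction of completions that reproduce their example's path verbatim.

    Assumes examples[i] and completions[i] correspond to the same prompt.
    """
    hits, total = 0, 0
    for example, completion in zip(examples, completions):
        path = extract_path(example)
        if path is None:
            continue
        total += 1
        hits += path in completion
    return hits / total if total else 0.0
```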
The experiments are divided into three parts:
- Measuring memorization of fine-tuning training data after RLFT.
- Measuring memorization of reward model training data after RLFT.
- Measuring memorization of RLFT prompts.
Key findings include:
- If examples are memorized during the fine-tuning stage, they are likely to remain memorized after RLFT.
- Training data used to optimize the reward model is unlikely to be memorized by the RLFT model.
- Memorization of data used for RLFT is possible, though the risk is low and depends on training hyperparameters.
The paper analyzes the impact of the coefficient on the Kullback–Leibler (KL) divergence penalty term on memorization. A smaller KL coefficient allows the RLFT model to deviate further from the initial fine-tuned model, potentially decreasing memorization of FT training data but also increasing the risk of memorizing RLFT prompts.
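To illustrate the role of this penalty, here is a minimal sketch of the KL-regularized reward commonly used in RLFT, where the reward-model score is reduced by a scaled estimate of the KL divergence between the RLFT policy and the frozen fine-tuned reference model; the coefficient name `beta` and the per-token estimator are assumptions, not necessarily the paper's exact formulation.

```python
import torch

def kl_regularized_reward(reward: torch.Tensor,
                          logprobs_policy: torch.Tensor,
                          logprobs_reference: torch.Tensor,
                          beta: float) -> torch.Tensor:
    """Penalized reward signal for RLFT.

    reward:              (batch,) scalar reward-model scores for sampled completions.
    logprobs_policy:     (batch, seq) per-token log-probs under the RLFT policy.
    logprobs_reference:  (batch, seq) per-token log-probs under the frozen FT model.
    beta:                KL coefficient; smaller values let the policy drift
                         further from the fine-tuned reference model.
    """
    # Per-sequence KL estimate: sum over tokens of log(pi / pi_ref).
    kl_per_sequence = (logprobs_policy - logprobs_reference).sum(dim=-1)
    return reward - beta * kl_per_sequence
```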
Experiments on SD.Base show that after fine-tuning, 319 out of 3,526 examples are memorized. After RLFT, the memorization rate remains roughly the same across KL settings, between 43% and 47%, with smaller KL coefficients leading to completions with larger edit distances. On SD.Links, the fine-tuned model regurgitates PII-like file paths in 54.8% of completions, which drops to 12.6% for the RLFT model with a small KL coefficient.
The paper also evaluates the potential for the RLFT model to memorize reward model training data. Results show that only 0.9% of examples are memorized when included in the reward model training dataset, compared to 17.6% when directly fine-tuned. The fine-tuned model (FT.3) regurgitates PII-like file paths in 50% of completions, whereas the RLFT model (RLFT.2) does so in 0%.
Finally, the paper measures the rate of prompt memorization in RLFT, finding that fewer than 0.5% of the RLFT training prompts are memorized after 70 epochs, even with a small KL coefficient.
The authors conclude that the risk of the final RLFT model memorizing sensitive data introduced during reward model training is very low. This makes it feasible to use sensitive or proprietary data during reward modeling to create a more representative supervision signal for reinforcement learning. They also note that the multi-stage process of RLHF introduces complexity in analyzing memorization, and future work could explore memorization in alternative post-training methods like Direct Preference Optimization (DPO).