This paper analyzes training data memorization in the Reinforcement Learning from Human Feedback (RLHF) alignment process for code completion models. It investigates how memorization can surface and propagate through each of the three phases of RLHF: fine-tuning (FT), reward modeling (RM), and reinforcement learning fine-tuning (RLFT). The authors focus on code completion models because of their popularity and the privacy concerns raised by memorizing user data.
The authors find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized compared to direct fine-tuning. However, examples already memorized during the fine-tuning stage tend to remain memorized after RLHF.
The RLHF process is detailed as consisting of three stages:
- Fine-tuning (FT) a pre-trained LLM on code completion data using self-supervised learning.
- Training a reward model (RM) to approximate human preferences by assigning scalar scores to code completions (a preference-loss sketch follows this list).
- RLFT using the reward model as a scoring function.
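To make the reward modeling stage concrete, here is a minimal sketch of a pairwise (Bradley–Terry-style) preference loss over scalar scores, assuming the reward model scores a preferred and a rejected completion of the same prompt; the function and variable names are illustrative and the paper's actual training objective may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: push the scalar score of the preferred
    completion above the score of the rejected one.

    reward_chosen / reward_rejected: shape (batch,), scalar scores produced
    by the reward model for two completions of the same prompt.
    """
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # completion receives the higher score.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with random scores standing in for reward-model outputs.
if __name__ == "__main__":
    r_chosen, r_rejected = torch.randn(8), torch.randn(8)
    print(pairwise_preference_loss(r_chosen, r_rejected).item())
```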
The paper uses Gemini Nano-1 (1.8B) and T5-Base models and evaluates memorization by measuring whether the model generates the remainder of a training example when prompted with its prefix. The authors quantify memorization with normalized edit distance, counting a completion as memorized if its distance from the true continuation falls below a specified threshold.
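A minimal sketch of this check, assuming normalized edit distance means Levenshtein distance divided by the length of the longer string and using an arbitrary threshold of 0.1; the paper's exact normalization and threshold are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def is_memorized(completion: str, target: str, threshold: float = 0.1) -> bool:
    """Flag a completion as memorized if its normalized edit distance to the
    true continuation of the training example falls below the threshold."""
    if not completion and not target:
        return True
    dist = levenshtein(completion, target)
    return dist / max(len(completion), len(target)) <= threshold
```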
To mitigate false positives, the paper employs a counterfactual definition of memorization.
Definition 1 (Counterfactual memorization):
Let $x = (p, t)$ be an example from a training dataset $D$, where $x$ is split into a prompt $p$ and a target $t$. Let model $1$ be trained on $D$ and model $2$ be trained on $D \setminus \{x\}$. Then, $x$ is counterfactually memorized if model $1$ produces $t$ when prompted with $p$ using greedy decoding and model $2$ does not.
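A minimal sketch of how this definition could be checked in code, treating each model as a greedy-decoding function from prompt to completion and using exact string match as a stand-in for the paper's edit-distance criterion; the type alias and function names are assumptions, not the paper's implementation.

```python
from typing import Callable

# A "model" here is any function mapping a prompt to its greedy-decoded completion.
GreedyDecoder = Callable[[str], str]

def counterfactually_memorized(prompt: str,
                               target: str,
                               model_with_example: GreedyDecoder,
                               model_without_example: GreedyDecoder) -> bool:
    """Definition 1: the example (prompt, target) is counterfactually memorized
    if the model trained on the full dataset reproduces the target under greedy
    decoding, while the model trained without this example does not."""
    produced_with = model_with_example(prompt) == target
    produced_without = model_without_example(prompt) == target
    return produced_with and not produced_without
```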
The authors generate a synthetic dataset (SD) of Python examples using Gemini Ultra, split into SD.Base and SD.Links. SD.Links is designed to measure memorization of Personally Identifiable Information (PII)-like information, while SD.Base measures full example memorization.
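To make the SD.Links measurement concrete, the sketch below estimates how often completions reproduce the PII-like file path embedded in their training examples; the path pattern and data layout are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern for PII-like file paths of the kind SD.Links embeds
# (e.g. /home/<username>/<project>/data.csv); purely illustrative.
PATH_PATTERN = re.compile(r"/home/[\w.-]+(?:/[\w.-]+)+")

def extract_path(example_code: str) -> Optional[str]:
    """Pull the PII-like path out of a synthetic training example, if present."""
    match = PATH_PATTERN.search(example_code)
    return match.group(0) if match else None

def regurgitation_rate(examples: list[str], completions: list[str]) -> float:
    """Fraction of completions that reproduce their example's path verbatim.

    Assumes examples[i] and completions[i] correspond to the same prompt.
    """
    hits, total = 0, 0
    for example, completion in zip(examples, completions):
        path = extract_path(example)
        if path is None:
            continue
        total += 1
        hits += path in completion
    return hits / total if total else 0.0
```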
The experiments are divided into three parts:
- Measuring memorization of fine-tuning training data after RLFT.
- Measuring memorization of reward model training data after RLFT.
- Measuring memorization of RLFT prompts.
Key findings include:
- If examples are memorized during the fine-tuning stage, they are likely to remain memorized after RLFT.
- Training data used to optimize the reward model is unlikely to be memorized by the RLFT model.
- Memorization of data used for RLFT is possible, though the risk is low and depends on training hyperparameters.
The paper analyzes the impact of the coefficient on the Kullback–Leibler (KL) divergence penalty term on memorization. A smaller KL coefficient allows the RLFT model to deviate further from the initial fine-tuned model, potentially decreasing memorization of FT training data but also increasing the risk of memorizing RLFT prompts.
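To illustrate the role of this penalty, here is a minimal sketch of the KL-regularized reward commonly used in RLFT, where the reward-model score is reduced by a scaled estimate of the KL divergence between the RLFT policy and the frozen fine-tuned reference model; the coefficient name `beta` and the per-token estimator are assumptions, not necessarily the paper's exact formulation.

```python
import torch

def kl_regularized_reward(reward: torch.Tensor,
                          logprobs_policy: torch.Tensor,
                          logprobs_reference: torch.Tensor,
                          beta: float) -> torch.Tensor:
    """Penalized reward signal for RLFT.

    reward:              (batch,) scalar reward-model scores for sampled completions.
    logprobs_policy:     (batch, seq) per-token log-probs under the RLFT policy.
    logprobs_reference:  (batch, seq) per-token log-probs under the frozen FT model.
    beta:                KL coefficient; smaller values let the policy drift
                         further from the fine-tuned reference model.
    """
    # Per-sequence KL estimate: sum over tokens of log(pi / pi_ref).
    kl_per_sequence = (logprobs_policy - logprobs_reference).sum(dim=-1)
    return reward - beta * kl_per_sequence
```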
Experiments on SD.Base show that after fine-tuning, 319 out of 3,526 examples are memorized. After RLFT, the memorization rate remains roughly the same across KL settings, between 43% and 47%, with smaller KL coefficients leading to completions with larger edit distances. On SD.Links, the fine-tuned model regurgitates PII-like file paths in 54.8% of completions, which drops to 12.6% for the RLFT model with a small KL coefficient.
The paper also evaluates the potential for the RLFT model to memorize reward model training data. Results show that only 0.9% of examples are memorized when included in the reward model training dataset, compared to 17.6% when directly fine-tuned. The fine-tuned model (FT.3) regurgitates PII-like file paths in 50% of completions, whereas the RLFT model (RLFT.2) does so in 0%.
Finally, the paper measures the rate of prompt memorization in RLFT, finding that fewer than 0.5% of the RLFT training prompts are memorized after 70 epochs, even with a small KL coefficient.
The authors conclude that the risk of the final RLFT model memorizing sensitive data introduced during reward model training is very low. This makes it feasible to use sensitive or proprietary data during reward modeling to create a more representative supervision signal for reinforcement learning. They also note that the multi-stage process of RLHF introduces complexity in analyzing memorization, and future work could explore memorization in alternative post-training methods like Direct Preference Optimization (DPO).