Predicting memorization within Large Language Models fine-tuned for classification
Abstract: LLMs have received significant attention due to their abilities to solve a wide range of complex tasks. However these models memorize a significant proportion of their training data, posing a serious threat when disclosed at inference time. To mitigate this unintended memorization, it is crucial to understand what elements are memorized and why. This area of research is largely unexplored, with most existing works providing a posteriori explanations. To address this gap, we propose a new approach to detect memorized samples a priori in LLMs fine-tuned for classification tasks. This method is effective from the early stages of training and readily adaptable to other classification settings, such as training vision models from scratch. Our method is supported by new theoretical results, and requires a low computational budget. We achieve strong empirical results, paving the way for the systematic identification and protection of vulnerable samples before they are memorized.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.