
LaMP Benchmark: Personalized Evaluation for LLMs

Updated 14 November 2025
  • The benchmark defines personalized classification and generation tasks that assess LLMs using user-specific profiles across real-world applications.
  • It employs retrieval-augmented strategies such as BM25, Contriever, and recency-based scoring to efficiently integrate historical user data into LLM prompts.
  • Empirical results show substantial performance improvements in both fine-tuned and zero-shot settings, demonstrating the benchmark’s practical impact on personalization.

The LaMP Benchmark is a unified suite for evaluating the capacity of LLMs to deliver genuinely personalized outputs by leveraging individual user profiles. Designed to address the limitations of “one-size-fits-all” NLP benchmarks, LaMP assesses LLMs’ ability to exploit user history for both classification and text generation tasks across diverse personalization scenarios, and provides a systematic framework for retrieval-augmented personalization modeling and evaluation (Salemi et al., 2023).

1. Motivation and Objectives

LaMP was developed to fill critical gaps left by conventional NLP benchmarks such as GLUE, SuperGLUE, KILT, and GEM, which do not account for individual-specific variation in language understanding or generation. Real-world NLP applications—including conversational agents, search engines, recommender systems, and assistive writing tools—require models that adapt to user histories and preferences rather than returning generic outputs. LaMP’s main objectives are:

  • To define a suite of tasks exemplifying realistic personalization challenges for LLMs.
  • To offer comprehensive data protocols enabling evaluation of both cold-start (new users) and temporal (future personalization) settings.
  • To rigorously evaluate retrieval-augmented prompting as a scalable strategy for injecting user-specific context into LLM inputs.

2. Task Suite and Dataset Construction

LaMP encompasses seven tasks: three personalized classification tasks and four personalized generation tasks, each constructed around the requirement to predict or generate a user-specific output $y$ for an input $x$ given a user profile $P_u = \{(x_{u1}, y_{u1}), \dots, (x_{um}, y_{um})\}$.
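
Concretely, each benchmark instance pairs an input and a target with that user's interaction history. A minimal sketch of this structure in Python (the field names are illustrative, not the official dataset schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProfileEntry:
    """One (x_ui, y_ui) pair from the user's history, e.g. a past review and its rating."""
    text: str   # historical input x_ui
    label: str  # historical output y_ui (a class label or a generated text)

@dataclass
class LaMPExample:
    """A single LaMP instance: predict y for x given the user's profile P_u."""
    user_id: str
    input_text: str              # x, e.g. a new product review or article body
    target: str                  # y, e.g. the star rating or headline
    profile: List[ProfileEntry]  # P_u = {(x_u1, y_u1), ..., (x_um, y_um)}
```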

Classification tasks:

  • LaMP-1 (Citation): Predict which of two candidate references a user will cite, based on the draft paper's title and abstract, leveraging their prior authored works.
  • LaMP-2 (Movie Tagging): Assign one of 15 tags to a movie description using the user’s previous movies and tags.
  • LaMP-3 (Product Rating): Predict a user’s star rating for a new product review, conditioned on prior reviews/ratings.

Generation tasks:

  • LaMP-4 (News Headline Generation): Generate article headlines conforming to the author’s style using their past articles and headlines.
  • LaMP-5 (Scholarly Title): Produce paper titles from abstracts, mimicking the author’s titling style based on their publication history.
  • LaMP-6 (Email Subject): Propose subject lines for user emails using personal historical email–subject pairs.
  • LaMP-7 (Tweet Paraphrasing): Paraphrase tweets in a style aligned to the user’s prior tweets.

Each of these tasks is provided with two dataset splits:

  • User-based: Disjoint user sets for train/validation/test (cold-start personalization).
  • Time-based: Each user's history is split chronologically, and models are evaluated on predicting future interactions.
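
As a rough illustration of the two protocols (a sketch assuming examples are dicts with user_id and timestamp fields, not the benchmark's released split code):

```python
import random

def user_based_split(examples, train_frac=0.8, seed=0):
    """Cold-start protocol: train and test sets draw from disjoint user sets."""
    users = sorted({ex["user_id"] for ex in examples})
    random.Random(seed).shuffle(users)
    train_users = set(users[: int(train_frac * len(users))])
    train = [ex for ex in examples if ex["user_id"] in train_users]
    test = [ex for ex in examples if ex["user_id"] not in train_users]
    return train, test

def time_based_split(profile):
    """Temporal protocol: order one user's history chronologically and hold out
    the most recent interaction as the item to predict."""
    ordered = sorted(profile, key=lambda entry: entry["timestamp"])
    return ordered[:-1], ordered[-1]
```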

Key Dataset Statistics

Task type      | #Users | Profile Size | Input Tokens | Output Tokens
Classification | 1k–20k | 50–200       | 30–454       | n/a
Generation     | 1k–20k | 18–287       | 30–454       | 7–17

3. Retrieval-Augmented Personalization Strategies

Direct inclusion of extensive user profiles often exceeds context-length limits of LLMs and may incur substantial inference costs. LaMP introduces two retrieval-based personalization methodologies:

Retrieval Scoring Functions

  • Term-Matching (BM25): Standard Okapi BM25 with $k_1 = 1.2$, $b = 0.75$.
  • Semantic Matching (Contriever): Uses the cosine similarity between pre-trained dual encoder representations.
  • Recency-Based: Scores profile entries by inverse time-since-interaction.
  • Random Baseline: Uniform sampling from the profile, indicating even generic history can offer personalization improvements.
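
The four scorers can be sketched as follows; the rank_bm25 package and the embed() callable (standing in for a Contriever-style dual encoder) are assumptions for illustration, not the benchmark's reference implementation:

```python
import random
import numpy as np
from rank_bm25 import BM25Okapi  # third-party BM25 implementation (assumed dependency)

def bm25_scores(query, profile_texts, k1=1.2, b=0.75):
    """Term matching: Okapi BM25 over whitespace-tokenized profile entries."""
    bm25 = BM25Okapi([t.split() for t in profile_texts], k1=k1, b=b)
    return bm25.get_scores(query.split())

def semantic_scores(query, profile_texts, embed):
    """Semantic matching: cosine similarity between dual-encoder embeddings."""
    q = embed(query)
    entries = np.stack([embed(t) for t in profile_texts])
    return entries @ q / (np.linalg.norm(entries, axis=1) * np.linalg.norm(q))

def recency_scores(timestamps, now):
    """Recency: newer profile entries score higher (inverse time since interaction)."""
    return np.array([1.0 / max(now - t, 1.0) for t in timestamps])

def random_scores(profile_texts, seed=0):
    """Random baseline: uniform scores, i.e. sample profile entries at random."""
    rng = random.Random(seed)
    return np.array([rng.random() for _ in profile_texts])
```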

Prompt Integration

  • In-Prompt Augmentation (IPA): Concatenates the top-$k$ retrieved profile entries, a task instruction, and the input $x$ into a single LLM prompt. Compatible with any LLM; requires no model training.
  • Fusion-in-Decoder (FiD): Forms $k$ prompts (each comprising $x$ and a single retrieved item), encodes each separately, and lets the decoder attend across all representations (encoder–decoder models only; requires fine-tuning).
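
The mechanics of IPA are straightforward to sketch; the exact prompt templates used in the benchmark differ per task, so the format string below is only illustrative:

```python
def build_ipa_prompt(instruction, input_text, profile_entries, scores, k=4):
    """In-Prompt Augmentation: prepend the k highest-scoring profile entries
    (ranked by any of the scorers above) to the task instruction and input x."""
    ranked = sorted(zip(scores, profile_entries), key=lambda pair: pair[0], reverse=True)
    context = "\n".join(f"- {entry}" for _, entry in ranked[:k])
    return f"{context}\n\n{instruction}\n{input_text}"
```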

4. Model Training, Evaluation Protocols, and Metrics

Experiments cover both fine-tuned and zero-shot LLMs. Fine-tuning is performed on FlanT5-base (250M parameters), while zero-shot settings use FlanT5-XXL and GPT-3.5. Training runs on a single NVIDIA RTX 8000 GPU with 49GB of VRAM. Training details include the AdamW optimizer ($\mathrm{lr} = 5 \times 10^{-5}$, batch size 64, weight decay $10^{-4}$, 5% warmup), a 512-token maximum input length, and beam search with a beam size of 4.
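
These settings map onto a standard Hugging Face fine-tuning configuration roughly as follows; the model identifier and total_steps are illustrative assumptions, and the full training loop is omitted:

```python
import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          get_linear_schedule_with_warmup)

# Reported settings: AdamW, lr 5e-5, batch size 64, weight decay 1e-4,
# 5% warmup, 512-token inputs, beam size 4.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-4)
total_steps = 10_000  # placeholder; depends on dataset size and number of epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),
    num_training_steps=total_steps,
)

max_input_tokens = 512  # truncation length for the assembled prompt
batch_size = 64
num_beams = 4           # beam width at generation time
```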

Classification metrics:

  • Accuracy
  • Macro F1 (LaMP-2).
  • Mean absolute error (MAE) and root-mean-square error (RMSE) for ordinal labels (LaMP-3).

Generation metrics:

  • ROUGE-1 and ROUGE-L (unigram overlap and longest-common-subsequence overlap with reference outputs).
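
A compact sketch of these metrics (the rouge-score package and the F-measure aggregation are assumptions here; macro F1 for LaMP-2 can be computed with any standard implementation, e.g. scikit-learn's f1_score):

```python
import numpy as np
from rouge_score import rouge_scorer  # pip install rouge-score

def accuracy(preds, golds):
    """Fraction of exact label matches (LaMP-1, LaMP-2)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mae_rmse(preds, golds):
    """MAE and RMSE over numeric star ratings (LaMP-3)."""
    diffs = np.asarray(preds, dtype=float) - np.asarray(golds, dtype=float)
    return float(np.mean(np.abs(diffs))), float(np.sqrt(np.mean(diffs ** 2)))

def rouge_1_l(prediction, reference):
    """ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    return {name: score.fmeasure for name, score in scores.items()}
```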

5. Empirical Results and Analysis

Fine-Tuned Performance

Fine-tuned models with retrieval augmentation show substantial improvements across all tasks:

  • LaMP-1U (Citation): Accuracy improves from 0.518 (no personalization) to 0.754 (+45.6% relative) with FiD ($k=16$).
  • LaMP-2U (Movie): Accuracy from 0.468 → 0.642 (F1: 0.435 → 0.607).
  • LaMP-3U (Rating): MAE 0.275 → 0.236.
  • LaMP-4U (News): ROUGE-1 0.153 → 0.180.

The average relative gain is approximately 23.5% in fine-tuned models.

Retrieval Method and Context Size Effects

  • Contriever is generally best for classification tasks and for most generation tasks with $k=1$.
  • BM25 is optimal on LaMP-5/6 in user-based splits; recency is sometimes superior in time-based splits (e.g., LaMP-3T, LaMP-4T).
  • Increasing $k$ improves performance up to $k=16$ for IPA, but beyond a certain point, context-length overhead leads to a plateau or degradation. FiD can handle larger $k$ more effectively.

Zero-Shot Personalization

Both FlanT5-XXL and GPT-3.5 exhibit an average relative improvement of roughly 12.2% under zero-shot IPA personalization. For example, FlanT5-XXL's accuracy on LaMP-1U increases from 0.520 to 0.699.

Prompting vs. Fine-Tuning

FiD (multi-prompt, full fine-tuning) slightly outperforms IPA (in-prompt) on classification, while IPA performs better for generation. Notably, prompting without updating model weights, i.e., constructing IPA prompts for a frozen model, still yields large gains, offering practical personalization without model retraining.

6. Key Insights, Limitations, and Future Directions

Takeaways

  • Retrieval-augmented personalization offers scalable, effective improvements in both classification and generation.
  • The choice of retriever and prompt integration must be matched to the task and evaluation split.

Limitations

  • Task simplifications (e.g., binary citation, fixed set of tags) do not capture the full complexity of real applications.
  • ROUGE and related metrics imperfectly reflect personalized style or nuanced user relevance.
  • Potential risks of data leakage due to overlap with publicly available corpora.
  • Privacy: Fine-tuned LLMs may inadvertently memorize sensitive data, raising vulnerability to membership inference.

Research Directions

  • Develop prompt engineering techniques that better encode personalized context.
  • Investigate metrics sensitive to stylistic user alignment.
  • Learn retrievers jointly with language modeling objective over user profiles.
  • Design novel architectures to compress large, diverse profile histories for efficient context integration.
  • Explore differential privacy and federated protocols for privacy-preserving personalization.

LaMP’s comprehensive evaluation protocol, diverse task coverage, and systematic personalization modeling provide a reference point for the next generation of user-adaptive LLMs (Salemi et al., 2023).

References

  • Salemi, A., Mysore, S., Bendersky, M., and Zamani, H. (2023). LaMP: When Large Language Models Meet Personalization. arXiv:2304.11406.
