LaMP-QA Benchmark

Updated 24 September 2025
  • The paper introduces LaMP-QA, a benchmark that uses user profiles to generate and evaluate personalized long-form answers.
  • It employs both automatic LLM-based scoring and human-centered reviews to robustly assess answer quality against user-specific contexts.
  • Experimental results reveal up to 39% improvement in answer quality with personalized context compared to non-personalized baselines.

LaMP-QA is a benchmark for evaluating personalized long-form question answering with LLMs, targeting the generation of answers that are specifically tailored to individual users’ information needs. Developed to address the lack of systematic resources for training and evaluating personalized QA systems, LaMP-QA focuses on situations where the answer should reflect not only general knowledge but the context and preferences expressed in the user’s prior interactions. The benchmark utilizes diverse real-world community Q&A data, supports rigorous human and automatic evaluation methodologies, and provides standardized baselines for both generic and adaptive model approaches. Results show marked improvement in answer quality when personal user profiles are leveraged, confirming the importance of context-relevant personalization in long-form QA.

1. Conceptual Foundation and Scope

LaMP-QA is designed to systematically evaluate question answering in a personalized, information-seeking framework. Unlike prior personalization studies focusing on short-form output or style-driven personalization (as in reviews or emails), LaMP-QA’s emphasis lies in generating long-form answers that address explicit requirements articulated by users. Each LaMP-QA instance is grounded in real-world data adapted from large community Q&A platforms (SE-PQA): each post consists of a question (title) and an associated narrative detailing the user’s background, motivations, or constraints.

Question topics cover three principal categories:

  • Arts & Entertainment
  • Lifestyle & Personal Development
  • Society & Culture

Collectively, these areas are represented in over 45 subcategories, ensuring broad topical diversity and enabling robust assessment across user types and domains.

2. Structure and Use of Personalization

Personalization within LaMP-QA is operationalized through user profiles. Each profile is a set of questions previously asked by the user, serving as a proxy for the content and priorities most relevant to them. The benchmark simulates deployment scenarios in which a QA system receives a new question $x_u$ from user $u$, then generates an answer $\hat{y}_u$ conditioned on both the question and the personal context $P_u$:

$\hat{y}_u = M(x_u, P_u)$

where $M$ denotes the LLM system.

This approach directly aligns with usage patterns on real community Q&A platforms, where user interests evolve and personalized suggestions or answers have practical impact on engagement and satisfaction.
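
To make the formulation concrete, the following Python sketch shows one way a system $M$ might condition generation on both the question $x_u$ and the profile $P_u$. It is a minimal illustration, not the benchmark's actual implementation: the prompt template and the `generate` callable are hypothetical placeholders for whatever LLM interface is used.

```python
# Minimal sketch of personalized generation: y_hat_u = M(x_u, P_u).
# `generate` is a placeholder for any LLM completion call (hypothetical).

def build_personalized_prompt(question: str, profile: list[str]) -> str:
    """Combine the user's past questions (the profile P_u) with the new question x_u."""
    profile_block = "\n".join(f"- {past_question}" for past_question in profile)
    return (
        "The user has previously asked the following questions:\n"
        f"{profile_block}\n\n"
        "Taking these interests and constraints into account, write a detailed, "
        "long-form answer to the new question below.\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer_personalized(question: str, profile: list[str], generate) -> str:
    """y_hat_u = M(x_u, P_u): the model sees the question and the personal context."""
    return generate(build_personalized_prompt(question, profile))
```

The non-personalized baseline discussed in Section 4 corresponds to calling `generate` on the question alone, with no profile block in the prompt.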

3. Evaluation Protocols

LaMP-QA employs a dual system of automatic and human-centered evaluation to robustly appraise generated answers:

  • Automatic Evaluation: LLM-based scoring is used to assess the alignment of generated answers with a set of rubric aspects $E_x$ distilled from the user’s narrative $r_x$. The process involves (i) extracting the core aspects $E_x = \{e_1, e_2, \ldots, e_n\}$, (ii) prompting LLMs to evaluate each generated answer $\hat{y}_u$ against these aspects, and (iii) assigning a score $s_a \in [0, 2]$ to each aspect. Each aspect score is then normalized (see the scoring sketch at the end of this section):

$\text{Normalized Aspect Score} = \frac{s_a}{2}$

and the overall quality score for an answer is the mean across its normalized aspect scores.

  • Human Evaluation: Annotators review the narrative, the extracted aspects, and the generated answers, and score (typically on a $1$–$5$ scale) how well each answer aligns with the user’s articulated needs. In addition, pairwise comparisons ask annotators to select, for a given question, the preferred of two candidate answers based on adherence to the narrative’s constraints.

Comparative experiments indicate that aspect-driven evaluation provides the strongest agreement with human judgments (73%), outperforming direct rating and pairwise-only methods.
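
As a minimal sketch of the aggregation step described above, assuming the LLM judge has already returned an integer score in $[0, 2]$ for each extracted aspect (the judging prompt itself is not shown):

```python
# Aspect-based scoring sketch: each aspect score s_a in {0, 1, 2} is normalized
# to [0, 1] by dividing by 2, and the answer's overall quality is the mean.

def aggregate_aspect_scores(aspect_scores: list[int]) -> float:
    """Return the overall quality score for one answer from its per-aspect scores."""
    if not aspect_scores:
        return 0.0
    normalized = [s / 2 for s in aspect_scores]   # Normalized Aspect Score = s_a / 2
    return sum(normalized) / len(normalized)      # mean across aspects

# Example: an answer judged on four extracted aspects e_1 .. e_4.
print(aggregate_aspect_scores([2, 1, 2, 0]))  # 0.625
```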

4. Modeling Approaches and Baselines

LaMP-QA benchmarks several QA strategies:

  • Non-Personalized Baseline: Models generate answers $\hat{y}_u = M(x_u)$ using only the question, independent of any user-specific context.
  • RAG-Personalization: Retrieval-augmented generation incorporates the top-$k$ most relevant items from the user profile, retrieved as $R(x_u, P_u, k)$, and concatenates them with the question for answer generation (see the sketch after this list):

$\hat{y}_u = M(x_u, R(x_u, P_u, k))$

  • Plan-RAG Personalization (PlanPers): A "planning" step precedes answer generation; a separate planner LLM $M_{\text{plan}}$ processes the question and the retrieved items to infer salient answer aspects $p_{x_u}$. The answering LLM then receives $(x_u, R(x_u, P_u, k), p_{x_u})$ as context.
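
A minimal sketch of the RAG-personalization pipeline under simple assumptions: the word-overlap retriever below is a toy stand-in for the benchmark's actual retriever $R$, and `generate` is again a hypothetical LLM call.

```python
# RAG-personalization sketch: y_hat_u = M(x_u, R(x_u, P_u, k)).
# The lexical-overlap retriever is a toy stand-in for the real retriever R.

def retrieve_top_k(question: str, profile: list[str], k: int) -> list[str]:
    """Rank profile items by word overlap with the question and keep the top k."""
    question_words = set(question.lower().split())
    ranked = sorted(
        profile,
        key=lambda item: len(question_words & set(item.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer_rag_personalized(question: str, profile: list[str], k: int, generate) -> str:
    """Concatenate the retrieved profile items with the question, then generate."""
    retrieved = retrieve_top_k(question, profile, k)
    context = "\n".join(f"- {item}" for item in retrieved)
    prompt = (
        "Relevant past questions from this user:\n"
        f"{context}\n\n"
        "Write a detailed, personalized answer to the new question below.\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```

The PlanPers variant would add one intermediate planner call that turns the retrieved items into a short list of answer aspects $p_{x_u}$, which is then appended to the prompt before the final answering call.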

Both open-source (Gemma 2, Qwen 2.5) and proprietary (GPT-4o-mini) LLMs are benchmarked. Explicit modeling of user intent via aspect planning yields enhanced alignment and answer quality irrespective of base model strength.

5. Experimental Findings and Performance Metrics

Empirical evaluation with LaMP-QA demonstrates that integrating personalized context through RAG methods delivers up to 39% improvement in answer quality over non-personalized baselines. When random or mismatched profiles are supplied (simulating poor personalization), answer quality decreases by as much as 62%, affirming that performance gains depend critically on the relevance of context.

A monotonic improvement trend is observed with increasing $k$ (the number of retrieved profile items), indicating that richer user context facilitates better personalization. The findings suggest that, for long-form QA, both retrieval precision and profile relevance are necessary drivers of quality.

6. Dataset Release and Research Implications

LaMP-QA is fully released on GitHub and Hugging Face, comprising the dataset, code, and evaluation scripts. This public release enables reproducible comparisons and fosters research into user-specific adaptation strategies. The resource supports development and fair evaluation against standard baselines, and provides a platform for exploring challenges such as privacy, diversity in user profiles, and advanced interaction modeling.

A plausible implication is that LaMP-QA’s rubric-driven, aspect-based evaluation protocol may be adopted for future personalization studies in other NLP domains, given its proven correspondence with human criteria.

7. Context and Prospective Developments

LaMP-QA represents a paradigm shift in the QA benchmarking landscape, moving beyond style-mimicry to focus on deep, context-driven personalization aligned with explicit user goals. By structuring evaluation around fine-grained aspect extraction and scoring, and demonstrating large performance differentials due to personal context, LaMP-QA sets a technical foundation for subsequent research on long-form personalized answer generation.

Future work will likely engage challenges in automatic evaluation refinement, privacy-aware personalization, multidimensional user profiling, and more nuanced understanding of the tradeoffs between answer generality and user specificity. Public release of the benchmark positions it as a central resource for advancing personalized QA models and methodologies (Salemi et al., 30 May 2025).
