LaMP-QA Benchmark

Updated 24 September 2025
  • The paper introduces LaMP-QA, a benchmark that uses user profiles to generate and evaluate personalized long-form answers.
  • It employs both automatic LLM-based scoring and human-centered reviews to robustly assess answer quality against user-specific contexts.
  • Experimental results reveal up to 39% improvement in answer quality with personalized context compared to non-personalized baselines.

LaMP-QA is a benchmark for evaluating personalized long-form question answering with LLMs, targeting the generation of answers that are specifically tailored to individual users’ information needs. Developed to address the lack of systematic resources for training and evaluating personalized QA systems, LaMP-QA focuses on situations where the answer should reflect not only general knowledge but the context and preferences expressed in the user’s prior interactions. The benchmark utilizes diverse real-world community Q&A data, supports rigorous human and automatic evaluation methodologies, and provides standardized baselines for both generic and adaptive model approaches. Results show marked improvement in answer quality when personal user profiles are leveraged, confirming the importance of context-relevant personalization in long-form QA.

1. Conceptual Foundation and Scope

LaMP-QA is designed to systematically evaluate question answering in a personalized, information-seeking framework. Unlike prior personalization studies focusing on short-form output or style-driven personalization (as in reviews or emails), LaMP-QA’s emphasis lies in generating long-form answers that address explicit requirements articulated by users. Each LaMP-QA instance is grounded in real-world data adapted from large community Q&A platforms (SE-PQA): each post consists of a question (title) and an associated narrative detailing the user’s background, motivations, or constraints.

Question topics cover three principal categories:

  • Arts & Entertainment
  • Lifestyle & Personal Development
  • Society & Culture

Collectively, these areas are represented in over 45 subcategories, ensuring broad topical diversity and enabling robust assessment across user types and domains.

2. Structure and Use of Personalization

Personalization within LaMP-QA is operationalized through user profiles. Each profile is a set of questions previously asked by the user, serving as a proxy for the content and priorities most relevant to them. The benchmark simulates deployment scenarios in which a QA system receives a new question $x_u$ from user $u$, then generates an answer $\hat{y}_u$ conditioned on both the question and the personal context $P_u$:

$\hat{y}_u = M(x_u, P_u)$

where $M$ denotes the LLM system.

This approach directly aligns with usage patterns on real community Q&A platforms, where user interests evolve and personalized suggestions or answers have practical impact on engagement and satisfaction.
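
To make the formulation concrete, the following Python sketch shows one way a system $M$ might condition generation on both the question $x_u$ and the profile $P_u$. It is a minimal illustration, not the benchmark's actual implementation: the prompt template and the `generate` callable are hypothetical placeholders for whatever LLM interface is used.

```python
# Minimal sketch of personalized generation: y_hat_u = M(x_u, P_u).
# `generate` is a placeholder for any LLM completion call (hypothetical).

def build_personalized_prompt(question: str, profile: list[str]) -> str:
    """Combine the user's past questions (the profile P_u) with the new question x_u."""
    profile_block = "\n".join(f"- {past_question}" for past_question in profile)
    return (
        "The user has previously asked the following questions:\n"
        f"{profile_block}\n\n"
        "Taking these interests and constraints into account, write a detailed, "
        "long-form answer to the new question below.\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer_personalized(question: str, profile: list[str], generate) -> str:
    """y_hat_u = M(x_u, P_u): the model sees the question and the personal context."""
    return generate(build_personalized_prompt(question, profile))
```

The non-personalized baseline discussed in Section 4 corresponds to calling `generate` on the question alone, with no profile block in the prompt.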

3. Evaluation Protocols

LaMP-QA employs a dual system of automatic and human-centered evaluation to robustly appraise generated answers:

  • Automatic Evaluation: LLM-based scoring is used to assess the alignment of generated answers with a set of rubric aspects $E_x$ distilled from the user’s narrative $r_x$. The process involves (i) extracting the core aspects $E_x = \{e_1, e_2, \ldots, e_n\}$, (ii) prompting LLMs to evaluate each generated answer $\hat{y}_u$ against these aspects, and (iii) assigning a score $s_a \in [0, 2]$ to each aspect. Each aspect score is then normalized (see the scoring sketch at the end of this section):

$\text{Normalized Aspect Score} = \frac{s_a}{2}$

and the overall quality score for an answer is the mean across its normalized aspect scores.

  • Human Evaluation: Annotators review the narrative, the extracted aspects, and the generated answers, and score (typically on a $1$–$5$ scale) how well each answer aligns with the user’s articulated needs. In addition, pairwise comparisons ask annotators to select, for a given question, the preferred of two candidate answers based on adherence to the narrative’s constraints.

Comparative experiments indicate that aspect-driven evaluation provides the strongest agreement with human judgments (73%), outperforming direct rating and pairwise-only methods.
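
As a minimal sketch of the aggregation step described above, assuming the LLM judge has already returned an integer score in $[0, 2]$ for each extracted aspect (the judging prompt itself is not shown):

```python
# Aspect-based scoring sketch: each aspect score s_a in {0, 1, 2} is normalized
# to [0, 1] by dividing by 2, and the answer's overall quality is the mean.

def aggregate_aspect_scores(aspect_scores: list[int]) -> float:
    """Return the overall quality score for one answer from its per-aspect scores."""
    if not aspect_scores:
        return 0.0
    normalized = [s / 2 for s in aspect_scores]   # Normalized Aspect Score = s_a / 2
    return sum(normalized) / len(normalized)      # mean across aspects

# Example: an answer judged on four extracted aspects e_1 .. e_4.
print(aggregate_aspect_scores([2, 1, 2, 0]))  # 0.625
```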

4. Modeling Approaches and Baselines

LaMP-QA benchmarks several QA strategies:

  • Non-Personalized Baseline: Models generate answers $\hat{y}_u = M(x_u)$ using only the question, independent of any user-specific context.
  • RAG-Personalization: Retrieval-augmented generation incorporates the top-$k$ most relevant items from the user profile, retrieved as $R(x_u, P_u, k)$, and concatenates them with the question for answer generation (see the sketch after this list):

$\hat{y}_u = M(x_u, R(x_u, P_u, k))$

  • Plan-RAG Personalization (PlanPers): A "planning" step precedes answer generation; a separate planner LLM $M_{\text{plan}}$ processes the question and the retrieved items to infer salient answer aspects $p_{x_u}$. The answering LLM then receives $(x_u, R(x_u, P_u, k), p_{x_u})$ as context.
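
A minimal sketch of the RAG-personalization pipeline under simple assumptions: the word-overlap retriever below is a toy stand-in for the benchmark's actual retriever $R$, and `generate` is again a hypothetical LLM call.

```python
# RAG-personalization sketch: y_hat_u = M(x_u, R(x_u, P_u, k)).
# The lexical-overlap retriever is a toy stand-in for the real retriever R.

def retrieve_top_k(question: str, profile: list[str], k: int) -> list[str]:
    """Rank profile items by word overlap with the question and keep the top k."""
    question_words = set(question.lower().split())
    ranked = sorted(
        profile,
        key=lambda item: len(question_words & set(item.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer_rag_personalized(question: str, profile: list[str], k: int, generate) -> str:
    """Concatenate the retrieved profile items with the question, then generate."""
    retrieved = retrieve_top_k(question, profile, k)
    context = "\n".join(f"- {item}" for item in retrieved)
    prompt = (
        "Relevant past questions from this user:\n"
        f"{context}\n\n"
        "Write a detailed, personalized answer to the new question below.\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```

The PlanPers variant would add one intermediate planner call that turns the retrieved items into a short list of answer aspects $p_{x_u}$, which is then appended to the prompt before the final answering call.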

Both open-source (Gemma 2, Qwen 2.5) and proprietary (GPT-4o-mini) LLMs are benchmarked. Explicit modeling of user intent via aspect planning yields enhanced alignment and answer quality irrespective of base model strength.

5. Experimental Findings and Performance Metrics

Empirical evaluation with LaMP-QA demonstrates that integrating personalized context through RAG methods delivers up to 39% improvement in answer quality over non-personalized baselines. When random or mismatched profiles are supplied (simulating poor personalization), answer quality decreases by as much as 62%, affirming that performance gains depend critically on the relevance of context.

A monotonic improvement trend is observed with increasing $k$ (the number of retrieved profile items), indicating that richer user context facilitates better personalization. The findings suggest that, for long-form QA, both retrieval precision and profile relevance are necessary drivers of quality.

6. Dataset Release and Research Implications

LaMP-QA is fully released on GitHub and Hugging Face, comprising the dataset, code, and evaluation scripts. This public release enables reproducible comparisons and fosters research into user-specific adaptation strategies. The resource supports development and fair evaluation against standard baselines, and provides a platform for exploring challenges such as privacy, diversity in user profiles, and advanced interaction modeling.

A plausible implication is that LaMP-QA’s rubric-driven, aspect-based evaluation protocol may be adopted for future personalization studies in other NLP domains, given its proven correspondence with human criteria.

7. Context and Prospective Developments

LaMP-QA represents a paradigm shift in the QA benchmarking landscape, moving beyond style-mimicry to focus on deep, context-driven personalization aligned with explicit user goals. By structuring evaluation around fine-grained aspect extraction and scoring, and demonstrating large performance differentials due to personal context, LaMP-QA sets a technical foundation for subsequent research on long-form personalized answer generation.

Future work will likely engage challenges in automatic evaluation refinement, privacy-aware personalization, multidimensional user profiling, and more nuanced understanding of the tradeoffs between answer generality and user specificity. Public release of the benchmark positions it as a central resource for advancing personalized QA models and methodologies (Salemi et al., 30 May 2025).
