PAL-Bench: Benchmark for Personalized Dialogue
- PAL-Bench is a benchmark that assesses personalized service-oriented dialogue systems through multi-session interactions and comprehensive user logs.
- It introduces PAL-Set, a synthetic Chinese-language dataset generated via an LLM pipeline and refined by human annotators to simulate realistic user experiences.
- H²Memory, a hierarchical and heterogeneous memory framework within PAL-Bench, enhances requirement restatement and solution selection, outperforming baseline models.
PAL-Bench is a benchmark for evaluating the personalization capabilities of service-oriented dialogue assistants in long-term user–agent interactions. It targets the assessment of assistants’ ability to understand and adapt to user-specific traits, evolving preferences, and behavioral history, with support for multi-session scenarios and complex requirement interpretation. Addressing the lack of public multi-session personalized dialogue benchmarks accompanied by long-term user logs, PAL-Bench introduces a synthetic dataset, PAL-Set, constructed via an LLM-centered pipeline and refined by human annotators. The benchmark is accompanied by the H²Memory framework, a hierarchical and heterogeneous memory architecture designed to model both concrete and abstract aspects of long-term user history, and demonstrates advances in requirement understanding and preference alignment over prior baselines (Huang et al., 17 Nov 2025).
1. Motivation and Benchmark Scope
PAL-Bench is designed in response to the increasing prevalence of service-oriented dialogue systems embedded in smart personal devices, where task fulfillment depends on accurate modeling of individual users’ requirements, goals, and personal context. Limitations of prior benchmarks include a focus on dialogue turns to the exclusion of behavioral logs, a homogeneous treatment of users, and a primary emphasis on factual knowledge retrieval. Furthermore, public datasets including user behavioral records remain unavailable due to privacy constraints, impeding reproducibility and controlled evaluation of memory-based personalization techniques.
PAL-Bench aims to fill these gaps by providing:
- Evaluation on underspecified, subjective requirements where long-term user context is essential for interpretation.
- Multi-session and multi-turn dialogues incorporating both device/app interaction logs and dialogue histories.
- Stratified simulation of objective (logs, events) and subjective (preferences, historical dialogues) user data, supporting fine-grained evaluation across requirement restatement, solution proposal, and multi-turn dialogue.
2. Evaluation Tasks and Metrics
PAL-Bench defines three core evaluation tasks:
- Requirement Restatement (Single-Turn QA):
- Input: User’s initial underspecified query with long-term memory context.
- Output: Single-sentence “complete requirement”—a restatement integrating implicit user needs.
- Metrics: BLEU-1 through BLEU-4 (against ground truth), and GPT-4 Score (0–2 scaled to 0–100) assessing implicit need capture.
- Solution Proposal (Single-Turn QA):
- Subtask A: Generate one-sentence solution for given complete requirement.
- Subtask B: Select the 2 best solutions from a list of 8 candidates.
- Metrics: BLEU-1 to BLEU-4 for generation (against two positive references); Selection Score in [−100, 100] based on correct/incorrect picks.
- Multi-Turn Dialogue Interaction:
- Role-played by a User-LLM, assessed by an Evaluation-LLM on:
- Requirement understanding
- Preference understanding
- Metric: Pairwise Win–Tie–Lose counts, following FairEval (six shuffles per comparison). Human ratings are used for correlation analysis.
These tasks are designed to systematically evaluate both the assistant’s capacity to recover latent requirements and to generate or select solutions consistent with historical user preferences.
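The paper's exact scoring formulas are not reproduced above; the sketch below shows one plausible reading of the single-turn metrics, assuming the GPT-4 Score is a mean 0–2 rating rescaled to 0–100 and the Selection Score rewards correct picks and penalizes incorrect ones symmetrically. Both are assumptions for illustration, not verbatim definitions from PAL-Bench.

```python
from typing import Iterable, Set

def gpt4_score(ratings: Iterable[float]) -> float:
    """Rescale per-example 0-2 ratings to a 0-100 benchmark score.

    Assumes the reported G-Score is simply the mean rating * 50;
    the exact aggregation used by PAL-Bench may differ.
    """
    ratings = list(ratings)
    return 50.0 * sum(ratings) / len(ratings)

def selection_score(picked: Set[int], positives: Set[int], negatives: Set[int]) -> float:
    """Score one solution-selection example in [-100, 100].

    Assumes: 2 picks out of 8 candidates, +50 per correctly picked positive,
    -50 per picked negative, 0 for a neutral candidate. This symmetric scheme
    matches the stated range but is an assumption.
    """
    score = 0.0
    for idx in picked:
        if idx in positives:
            score += 50.0
        elif idx in negatives:
            score -= 50.0
    return score

# Example: the model picks candidates 1 and 5; 1 is a positive, 5 is a negative.
print(selection_score({1, 5}, positives={1, 3}, negatives={5, 7}))  # 0.0
```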
3. PAL-Set Dataset Construction
PAL-Set is the foundational synthetic dataset for PAL-Bench, comprising Chinese-language multi-session user logs and dialogue histories. The construction process involves the following stages; a structural sketch of the resulting artifacts follows the list:
- Profile and Persona Generation: Each profile specifies gender, age, Big Five personality traits (high/medium/low), and brief aspect-based descriptions (work, health, family, leisure). Expanded personas include monthly timelines with objective events (6–12 months) and for each aspect, 4–5 abstract requirement types annotated with positive/negative detailed preferences.
- Session-Specific Scenario Expansion: Each timeline month is decomposed into 4–6 situations (≥5 sentences per situation), mapped to requirement types. Diary-style experience narratives ground the objective data, and dialogue frameworks are constructed, each comprising a user_query, implicit_needs, and the combined requirement. For each framework, eight candidate solutions are generated, of which two are labeled positive and two negative via a two-stage LLM prompt.
- Interaction Log Synthesis: Logs are simulated using eight predefined log types (Web Search, Device Operation, etc.), with each session containing at least 20 log entries. Multi-turn, multi-topic dialogues are crafted using sequence templates to ensure contextual consistency.
- Verification and Refinement: Automated checks confirm format and logical consistency; human annotators validate log-type matching, persona alignment, and coherence.
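As referenced above, here is a minimal sketch of how the generated artifacts could be organized. All class and field names (UserProfile, DialogueFramework, etc.) are hypothetical placeholders and not PAL-Set's actual schema keys.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical containers mirroring the construction stages described above;
# names and fields are illustrative, not taken from the released PAL-Set schema.

@dataclass
class UserProfile:
    gender: str
    age: int
    big_five: Dict[str, str]        # trait -> "high" / "medium" / "low"
    aspects: Dict[str, str]         # "work"/"health"/"family"/"leisure" -> short description

@dataclass
class RequirementType:
    name: str
    positive_preferences: List[str]
    negative_preferences: List[str]

@dataclass
class DialogueFramework:
    user_query: str
    implicit_needs: List[str]
    complete_requirement: str
    candidate_solutions: List[str]  # 8 candidates per framework
    positive_indices: List[int]     # 2 labeled positive
    negative_indices: List[int]     # 2 labeled negative

@dataclass
class Session:
    month: int
    situation: str                  # >= 5-sentence scenario narrative
    logs: List[Dict[str, str]]      # >= 20 entries drawn from 8 log types
    dialogue_frameworks: List[DialogueFramework] = field(default_factory=list)
```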
Dataset statistics (per-user averages, computed over 100 users):
| Statistic | History | Query |
|---|---|---|
| Avg. # sessions | 25.7 | 3.3 |
| Avg. # logs | 888.7 | 107.5 |
| Avg. # dialogue turns | 361.7 | 39.3 |
| Avg. # dialogue topics | 62.5 | 8.3 |
| Avg. # months | 8.4 | 1.0 |
Human evaluation of 50 sessions (3 annotators, 1–3 scale): Logs 2.75, Dialogues 2.67, indicating high profile consistency.
4. H²Memory: Hierarchical and Heterogeneous Memory Framework
H²Memory is architected to support PAL-Bench’s complex evaluation by abstracting and retrieving multi-level user information across four stores; a structural sketch follows the list:
- Situation Memory: Organizes logs into session-level subgraph “situations,” constructed from LLM-inferred “caused_by” or “follows” relations, with each situation summarized into a node.
- Background Memory: Aggregates aspect-wise summaries (“work”, “health”, “family”, “leisure”), recursively updated after every session.
- Topic Memory: For each dialogue topic, stores a (requirement, solutions + feedback, preference) tuple. Requirements are refined using retrieval of the k nearest situations.
- Principle Memory: Encodes abstract requirement types and preference principles, obtained by clustering topic requirements and summarizing them via LLM; updated dynamically as new inputs arrive.
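As the structural sketch referenced above, the skeleton below shows one way the four stores could be laid out; class names, fields, and the `summarize` callback are hypothetical placeholders rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical skeleton of the four H2Memory stores; all names are illustrative.

@dataclass
class SituationNode:
    session_id: str
    summary: str                        # LLM summary of one session-level subgraph
    relations: List[Tuple[str, str]]    # ("caused_by" | "follows", other session_id)

@dataclass
class TopicEntry:
    requirement: str                    # refined via retrieval of nearby situations
    solutions_with_feedback: List[Tuple[str, str]]
    preference: str

@dataclass
class H2Memory:
    situations: List[SituationNode] = field(default_factory=list)
    background: Dict[str, str] = field(default_factory=dict)        # aspect -> running summary
    topics: List[TopicEntry] = field(default_factory=list)
    principles: Dict[str, List[str]] = field(default_factory=dict)  # requirement type -> preference principles

    def update_background(self, aspect: str, session_summary: str, summarize) -> None:
        """Recursively fold a new session summary into an aspect summary.

        `summarize` stands in for an LLM call that merges the existing
        aspect summary with the new session information.
        """
        old = self.background.get(aspect, "")
        self.background[aspect] = summarize(old, session_summary)
```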
The retrieval mechanism scores candidate memory items by cosine similarity between the query embedding and each memory embedding, $\mathrm{sim}(q, m) = \frac{\mathbf{e}_q \cdot \mathbf{e}_m}{\lVert \mathbf{e}_q \rVert\, \lVert \mathbf{e}_m \rVert}$, and response generation is formulated as $r = \mathrm{LLM}(q, \mathcal{M})$, where $\mathcal{M}$ is the collection of retrieved memory items.
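A minimal sketch of this retrieve-then-generate step, assuming a generic embedding function and an LLM callable as placeholders (the prompt format is illustrative, not the paper's):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb: np.ndarray, memory_items: list, memory_embs: list, k: int = 5) -> list:
    """Return the k memory items most similar to the query embedding."""
    scores = [cosine_similarity(query_emb, emb) for emb in memory_embs]
    ranked = sorted(range(len(memory_items)), key=lambda i: scores[i], reverse=True)
    return [memory_items[i] for i in ranked[:k]]

def respond(query: str, embed, memory_items: list, llm) -> str:
    """Generate a response conditioned on the retrieved memory collection.

    `embed` and `llm` are placeholders for an embedding model and an LLM call.
    """
    query_emb = embed(query)
    retrieved = retrieve(query_emb, memory_items, [embed(m) for m in memory_items])
    context = "\n".join(retrieved)
    return llm(f"User memory:\n{context}\n\nUser query: {query}\nAssistant:")
```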
5. Experimental Results and Comparative Analysis
PAL-Bench’s evaluation protocol compares H²Memory to several baselines, each differing in memory and retrieval strategy:
- Vanilla (with and without logs)
- Turn-level RAG
- Session-level RAG
- RecurSum
- ConditionMem
- MemoryBank
Single-Turn QA Results:
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | GPT-4 Score | Selection Score |
|---|---|---|---|---|---|---|
| Vanilla (w/o log) | 13.59 | 5.76 | 2.58 | 1.41 | 17.50 | 18.95 |
| Vanilla (with log) | 19.71 | 8.85 | 4.10 | 2.29 | 23.00 | 22.88 |
| Turn-level RAG | 22.74 | 10.54 | 4.94 | 2.69 | 26.85 | 24.09 |
| Session-level RAG | 23.81 | 11.24 | 5.42 | 3.06 | 29.33 | 33.78 |
| RecurSum | 23.29 | 10.64 | 4.95 | 2.75 | 28.36 | 25.61 |
| ConditionMem | 23.31 | 10.42 | 4.86 | 2.66 | 27.78 | 25.49 |
| MemoryBank | 23.89 | 11.11 | 5.23 | 2.91 | 28.57 | 29.85 |
| H²Memory (ours) | 26.67 | 12.18 | 5.68 | 3.09 | 32.54 | 38.32 |
In multi-turn dialogue interaction, H²Memory consistently records more wins than losses against every baseline for both requirement and preference understanding.
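For reference, the pairwise protocol from Section 2 can be sketched as follows; the alternating presentation order and majority vote are a simplified stand-in for FairEval's six-shuffle procedure, and the `judge` callable is a placeholder for the Evaluation-LLM.

```python
from collections import Counter

def pairwise_judgment(query: str, resp_a: str, resp_b: str, judge, n_shuffles: int = 6) -> str:
    """Aggregate position-shuffled LLM judgments into "win", "tie", or "lose" for system A.

    `judge` is a placeholder for an Evaluation-LLM call that returns "first",
    "second", or "tie" for the response shown first vs. second. Alternating
    the presentation order mitigates position bias, loosely following FairEval.
    """
    votes = Counter()
    for i in range(n_shuffles):
        first, second = (resp_a, resp_b) if i % 2 == 0 else (resp_b, resp_a)
        verdict = judge(query, first, second)      # "first" | "second" | "tie"
        if verdict == "tie":
            votes["tie"] += 1
        elif (verdict == "first") == (i % 2 == 0):
            votes["win"] += 1                      # system A preferred
        else:
            votes["lose"] += 1                     # system B preferred
    # Majority vote; a tied vote count falls back to "tie".
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "tie"
    return top[0][0]
```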
External Validation (LongMemEval, “single-session-preference” subset):
- H²Memory (ours): 50.00% accuracy.
- ConditionMem: 40.00%.
- MemoryBank: 23.33%.
6. Insights, Limitations, and Future Directions
Empirical analyses from PAL-Bench indicate that explicit modeling of both abstract and concrete user information (logs, dialogues, background aspects, preference principles) yields consistent performance improvements. Gains are most pronounced in requirement restatement (+2.78 BLEU-1 and +3.21 GPT-4 Score over the strongest baseline on each metric) and solution selection (+4.54 Selection Score over session-level RAG). In multi-turn interactions, H²Memory substantially improves alignment with user requirements and preferences.
Limitations include:
- Occasional performance drops due to stochastic LLM role-play behavior and prompt sensitivity.
- Potential under-representation of atypical real-world behaviors due to synthetic data generation.
- Challenges in balancing the contributions of individual memory components, since ablating some components primarily degrades requirement restatement while ablating others chiefly affects preference modeling.
Proposed directions:
- Integration of multimodal context (e.g., screenshots, sensor data) into situation memory.
- Online memory condensation to accommodate very long-term histories efficiently.
- Incorporation of privacy-preserving real user data and fairness/bias control in memory modeling.
- Evaluation in evolving persona and continuous learning scenarios.
PAL-Bench and its associated dataset, PAL-Set, establish a principled and reproducible testbed for research in memory-based, personalized service-oriented dialogue systems, directly supporting the development and comparative analysis of memory architectures such as H²Memory (Huang et al., 17 Nov 2025).