PAL-Bench: Benchmark for Personalized Dialogue
- PAL-Bench is a benchmark that assesses personalized service-oriented dialogue systems through multi-session interactions and comprehensive user logs.
- It introduces PAL-Set, a synthetic Chinese-language dataset generated via an LLM pipeline and refined by human annotators to simulate realistic user experiences.
- H²Memory, a hierarchical and heterogeneous memory framework within PAL-Bench, enhances requirement restatement and solution selection, outperforming baseline models.
PAL-Bench is a benchmark for evaluating the personalization capabilities of service-oriented dialogue assistants in long-term user–agent interactions. It targets the assessment of assistants’ ability to understand and adapt to user-specific traits, evolving preferences, and behavioral history, with support for multi-session scenarios and complex requirement interpretation. Addressing the lack of public multi-session personalized dialogue benchmarks accompanied by long-term user logs, PAL-Bench introduces a synthetic dataset, PAL-Set, constructed via an LLM-centered pipeline and refined by human annotators. The benchmark is accompanied by the H²Memory framework, a hierarchical and heterogeneous memory architecture designed to model both concrete and abstract aspects of long-term user history, and demonstrates advances in requirement understanding and preference alignment over prior baselines (Huang et al., 17 Nov 2025).
1. Motivation and Benchmark Scope
PAL-Bench is designed in response to the increasing prevalence of service-oriented dialogue systems embedded in smart personal devices, where task fulfillment depends on accurate modeling of individual users’ requirements, goals, and personal context. Limitations of prior benchmarks include a focus on dialogue turns to the exclusion of behavioral logs, a homogeneous treatment of users, and a primary emphasis on factual knowledge retrieval. Furthermore, public datasets including user behavioral records remain unavailable due to privacy constraints, impeding reproducibility and controlled evaluation of memory-based personalization techniques.
PAL-Bench aims to fill these gaps by providing:
- Evaluation on underspecified, subjective requirements where long-term user context is essential for interpretation.
- Multi-session and multi-turn dialogues incorporating both device/app interaction logs and dialogue histories.
- Stratified simulation of objective (logs, events) and subjective (preferences, historical dialogues) user data, supporting fine-grained evaluation across requirement restatement, solution proposal, and multi-turn dialogue.
2. Evaluation Tasks and Metrics
PAL-Bench defines three core evaluation tasks:
- Requirement Restatement (Single-Turn QA):
- Input: User’s initial underspecified query with long-term memory context.
- Output: Single-sentence “complete requirement”—a restatement integrating implicit user needs.
- Metrics: BLEU-1 through BLEU-4 (against ground truth), and GPT-4 Score (0–2 scaled to 0–100) assessing implicit need capture.
- Solution Proposal (Single-Turn QA):
- Subtask A: Generate one-sentence solution for given complete requirement.
- Subtask B: Select the 2 best solutions from a list of 8 candidates.
- Metrics: BLEU-1 to BLEU-4 for generation (against two positive references); Selection Score in [−100, 100] based on correct/incorrect picks.
- Multi-Turn Dialogue Interaction:
- Role-played by a User-LLM, assessed by an Evaluation-LLM on:
- Requirement understanding
- Preference understanding
- Metric: Pairwise Win–Tie–Lose counts, following FairEval (six shuffles per comparison). Human ratings are used for correlation analysis.
These tasks are designed to systematically evaluate both the assistant’s capacity to recover latent requirements and to generate or select solutions consistent with historical user preferences.
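The paper's exact scoring formulas are not reproduced above; the sketch below shows one plausible reading of the single-turn metrics, assuming the GPT-4 Score is a mean 0–2 rating rescaled to 0–100 and the Selection Score rewards correct picks and penalizes incorrect ones symmetrically. Both are assumptions for illustration, not verbatim definitions from PAL-Bench.

```python
from typing import Iterable, Set

def gpt4_score(ratings: Iterable[float]) -> float:
    """Rescale per-example 0-2 ratings to a 0-100 benchmark score.

    Assumes the reported G-Score is simply the mean rating * 50;
    the exact aggregation used by PAL-Bench may differ.
    """
    ratings = list(ratings)
    return 50.0 * sum(ratings) / len(ratings)

def selection_score(picked: Set[int], positives: Set[int], negatives: Set[int]) -> float:
    """Score one solution-selection example in [-100, 100].

    Assumes: 2 picks out of 8 candidates, +50 per correctly picked positive,
    -50 per picked negative, 0 for a neutral candidate. This symmetric scheme
    matches the stated range but is an assumption.
    """
    score = 0.0
    for idx in picked:
        if idx in positives:
            score += 50.0
        elif idx in negatives:
            score -= 50.0
    return score

# Example: the model picks candidates 1 and 5; 1 is a positive, 5 is a negative.
print(selection_score({1, 5}, positives={1, 3}, negatives={5, 7}))  # 0.0
```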
3. PAL-Set Dataset Construction
PAL-Set is the foundational synthetic dataset for PAL-Bench, comprising Chinese-language multi-session user logs and dialogue histories. The construction process involves the following stages; a structural sketch of the resulting artifacts follows the list:
- Profile and Persona Generation: Each profile specifies gender, age, Big Five personality traits (high/medium/low), and brief aspect-based descriptions (work, health, family, leisure). Expanded personas include monthly timelines with objective events (6–12 months) and for each aspect, 4–5 abstract requirement types annotated with positive/negative detailed preferences.
- Session-Specific Scenario Expansion: Each timeline month is decomposed into 4–6 situations (≥5 sentences per situation), mapped to requirement types. Diary-style experience narratives ground the objective data, and dialogue frameworks are constructed, each comprising a user_query, implicit_needs, and the combined requirement. For each framework, eight candidate solutions are generated, of which two are labeled positive and two negative via a two-stage LLM prompt.
- Interaction Log Synthesis: Logs are simulated using eight predefined log types (Web Search, Device Operation, etc.), with each session containing at least 20 log entries. Multi-turn, multi-topic dialogues are crafted using sequence templates to ensure contextual consistency.
- Verification and Refinement: Automated checks confirm format and logical consistency; human annotators validate log-type matching, persona alignment, and coherence.
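As referenced above, here is a minimal sketch of how the generated artifacts could be organized. All class and field names (UserProfile, DialogueFramework, etc.) are hypothetical placeholders and not PAL-Set's actual schema keys.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical containers mirroring the construction stages described above;
# names and fields are illustrative, not taken from the released PAL-Set schema.

@dataclass
class UserProfile:
    gender: str
    age: int
    big_five: Dict[str, str]        # trait -> "high" / "medium" / "low"
    aspects: Dict[str, str]         # "work"/"health"/"family"/"leisure" -> short description

@dataclass
class RequirementType:
    name: str
    positive_preferences: List[str]
    negative_preferences: List[str]

@dataclass
class DialogueFramework:
    user_query: str
    implicit_needs: List[str]
    complete_requirement: str
    candidate_solutions: List[str]  # 8 candidates per framework
    positive_indices: List[int]     # 2 labeled positive
    negative_indices: List[int]     # 2 labeled negative

@dataclass
class Session:
    month: int
    situation: str                  # >= 5-sentence scenario narrative
    logs: List[Dict[str, str]]      # >= 20 entries drawn from 8 log types
    dialogue_frameworks: List[DialogueFramework] = field(default_factory=list)
```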
Dataset statistics (per-user averages, computed over 100 users):
| Statistic | History | Query |
|---|---|---|
| Avg. # sessions | 25.7 | 3.3 |
| Avg. # logs | 888.7 | 107.5 |
| Avg. # dialogue turns | 361.7 | 39.3 |
| Avg. # dialogue topics | 62.5 | 8.3 |
| Avg. # months | 8.4 | 1.0 |
Human evaluation of 50 sessions (3 annotators, 1–3 scale): Logs 2.75, Dialogues 2.67, indicating high profile consistency.
4. H²Memory: Hierarchical and Heterogeneous Memory Framework
H²Memory is architected to support PAL-Bench’s complex evaluation by abstracting and retrieving multi-level user information across four stores; a structural sketch follows the list:
- Situation Memory: Organizes logs into session-level subgraph “situations,” constructed from LLM-inferred “caused_by” or “follows” relations, with each situation summarized into a node.
- Background Memory: Aggregates aspect-wise summaries (“work”, “health”, “family”, “leisure”), recursively updated after every session.
- Topic Memory: For each dialogue topic, stores a (requirement, solutions + feedback, preference) tuple. Requirements are refined using retrieval of the k nearest situations.
- Principle Memory: Encodes abstract requirement types and preference principles, obtained by clustering topic requirements and summarizing them via LLM; updated dynamically as new inputs arrive.
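As the structural sketch referenced above, the skeleton below shows one way the four stores could be laid out; class names, fields, and the `summarize` callback are hypothetical placeholders rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical skeleton of the four H2Memory stores; all names are illustrative.

@dataclass
class SituationNode:
    session_id: str
    summary: str                        # LLM summary of one session-level subgraph
    relations: List[Tuple[str, str]]    # ("caused_by" | "follows", other session_id)

@dataclass
class TopicEntry:
    requirement: str                    # refined via retrieval of nearby situations
    solutions_with_feedback: List[Tuple[str, str]]
    preference: str

@dataclass
class H2Memory:
    situations: List[SituationNode] = field(default_factory=list)
    background: Dict[str, str] = field(default_factory=dict)        # aspect -> running summary
    topics: List[TopicEntry] = field(default_factory=list)
    principles: Dict[str, List[str]] = field(default_factory=dict)  # requirement type -> preference principles

    def update_background(self, aspect: str, session_summary: str, summarize) -> None:
        """Recursively fold a new session summary into an aspect summary.

        `summarize` stands in for an LLM call that merges the existing
        aspect summary with the new session information.
        """
        old = self.background.get(aspect, "")
        self.background[aspect] = summarize(old, session_summary)
```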
The retrieval mechanism scores candidate memory items by cosine similarity between the query embedding and each memory embedding, $\mathrm{sim}(q, m) = \frac{\mathbf{e}_q \cdot \mathbf{e}_m}{\lVert \mathbf{e}_q \rVert\, \lVert \mathbf{e}_m \rVert}$, and response generation is formulated as $r = \mathrm{LLM}(q, \mathcal{M})$, where $\mathcal{M}$ is the collection of retrieved memory items.
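A minimal sketch of this retrieve-then-generate step, assuming a generic embedding function and an LLM callable as placeholders (the prompt format is illustrative, not the paper's):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb: np.ndarray, memory_items: list, memory_embs: list, k: int = 5) -> list:
    """Return the k memory items most similar to the query embedding."""
    scores = [cosine_similarity(query_emb, emb) for emb in memory_embs]
    ranked = sorted(range(len(memory_items)), key=lambda i: scores[i], reverse=True)
    return [memory_items[i] for i in ranked[:k]]

def respond(query: str, embed, memory_items: list, llm) -> str:
    """Generate a response conditioned on the retrieved memory collection.

    `embed` and `llm` are placeholders for an embedding model and an LLM call.
    """
    query_emb = embed(query)
    retrieved = retrieve(query_emb, memory_items, [embed(m) for m in memory_items])
    context = "\n".join(retrieved)
    return llm(f"User memory:\n{context}\n\nUser query: {query}\nAssistant:")
```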
5. Experimental Results and Comparative Analysis
PAL-Bench’s evaluation protocol compares H²Memory to several baselines, each differing in memory and retrieval strategy:
- Vanilla (with and without logs)
- Turn-level RAG
- Session-level RAG
- RecurSum
- ConditionMem
- MemoryBank
Single-Turn QA Results:
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | GPT-4 Score | Selection Score |
|---|---|---|---|---|---|---|
| Vanilla (w/o log) | 13.59 | 5.76 | 2.58 | 1.41 | 17.50 | 18.95 |
| Vanilla (with log) | 19.71 | 8.85 | 4.10 | 2.29 | 23.00 | 22.88 |
| Turn-level RAG | 22.74 | 10.54 | 4.94 | 2.69 | 26.85 | 24.09 |
| Session-level RAG | 23.81 | 11.24 | 5.42 | 3.06 | 29.33 | 33.78 |
| RecurSum | 23.29 | 10.64 | 4.95 | 2.75 | 28.36 | 25.61 |
| ConditionMem | 23.31 | 10.42 | 4.86 | 2.66 | 27.78 | 25.49 |
| MemoryBank | 23.89 | 11.11 | 5.23 | 2.91 | 28.57 | 29.85 |
| H²Memory (ours) | 26.67 | 12.18 | 5.68 | 3.09 | 32.54 | 38.32 |
In multi-turn dialogue interaction, H²Memory consistently records more wins than losses against every baseline for both requirement and preference understanding.
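For reference, the pairwise protocol from Section 2 can be sketched as follows; the alternating presentation order and majority vote are a simplified stand-in for FairEval's six-shuffle procedure, and the `judge` callable is a placeholder for the Evaluation-LLM.

```python
from collections import Counter

def pairwise_judgment(query: str, resp_a: str, resp_b: str, judge, n_shuffles: int = 6) -> str:
    """Aggregate position-shuffled LLM judgments into "win", "tie", or "lose" for system A.

    `judge` is a placeholder for an Evaluation-LLM call that returns "first",
    "second", or "tie" for the response shown first vs. second. Alternating
    the presentation order mitigates position bias, loosely following FairEval.
    """
    votes = Counter()
    for i in range(n_shuffles):
        first, second = (resp_a, resp_b) if i % 2 == 0 else (resp_b, resp_a)
        verdict = judge(query, first, second)      # "first" | "second" | "tie"
        if verdict == "tie":
            votes["tie"] += 1
        elif (verdict == "first") == (i % 2 == 0):
            votes["win"] += 1                      # system A preferred
        else:
            votes["lose"] += 1                     # system B preferred
    # Majority vote; a tied vote count falls back to "tie".
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "tie"
    return top[0][0]
```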
External Validation (LongMemEval, “single-session-preference” subset):
- H²Memory (ours): 50.00% accuracy.
- ConditionMem: 40.00%.
- MemoryBank: 23.33%.
6. Insights, Limitations, and Future Directions
Empirical analyses from PAL-Bench indicate that explicit modeling of both abstract and concrete user information (logs, dialogues, background aspects, preference principles) yields consistent performance improvements. Gains are most pronounced in requirement restatement (+2.78 BLEU-1 and +3.21 GPT-4 Score over the strongest baseline on each metric) and solution selection (+4.54 Selection Score over session-level RAG). In multi-turn interactions, H²Memory substantially improves alignment with user requirements and preferences.
Limitations include:
- Occasional performance drops due to stochastic LLM role-play behavior and prompt sensitivity.
- Potential under-representation of atypical real-world behaviors due to synthetic data generation.
- Challenges in balancing the contributions of individual memory components, since ablating some components primarily degrades requirement restatement while ablating others chiefly affects preference modeling.
Proposed directions:
- Integration of multimodal context (e.g., screenshots, sensor data) into situation memory.
- Online memory condensation to accommodate very long-term histories efficiently.
- Incorporation of privacy-preserving real user data and fairness/bias control in memory modeling.
- Evaluation in evolving persona and continuous learning scenarios.
PAL-Bench and its associated dataset, PAL-Set, establish a principled and reproducible testbed for research in memory-based, personalized service-oriented dialogue systems, directly supporting the development and comparative analysis of memory architectures such as H²Memory (Huang et al., 17 Nov 2025).