WorthBuying Dataset: Purchase Reason Benchmark
- The WorthBuying Dataset is a comprehensive resource that annotates explicit and implicit purchase reasons alongside post-purchase sentiments from Amazon reviews.
- It utilizes LLM-generated JSON explanations evaluated through automated and human annotations, ensuring low hallucination and high correctness.
- The dataset supports benchmark tasks for personalized recommendation explanations, joint rating prediction, and detailed marketing analysis.
The WorthBuying Dataset, described in "Unlocking the ‘Why’ of Buying: Introducing a New Dataset and Benchmark for Purchase Reason and Post-Purchase Experience" (Chen et al., 20 Feb 2024), is a large-scale resource for purchase reason prediction that systematically and explicitly distinguishes pre-purchase motivation from post-purchase sentiment in user product reviews. It provides benchmarks for explanation generation, particularly with LLMs, in personalized recommendation, marketing analytics, and user modeling.
1. Dataset Definition and Structure
The dataset, termed the purchase reason explanation dataset, is drawn from the Amazon product review 5-core dataset (75 million reviews, where every user and item has at least five reviews). The current version comprises a random sample of 10,000 reviews spanning major product categories: Electronics, Books, Fashion (Clothing, Shoes & Jewelry), Home & Kitchen, and Sports & Outdoors.
Each record contains:
- Product Information: Title and description
- User Review: Original text
- Explanations (LLM-generated JSON fields):
  - Explicit Purchase Reason: Reason(s) directly stated in the review
  - Implicit Purchase Reason: Inferred motivations deducible from context
  - Purchase Reason Explanation: Rationale for identification of the above
  - Post-Purchase Experience: User sentiment or factual experience after purchase
All explanations are concise short texts, each categorized as personal or generic.
| Field | Example Value | Type |
|---|---|---|
| explicit_purchase_reason | "To help aging parents who need a clock ..." | Short Text |
| implicit_purchase_reason | "Multiple alarm options (everyday, weekday only, weekend only)" | Short Text |
| purchase_reason_explanation | "Emphasizes readability and usability for elderly parents..." | Short Text |
| post_purchase_experience | "Parents love the clock, met expectations ..." | Short Text |
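For concreteness, a single record might be represented as the JSON object below. The four explanation fields use the names from the table; the product and review keys are assumed names for illustration, not confirmed by the paper.

```python
# Sketch of one WorthBuying record. The explanation field names follow the
# table above; "product_title", "product_description", and "review_text"
# are assumed names for the metadata fields.
record = {
    "product_title": "Large-display digital alarm clock",
    "product_description": "Dual-alarm clock with adjustable brightness ...",
    "review_text": "Bought this for my aging parents who struggle to read ...",
    "explicit_purchase_reason": "To help aging parents who need a clock ...",
    "implicit_purchase_reason": "Multiple alarm options (everyday, weekday only, weekend only)",
    "purchase_reason_explanation": "Emphasizes readability and usability for elderly parents ...",
    "post_purchase_experience": "Parents love the clock, met expectations ...",
}
```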
2. Data Generation and Annotation Methodology
Source and Sampling
- Source: Amazon 5-core reviews
- Sample size: 10,000 reviews (kept small to limit LLM computation cost)
LLM-based Generation
- Model: Gemini Ultra (Google)
- Prompting (a template is sketched below):
  - Simultaneous request for purchase reasons and experiences, with clear distinction
  - Explicit prompts for both direct and inferred (implicit) motivations
  - Instructions to avoid hallucination and provide supporting evidence
  - Output enforced in JSON for machine readability
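A generation prompt in this spirit might look like the following sketch; the actual instructions sent to Gemini Ultra are not reproduced here, so the template wording is an assumption.

```python
# Illustrative generation prompt; the exact wording is an assumption modeled
# on the four prompting requirements listed above.
PROMPT_TEMPLATE = """\
Given the product information and the user review below, produce a JSON object
with exactly these keys:
  "explicit_purchase_reason": reason(s) the reviewer states directly,
  "implicit_purchase_reason": motivations inferable from context,
  "purchase_reason_explanation": your rationale, citing supporting evidence,
  "post_purchase_experience": the reviewer's sentiment or experience after purchase.
Keep purchase reasons (pre-purchase) clearly separate from post-purchase
experience, and do not invent anything unsupported by the review.

Product: {product}
Review: {review}
"""

def build_prompt(product: str, review: str) -> str:
    return PROMPT_TEMPLATE.format(product=product, review=review)
```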
Evaluation Protocol
- Automated LLM Rater: Explanations rated by a secondary LLM along four axes:
  - Hallucination
  - Correctness (categorization)
  - Completeness
  - Personalization (specificity)
- Human Annotation: 100 samples double-annotated by humans and compared against the LLM rater, with high agreement on hallucination and correctness; the LLM is stricter on completeness but more generous on personalization
Coverage/Quality:
- Purchase reasons found in 96.4% of reviews (explicit: 61.8%, implicit: 70.9%)
- Post-purchase experiences found in 88.2%
- Personalization rates: 70.1% (reasons), 71.18% (experiences)
- Hallucination <0.5%, Correctness >99%, Completeness ~77%
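The four-axis automated rating could be implemented with a secondary-LLM prompt along these lines; this is a sketch based on the axes listed above, not the paper's actual rater prompt.

```python
# Illustrative rater prompt for the automated evaluation; the wording and the
# yes/no answer format are assumptions based on the four axes described above.
RATER_TEMPLATE = """\
Compare the extracted explanation against the source review and answer in JSON:
  "hallucination": does the extraction claim anything unsupported by the review? (yes/no)
  "correctness": are reasons correctly categorized as explicit vs. implicit? (yes/no)
  "completeness": are all purchase reasons in the review captured? (yes/no)
  "personalization": is the extraction user-specific rather than generic? (yes/no)

Review: {review}
Extraction: {extraction}
"""
```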
3. Benchmark Tasks and Evaluation Metrics
The WorthBuying Dataset supports multiple benchmarking tasks for personalized natural language explanation generation in recommender systems:
Task Definitions
- Task 1: Generate explanation for recommending a specific item (user and item info provided)
- Task 2: Generate explanation including user's rating
- Task 3: Joint prediction of user rating and recommendation explanation
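For Task 3, for instance, a model's joint output could be structured as below; the exact output schema is an assumption.

```python
# Hypothetical Task 3 output: the model predicts the rating and the
# recommendation explanation together. The JSON layout is an assumption.
task3_output = {
    "rating": 5,  # predicted star rating on Amazon's 1-5 scale
    "explanation": (
        "Recommended because the large display and multiple alarm modes "
        "match the user's stated need for a clock their parents can read."
    ),
}
```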
Representations
- User: Past reviews (UserReview) or LLM-summarized profiles (UserProfile)
- Item: Metadata, past reviews (ItemReview), or summary (ItemProfile)
- Review/history truncation: max 10 most recent, max 8k tokens
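The truncation rule could be implemented as in the sketch below; whitespace splitting stands in for the unspecified tokenizer.

```python
# Sketch of the history-truncation rule: keep at most the 10 most recent
# reviews and cap the concatenated history at 8k tokens. Whitespace splitting
# is an assumption standing in for the actual tokenizer.
MAX_REVIEWS = 10
MAX_TOKENS = 8_000

def truncate_history(reviews: list[str]) -> str:
    """reviews is ordered oldest to newest; returns the truncated history."""
    recent = reviews[-MAX_REVIEWS:]      # at most the 10 most recent reviews
    kept: list[str] = []
    for review in reversed(recent):      # fill the token budget newest-first
        words = review.split()
        if len(kept) + len(words) > MAX_TOKENS:
            break
        kept = words + kept              # preserve chronological order
    return " ".join(kept)
```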
Evaluation
Standard NLG metrics:
- BLEU: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $p_n$ is the modified $n$-gram precision, $w_n$ the $n$-gram weight, and $\mathrm{BP}$ the brevity penalty
- ROUGE-1, ROUGE-2, ROUGE-Lsum
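These metrics could be computed with off-the-shelf libraries such as sacrebleu and rouge-score; the paper's exact metric implementations are not specified here, so the library choice is an assumption.

```python
# Scoring sketch using sacrebleu and rouge-score (assumed libraries).
import sacrebleu
from rouge_score import rouge_scorer

def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    # Corpus-level BLEU over all generated explanations.
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeLsum"], use_stemmer=True
    )
    totals = {k: 0.0 for k in ("rouge1", "rouge2", "rougeLsum")}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)  # rouge-score expects (target, prediction)
        for k in totals:
            totals[k] += scores[k].fmeasure
    n = len(predictions)
    # Average F-measures, scaled to the 0-100 range used in the results below.
    return {"BLEU": bleu, **{k: 100 * v / n for k, v in totals.items()}}
```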
4. Experimental Results and Data Characteristics
Overall Performance
- Gemini Ultra (zero-shot) benchmarks:
  - Best representations: User’s raw past reviews + item metadata
  - Purchase Reason: BLEU 6.46, ROUGE-1 22.14, ROUGE-2 8.45, ROUGE-Lsum 20.38
  - Post-Purchase Experience: BLEU 3.66, ROUGE-1 21.35, ROUGE-2 5.34, ROUGE-Lsum 16.21
- Summarizing user reviews attenuates personal detail
- Adding item reviews can introduce noise
- Variations in benchmark task (e.g., rating presence, joint tasks) produce minimal change in zero-shot LLM scores
Category-wise Analysis
- Electronics, Home & Kitchen: Highest metric scores
- Books, Fashion: Lower scores, attributable to more personal language/use cases and less explicit reasoning
- Performance highest when purchase reasoning is strongly reflected in product metadata
5. Downstream Applications and Utility
Demonstrated applications include:
- Recommendation Explanation Generation: Personalized and context-aware justifications for recommendations, both pre- and post-purchase
- Justification of Recommendations: Enables systems to articulate why an item is recommended, preceding the purchase event
- Marketing Analysis: Facilitates granular analysis of purchase intent versus satisfaction, supporting sophisticated segmentation and targeting
- Recommender Systems: Potential for more credible, persuasive recommendation outputs by grounding explanations in both explicit and implicit user motives
- User Modeling: Enables richer, context-dependent profile construction by capturing motivational and experiential information
6. Advances, Limitations, and Research Impact
Advances over Prior Work
- Previous datasets relied on post-purchase or generic sentiment extraction, missing context and personal motivation
- The WorthBuying Dataset is the first large-scale resource to annotate pre-purchase reasons and post-purchase experiences distinctly, including rationale fields
- Enables direct benchmarking of pre-purchase explanation tasks, facilitating research that bridges the gap between generic recommendation rationales and personalized decision drivers
Potential Impact
- Purchase Reason Prediction: Supports development and assessment of models capable of generating fine-grained, user-specific rationales
- Explainable AI: Introduces rigorous criteria for explanation relevance and completeness beyond generic NLG fluency
- Marketing Science: Provides infrastructure to model and understand factors influencing purchase intent, improving campaign efficacy
- Personalization: Advances explainable recommendation research towards contextual, individualized justifications
7. Summary
The WorthBuying Dataset is the first benchmark for systematic, large-scale annotation and modeling of both explicit/implicit purchase reasons and post-purchase experiences. Leveraging high-fidelity LLM generation and evaluation, it enables robust explanation-based benchmarking, recommendation justification, and comprehensive marketing and user-modeling research. By moving explainable recommendation from post-hoc satisfaction explanation to pre-purchase motivational articulation, it marks a significant methodological advance for recommendation science and marketing analytics.