WorthBuying Dataset: Purchase Reason Benchmark
- The WorthBuying Dataset is a comprehensive resource that annotates explicit and implicit purchase reasons alongside post-purchase sentiments from Amazon reviews.
- It utilizes LLM-generated JSON explanations evaluated through automated and human annotations, ensuring low hallucination and high correctness.
- The dataset supports benchmark tasks for personalized recommendation explanations, joint rating prediction, and detailed marketing analysis.
The WorthBuying Dataset, described in "Unlocking the ‘Why’ of Buying: Introducing a New Dataset and Benchmark for Purchase Reason and Post-Purchase Experience" (Chen et al., 20 Feb 2024), is a large-scale resource for purchase reason prediction that systematically and explicitly distinguishes pre-purchase motivation from post-purchase sentiment in user product reviews. It provides benchmarks for explanation generation, particularly with LLMs, in personalized recommendation, marketing analytics, and user modeling.
1. Dataset Definition and Structure
The dataset, termed the purchase reason explanation dataset, is drawn from the Amazon product review 5-core dataset (75 million reviews, where every user and item has at least five reviews). The current version comprises a random sample of 10,000 reviews spanning major product categories: Electronics, Books, Fashion (Clothing, Shoes & Jewelry), Home & Kitchen, and Sports & Outdoors.
Each record contains:
- Product Information: Title and description
- User Review: Original text
- Explanations (LLM-generated JSON fields):
  - Explicit Purchase Reason: Reason(s) directly stated in the review
  - Implicit Purchase Reason: Inferred motivations deducible from context
  - Purchase Reason Explanation: Rationale for identification of the above
  - Post-Purchase Experience: User sentiment or factual experience after purchase
All explanations are concise short texts, each categorized as personal or generic.
| Field | Example Value | Type |
|---|---|---|
| explicit_purchase_reason | "To help aging parents who need a clock ..." | Short Text |
| implicit_purchase_reason | "Multiple alarm options (everyday, weekday only, weekend only)" | Short Text |
| purchase_reason_explanation | "Emphasizes readability and usability for elderly parents..." | Short Text |
| post_purchase_experience | "Parents love the clock, met expectations ..." | Short Text |
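For concreteness, a single record might be represented as the JSON object below. The four explanation fields use the names from the table; the product and review keys are assumed names for illustration, not confirmed by the paper.

```python
# Sketch of one WorthBuying record. The explanation field names follow the
# table above; "product_title", "product_description", and "review_text"
# are assumed names for the metadata fields.
record = {
    "product_title": "Large-display digital alarm clock",
    "product_description": "Dual-alarm clock with adjustable brightness ...",
    "review_text": "Bought this for my aging parents who struggle to read ...",
    "explicit_purchase_reason": "To help aging parents who need a clock ...",
    "implicit_purchase_reason": "Multiple alarm options (everyday, weekday only, weekend only)",
    "purchase_reason_explanation": "Emphasizes readability and usability for elderly parents ...",
    "post_purchase_experience": "Parents love the clock, met expectations ...",
}
```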
2. Data Generation and Annotation Methodology
Source and Sampling
- Source: Amazon 5-core reviews
- Sample size: 10,000 reviews (kept small to limit LLM computation cost)
LLM-based Generation
- Model: Gemini Ultra (Google)
- Prompting (a template is sketched below):
  - Simultaneous request for purchase reasons and experiences, with clear distinction
  - Explicit prompts for both direct and inferred (implicit) motivations
  - Instructions to avoid hallucination and provide supporting evidence
  - Output enforced in JSON for machine readability
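A generation prompt in this spirit might look like the following sketch; the actual instructions sent to Gemini Ultra are not reproduced here, so the template wording is an assumption.

```python
# Illustrative generation prompt; the exact wording is an assumption modeled
# on the four prompting requirements listed above.
PROMPT_TEMPLATE = """\
Given the product information and the user review below, produce a JSON object
with exactly these keys:
  "explicit_purchase_reason": reason(s) the reviewer states directly,
  "implicit_purchase_reason": motivations inferable from context,
  "purchase_reason_explanation": your rationale, citing supporting evidence,
  "post_purchase_experience": the reviewer's sentiment or experience after purchase.
Keep purchase reasons (pre-purchase) clearly separate from post-purchase
experience, and do not invent anything unsupported by the review.

Product: {product}
Review: {review}
"""

def build_prompt(product: str, review: str) -> str:
    return PROMPT_TEMPLATE.format(product=product, review=review)
```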
Evaluation Protocol
- Automated LLM Rater: Explanations rated by a secondary LLM along four axes:
  - Hallucination
  - Correctness (categorization)
  - Completeness
  - Personalization (specificity)
- Human Annotation: 100 samples double-annotated by humans and compared against the LLM rater, with high agreement on hallucination and correctness; the LLM is stricter on completeness but more generous on personalization
Coverage/Quality:
- Purchase reasons found in 96.4% of reviews (explicit: 61.8%, implicit: 70.9%)
- Post-purchase experiences found in 88.2%
- Personalization rates: 70.1% (reasons), 71.18% (experiences)
- Hallucination <0.5%, Correctness >99%, Completeness ~77%
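The four-axis automated rating could be implemented with a secondary-LLM prompt along these lines; this is a sketch based on the axes listed above, not the paper's actual rater prompt.

```python
# Illustrative rater prompt for the automated evaluation; the wording and the
# yes/no answer format are assumptions based on the four axes described above.
RATER_TEMPLATE = """\
Compare the extracted explanation against the source review and answer in JSON:
  "hallucination": does the extraction claim anything unsupported by the review? (yes/no)
  "correctness": are reasons correctly categorized as explicit vs. implicit? (yes/no)
  "completeness": are all purchase reasons in the review captured? (yes/no)
  "personalization": is the extraction user-specific rather than generic? (yes/no)

Review: {review}
Extraction: {extraction}
"""
```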
3. Benchmark Tasks and Evaluation Metrics
The WorthBuying Dataset supports multiple benchmarking tasks for personalized natural language explanation generation in recommender systems:
Task Definitions
- Task 1: Generate explanation for recommending a specific item (user and item info provided)
- Task 2: Generate explanation including user's rating
- Task 3: Joint prediction of user rating and recommendation explanation
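For Task 3, for instance, a model's joint output could be structured as below; the exact output schema is an assumption.

```python
# Hypothetical Task 3 output: the model predicts the rating and the
# recommendation explanation together. The JSON layout is an assumption.
task3_output = {
    "rating": 5,  # predicted star rating on Amazon's 1-5 scale
    "explanation": (
        "Recommended because the large display and multiple alarm modes "
        "match the user's stated need for a clock their parents can read."
    ),
}
```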
Representations
- User: Past reviews (UserReview) or LLM-summarized profiles (UserProfile)
- Item: Metadata, past reviews (ItemReview), or summary (ItemProfile)
- Review/history truncation: max 10 most recent, max 8k tokens
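The truncation rule could be implemented as in the sketch below; whitespace splitting stands in for the unspecified tokenizer.

```python
# Sketch of the history-truncation rule: keep at most the 10 most recent
# reviews and cap the concatenated history at 8k tokens. Whitespace splitting
# is an assumption standing in for the actual tokenizer.
MAX_REVIEWS = 10
MAX_TOKENS = 8_000

def truncate_history(reviews: list[str]) -> str:
    """reviews is ordered oldest to newest; returns the truncated history."""
    recent = reviews[-MAX_REVIEWS:]      # at most the 10 most recent reviews
    kept: list[str] = []
    for review in reversed(recent):      # fill the token budget newest-first
        words = review.split()
        if len(kept) + len(words) > MAX_TOKENS:
            break
        kept = words + kept              # preserve chronological order
    return " ".join(kept)
```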
Evaluation
Standard NLG metrics:
- BLEU: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $p_n$ is the modified $n$-gram precision, $w_n$ the $n$-gram weight, and $\mathrm{BP}$ the brevity penalty
- ROUGE-1, ROUGE-2, ROUGE-Lsum
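These metrics could be computed with off-the-shelf libraries such as sacrebleu and rouge-score; the paper's exact metric implementations are not specified here, so the library choice is an assumption.

```python
# Scoring sketch using sacrebleu and rouge-score (assumed libraries).
import sacrebleu
from rouge_score import rouge_scorer

def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    # Corpus-level BLEU over all generated explanations.
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeLsum"], use_stemmer=True
    )
    totals = {k: 0.0 for k in ("rouge1", "rouge2", "rougeLsum")}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)  # rouge-score expects (target, prediction)
        for k in totals:
            totals[k] += scores[k].fmeasure
    n = len(predictions)
    # Average F-measures, scaled to the 0-100 range used in the results below.
    return {"BLEU": bleu, **{k: 100 * v / n for k, v in totals.items()}}
```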
4. Experimental Results and Data Characteristics
Overall Performance
- Gemini Ultra (zero-shot) benchmarks:
  - Best representations: User’s raw past reviews + item metadata
  - Purchase Reason: BLEU 6.46, ROUGE-1 22.14, ROUGE-2 8.45, ROUGE-Lsum 20.38
  - Post-Purchase Experience: BLEU 3.66, ROUGE-1 21.35, ROUGE-2 5.34, ROUGE-Lsum 16.21
- Summarizing user reviews attenuates personal detail
- Adding item reviews can introduce noise
- Variations in benchmark task (e.g., rating presence, joint tasks) produce minimal change in zero-shot LLM scores
Category-wise Analysis
- Electronics, Home & Kitchen: Highest metric scores
- Books, Fashion: Lower scores, attributable to more personal language/use cases and less explicit reasoning
- Performance highest when purchase reasoning is strongly reflected in product metadata
5. Downstream Applications and Utility
Demonstrated applications include:
- Recommendation Explanation Generation: Personalized and context-aware justifications for recommendations, both pre- and post-purchase
- Justification of Recommendations: Enables systems to articulate why an item is recommended, preceding the purchase event
- Marketing Analysis: Facilitates granular analysis of purchase intent versus satisfaction, supporting sophisticated segmentation and targeting
- Recommender Systems: Potential for more credible, persuasive recommendation outputs by grounding explanations in both explicit and implicit user motives
- User Modeling: Enables richer, context-dependent profile construction by capturing motivational and experiential information
6. Advances, Limitations, and Research Impact
Advances over Prior Work
- Previous datasets relied on post-purchase or generic sentiment extraction, missing context and personal motivation
- The WorthBuying Dataset is the first large-scale resource to annotate pre-purchase reasons and post-purchase experiences distinctly, including rationale fields
- Enables direct benchmarking of pre-purchase explanation tasks, facilitating research that bridges the gap between generic recommendation rationales and personalized decision drivers
Potential Impact
- Purchase Reason Prediction: Supports development and assessment of models capable of generating fine-grained, user-specific rationales
- Explainable AI: Introduces rigorous criteria for explanation relevance and completeness beyond generic NLG fluency
- Marketing Science: Provides infrastructure to model and understand factors influencing purchase intent, improving campaign efficacy
- Personalization: Advances explainable recommendation research towards contextual, individualized justifications
7. Summary
The WorthBuying Dataset is the first benchmark for systematic, large-scale annotation and modeling of both explicit/implicit purchase reasons and post-purchase experiences. Leveraging high-fidelity LLM generation and evaluation, it enables robust explanation-based benchmarking, recommendation justification, and comprehensive marketing and user-modeling research. By moving explainable recommendation from post-hoc satisfaction explanation to pre-purchase motivational articulation, it marks a significant methodological advance for recommendation science and marketing analytics.