- The paper introduces an explicit customer representation that replaces opaque embeddings with human-readable and robust personas.
- The paper presents GPLR, a framework that uses Diversity-Uncertainty sampling and LLM labeling combined with random walk-based affinity inference to scale persona generation.
- Empirical results demonstrate up to 12% improvements in recommendation accuracy and superior customer segmentation over traditional RFM models.
The paper "You Are What You Bought: Generating Customer Personas for E-commerce Applications" (2504.17304) addresses the need for explicit, human-readable customer representations in e-commerce applications, in contrast to prevalent implicit representations such as embeddings. Implicit representations are difficult to interpret and to integrate with external knowledge, which limits their utility in tasks like segmentation, search, and recommendation. Among explicit alternatives, surveys are time-consuming and product category-based representations are often insufficient, so the paper proposes customer personas as a more informative, readable, and robust explicit representation.
A customer persona is defined as a multi-faceted characterization of a customer's specific purchase behaviors and preferences (e.g., "Bargain Hunters," "Health Enthusiasts"), condensed from their purchase history. Personas are considered more durable than short-term purchase intentions. The goal is to assign multiple relevant personas from a predefined set to each customer, represented by a binary matrix $Y \in \{0,1\}^{|U| \times |R|}$, where $|U|$ is the number of users and $|R|$ is the number of personas.
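As a small illustration of this representation, the sketch below builds a binary user-persona matrix from a dictionary of assignments. The helper name `build_persona_matrix`, the user IDs, and the "New Parents" persona are hypothetical and chosen only for the example; they are not taken from the paper.

```python
def build_persona_matrix(users, personas, assignments):
    """Return Y as a list of rows, where Y[i][m] == 1 iff user i holds persona m."""
    column = {p: m for m, p in enumerate(personas)}
    Y = [[0] * len(personas) for _ in users]
    for i, user in enumerate(users):
        # A user may hold several personas, or none at all.
        for persona in assignments.get(user, ()):
            Y[i][column[persona]] = 1
    return Y


# Hypothetical toy data: three users, three personas.
users = ["u1", "u2", "u3"]
personas = ["Bargain Hunters", "Health Enthusiasts", "New Parents"]
assignments = {
    "u1": ["Bargain Hunters", "Health Enthusiasts"],
    "u3": ["New Parents"],
}
Y = build_persona_matrix(users, personas, assignments)
```

Each row is a multi-hot vector, so a customer can belong to several personas at once, which is what separates this from a single-cluster segmentation label.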
To generate these personas cost-effectively, the authors propose GPLR (Generates customers' Persona representation matrix through leveraging LLMs and Random walk-based affinities). The core challenge is the high cost of using LLMs like GPT-4 to label potentially millions of users based on their purchase histories. GPLR mitigates this by labeling only a small subset of users using LLMs and then inferring personas for the vast majority of unlabeled users based on their proximity to labeled users in the purchase history graph.
The GPLR framework proceeds in iterations:
- DUSample: Selects a small set of users to be labeled by LLMs. It uses a Diversity-Uncertainty (DU) sampling strategy that considers both the diversity of personas in the already labeled set and the uncertainty of the unlabeled users based on their current user-persona affinity scores. This aims to select users who are more likely to reveal less common personas and those whose persona affiliations are ambiguous.
- LLMAnswer: Sends the purchase histories of the sampled users and the predefined persona set R to an LLM (e.g., GPT-4) for labeling. Few-shot learning with prompt engineering is used to guide the LLM in assigning personas based on purchase patterns.
- AffinityCompute: Computes a user-persona affinity matrix $A \in \mathbb{R}^{|U| \times |R|}$ for all users based on the labeled prototype users $U_s$. This step is crucial for inferring personas for unlabeled users. The affinity is computed using a random walk-based approach on the user-item bipartite graph $G$. The exact affinity between user $u_i$ and persona $r_m$ is defined based on the probability of a random walk starting from $u_i$ reaching a prototype user $u_k$ with persona $r_m$, aggregated over all prototype users and all step counts up to $\hat{\ell}$. A de-biasing coefficient is introduced to emphasize minor personas. The core computation involves powers of the user-user transition matrix derived from $G$.
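The DUSample step above can be illustrated with a rough scoring sketch. The summary does not give the paper's actual scoring formula, so everything here is an assumption: uncertainty is taken as the mean binary entropy of a user's affinity scores (assumed to lie in [0, 1]), diversity as affinity weighted toward personas that are rare among already labeled users, and the two are simply added. The names `du_scores` and `du_sample` are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable; 0 at the extremes."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def du_scores(affinity, labeled_counts):
    """Score unlabeled users; this formula is an assumed stand-in, not the paper's.

    affinity: dict user -> list of per-persona scores in [0, 1].
    labeled_counts: number of already labeled users per persona.
    """
    total = sum(labeled_counts)
    # Personas seen less often among labeled users get a larger weight.
    rarity = [1.0 - (c / total if total else 0.0) for c in labeled_counts]
    scores = {}
    for user, row in affinity.items():
        uncertainty = sum(binary_entropy(a) for a in row) / len(row)
        diversity = sum(a * w for a, w in zip(row, rarity)) / len(row)
        scores[user] = uncertainty + diversity
    return scores


def du_sample(affinity, labeled_counts, budget):
    """Pick the `budget` highest-scoring users (ties broken by user id)."""
    scores = du_scores(affinity, labeled_counts)
    return sorted(scores, key=lambda u: (-scores[u], u))[:budget]


# Hypothetical toy data: persona 0 is common among labeled users, persona 1 is rare.
affinity = {"a": [0.5, 0.5], "b": [1.0, 0.0], "c": [0.0, 1.0]}
labeled_counts = [9, 1]
picked = du_sample(affinity, labeled_counts, budget=2)
```

In this toy case the ambiguous user `a` ranks first, followed by `c`, who leans toward the rare persona; user `b`, who clearly holds the common persona, is the least informative to label.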
The exact computation of the affinity matrix $\Pi$ (related to random walk probabilities) is computationally expensive, with a time complexity of $O(|E| \cdot |U| + (\hat{\ell} - 1)|U|^3 + |R| \cdot |U|^2)$, especially for large graphs due to dense matrix multiplications. To address this scalability issue, the paper proposes RevAff, a fast approximation method for AffinityCompute.
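To make the cost of the exact route concrete, here is a minimal dense sketch under stated assumptions: one user-to-user step is taken to be a user-to-item hop followed by an item-to-user hop with uniform transition probabilities, affinities are summed over walk lengths 1 through `max_steps`, and the de-biasing coefficient is omitted. The function names are hypothetical and the paper's exact definition may differ.

```python
def matmul(A, B):
    """Dense matrix product on lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def row_normalize(M):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def user_transition(B):
    """User-user transition matrix for one user -> item -> user step.

    B is the 0/1 user-item purchase matrix of the bipartite graph.
    """
    Bt = [list(col) for col in zip(*B)]
    return matmul(row_normalize(B), row_normalize(Bt))


def exact_affinity(B, Y_proto, max_steps):
    """Sum of P^l @ Y_proto for l = 1..max_steps (no de-biasing applied)."""
    P = user_transition(B)
    walk = P
    A = [[0.0] * len(Y_proto[0]) for _ in B]
    for _ in range(max_steps):
        contrib = matmul(walk, Y_proto)
        A = [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(A, contrib)]
        walk = matmul(walk, P)  # dense |U| x |U| product: the cubic term
    return A


# Toy graph: users u0, u1, u2 and items i0, i1.
B = [[1, 0],
     [1, 1],
     [0, 1]]
# u0 is a prototype for persona 0, u2 for persona 1; u1 is unlabeled.
Y_proto = [[1, 0],
           [0, 0],
           [0, 1]]
A = exact_affinity(B, Y_proto, max_steps=1)
```

The unlabeled user `u1` shares one item with each prototype and ends up with equal affinity to both personas. The repeated `matmul(walk, P)` call is where the dense $|U|^3$ cost arises.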
RevAff computes an $\epsilon$-approximate user-persona affinity matrix $\hat{A}$ using a reverse random walk strategy. Instead of computing the full user-user attention matrix $\Pi$, RevAff iteratively estimates the contributions of prototype users' labels to other nodes in the graph. It uses an asynchronous updating scheme based on the recurrence relation of random walk probabilities. The process continues until the residual mass in the temporary vectors falls below a threshold related to the error tolerance $\epsilon$.
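The general idea of pushing label mass backward from prototype users, rather than multiplying dense matrices, can be sketched as follows. This is not the paper's RevAff: it is a simplified, level-synchronous variant (the paper describes an asynchronous scheme), it handles one persona at a time, and the `threshold` pruning rule is an assumption that carries none of the paper's error guarantees. All names are hypothetical.

```python
from collections import defaultdict


def reverse_push_affinity(user_items, proto_labels, max_steps, threshold=0.0):
    """Approximate affinity of every user to one persona by pushing label mass.

    user_items: dict user -> set of purchased items.
    proto_labels: dict prototype user -> 1.0 if that user holds the persona.
    Residuals at or below `threshold` are not pushed (the approximation).
    """
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    residual = dict(proto_labels)
    estimate = defaultdict(float)
    for _ in range(max_steps):
        # Push from users to items, dividing by the item's degree.
        item_mass = defaultdict(float)
        for user, r in residual.items():
            if r <= threshold:
                continue
            for item in user_items[user]:
                item_mass[item] += r / len(item_users[item])
        # Push from items back to users, dividing by the receiving user's degree.
        next_residual = defaultdict(float)
        for item, mass in item_mass.items():
            for user in item_users[item]:
                next_residual[user] += mass / len(user_items[user])
        for user, r in next_residual.items():
            estimate[user] += r
        residual = next_residual
    return dict(estimate)


# Toy graph: u0 and u1 share item i0; u1 and u2 share item i1.
user_items = {"u0": {"i0"}, "u1": {"i0", "i1"}, "u2": {"i1"}}
est_one = reverse_push_affinity(user_items, {"u0": 1.0}, max_steps=1)
est_two = reverse_push_affinity(user_items, {"u0": 1.0}, max_steps=2)
```

Only nodes that actually receive mass are touched, so the work scales with the edges reached from the prototypes instead of with $|U|^2$. With `threshold=0.0` this sketch reproduces a uniform two-hop random walk exactly; raising the threshold trades accuracy for fewer pushes.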
RevAff provides theoretical guarantees for the approximation error (Theorem 2) and significantly improves the time complexity to $O\left(\frac{1}{\epsilon^2}(N \log N + |E|)\right)$, where $N = |U| + |V|$ (Theorem 3). This makes the computation feasible for large-scale datasets.
The generated persona representations Y are then used for downstream applications. The paper focuses primarily on Product Recommendation and Customer Segmentation.
For product recommendation, the persona-based representation is integrated into existing graph convolution-based collaborative filtering models (like LGCN (2022.01754) and AFDGCF (Agrawal et al., 8 Mar 2024)). This is done by transforming the user-item bipartite graph into a user-item-persona tripartite graph. Persona nodes are added, and edges are created between users and the personas assigned to them (based on Y), and between items and the personas they pertain to (determined by querying an LLM). Graph convolution is then performed on this tripartite graph. This allows persona information to influence the learned user and item embeddings.
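The graph transformation can be sketched as below, assuming simple dictionaries as inputs. The function names and the toy users, items, and persona assignments are hypothetical; in the paper the user-persona edges come from $Y$ and the item-persona edges from an LLM query.

```python
def build_tripartite_edges(user_items, user_personas, item_personas):
    """Return the undirected edge set of a user-item-persona graph.

    Nodes are (kind, name) tuples so that a user and an item with the
    same name cannot collide.
    """
    edges = set()
    for user, items in user_items.items():
        for item in items:
            edges.add((("user", user), ("item", item)))
    for user, personas in user_personas.items():
        for persona in personas:
            edges.add((("user", user), ("persona", persona)))
    for item, personas in item_personas.items():
        for persona in personas:
            edges.add((("item", item), ("persona", persona)))
    return edges


def neighbors(edges):
    """Adjacency sets for an undirected edge set."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Hypothetical toy data.
user_items = {"u1": {"organic oats"}}
user_personas = {"u1": {"Health Enthusiasts"}}
item_personas = {"protein bar": {"Health Enthusiasts"}}

adj = neighbors(build_tripartite_edges(user_items, user_personas, item_personas))
```

Here `u1` never bought the protein bar, yet the two are now two hops apart through the shared persona node. That shorter path is what lets graph convolution carry persona information into both user and item embeddings.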
The experimental evaluation, conducted on three real-world e-commerce datasets (OnlineRetail, Instacart, Instacart Full), demonstrates the effectiveness and efficiency of the proposed methods:
- Recommendation Performance (RQ1): Integrating personas into LGCN (creating LGCN3) and AFDGCF (creating A-LGCN3) significantly improves recommendation accuracy (NDCG@K and F1-Score@K) compared to baseline methods, including the original LGCN and AFDGCF models. Improvements of up to 12% in NDCG@K and F1-Score@K are observed. LGCN3 and A-LGCN3 also show better scalability than some large-scale GCN models on the largest dataset. A case study illustrates how personas help recommend less popular items that align with specific user preferences (e.g., baby food), reducing bias towards popular items. Personas are also shown to be more effective than simpler category-based representations.
- Ablation Studies (RQ2): The performance of LGCN3 is relatively insensitive to the choice of LLM (GPT-4 vs. Llama-3) and surprisingly robust even with a low LLM sampling budget (as low as 5% of users). This indicates that the random walk-based inference effectively propagates persona information. The random walk length $\hat{\ell} = 1$ performed slightly better than $\hat{\ell} = 2$ in experiments, possibly due to diffusion effects diminishing signal distinctiveness at longer steps.
- Customer Segmentation (RQ3): Personas are evaluated for customer segmentation quality. They demonstrate superior robustness over time compared to the traditional RFM (Recency, Frequency, Monetary) model. Clustering users based on persona representations yields significantly better Silhouette Scores than clustering based on RFM, indicating higher cluster quality.
- Approximate Solution Evaluation (RQ4): RevAff achieves substantial speedups (more than 5x faster) on the large Instacart Full dataset compared to the exact random walk computation, while maintaining low empirical error, well below the theoretical tolerance ϵ.
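For readers unfamiliar with the ranking metrics named under RQ1, the sketch below shows one common way to compute NDCG@K and F1-Score@K with binary relevance. The paper's exact metric definitions are not given in this summary, so treat the formulas as the standard textbook versions rather than the authors' evaluation code.

```python
import math


def ndcg_at_k(ranked, relevant, k):
    """NDCG@K with binary relevance and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ranking places all relevant items first.
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of precision@K and recall@K."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy example: two relevant items, one of them ranked third.
ranked = ["a", "b", "c"]
relevant = {"a", "c"}
ndcg = ndcg_at_k(ranked, relevant, k=3)
f1 = f1_at_k(ranked, relevant, k=3)
```

NDCG rewards placing relevant items near the top of the list, while F1 only counts how many relevant items appear within the top K.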
In summary, the paper introduces a novel explicit customer representation using personas, proposes a cost-effective method (GPLR) combining LLMs and random walks for persona generation, and develops an efficient approximation algorithm (RevAff) for large-scale affinity computation. The persona-based representations are shown to effectively enhance product recommendation accuracy and customer segmentation quality in real-world e-commerce scenarios. Future work includes exploring other e-commerce applications and user interaction types.