Alibaba's Personalized Re-ranking Dataset
- Alibaba's Personalized Re-ranking Dataset is a large-scale, session-based resource capturing explicit search queries, browsing, and purchase events for personalized e-commerce ranking evaluation.
- The dataset includes over 570K sessions with detailed interaction logs, including 1.1M click and over 18K purchase records, enabling robust analysis with metrics such as CTR, AP, and MAP.
- Integrating statistical, query-item, and session features, the resource supports advanced ensemble modeling approaches, including stacking base learners and meta-learners to boost ranking accuracy.
Alibaba's Personalized Re-ranking Dataset is a large-scale, session-based resource constructed to facilitate the research and evaluation of personalized search and ranking models in e-commerce environments. Designed to benchmark and develop methods for the CIKM Cup 2016 Personalized E-Commerce Search Challenge, it captures search, browsing, and purchase events at the interaction level, supporting the construction of state-of-the-art ranking systems using advanced machine learning and ensemble strategies (Wu et al., 2017).
1. Dataset Schema and Distributions
The dataset comprises comprehensive logs collected over a five-month window (January 1 – June 1, 2016), systematically segmented into train and test partitions. It aggregates:
- 53,427 query-full logs (explicit search queries)
- 869,700 query-less logs (category browsing without a query)
- 573,935 total sessions
- 130,987 unique presented products (itemID)
- 1,127,764 click records; 1,235,380 view/browse records; 18,025 purchase records
- 232,817 real users (with userID), 333,097 anonymous users
Each session is specified by sessionID and contains sequential user interactions. The core data files and their essential columns include:
| File | Key Columns | Event Type |
|---|---|---|
| train-queries.csv | userID (nullable), sessionID, timestamp, queryID (or categoryID), query tokens, presented itemID list, SERP positions | Query/impression |
| clicks/views/purchases.csv | userID, sessionID, itemID, timestamp, (purchase_amount for purchases) | Actions |
| products.csv | itemID, categoryID, description (tokenized), price | Item catalog |
The dataset is highly sparse: the mean word overlap between queries and product descriptions is only ~1.8%. Activity follows long-tailed distributions—over 80% of items record fewer than five clicks, approximately half of users issue fewer than three queries in the train split, and session lengths have mean ≈5, median 3, with a tail extending to 50+.
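As an illustration of these distributional checks, the sketch below computes the long-tail statistics from the click log with pandas. The file name `train-clicks.csv` and the delimiter are assumptions based on the schema table above; adjust them to match the actual release.

```python
import pandas as pd

# Click log per the schema above; file name and delimiter are assumed
# (pass sep=";" or similar if the release differs).
clicks = pd.read_csv("train-clicks.csv")

# Long-tail check: share of items with fewer than five clicks
# (expected to exceed 80% per the statistics above).
clicks_per_item = clicks.groupby("itemID").size()
print("items with <5 clicks:", (clicks_per_item < 5).mean())

# Session-length proxy from click events alone; full session length
# would also count view and purchase events.
session_len = clicks.groupby("sessionID").size()
print("mean:", session_len.mean(), "median:", session_len.median())
```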
2. Feature Construction
Feature engineering is stratified into three main groups: statistic features, query–item features, and session features.
Statistic Features are derived at both global and local (time-windowed) granularities. Important constructions include:
- Per-item click-through rate (CTR):

$$\mathrm{CTR}_i = \frac{\#\,\text{clicks}_i}{\#\,\text{impressions}_i}$$

- View rate (VR) and conversion rate (CVR):

$$\mathrm{VR}_i = \frac{\#\,\text{views}_i}{\#\,\text{impressions}_i}, \qquad \mathrm{CVR}_i = \frac{\#\,\text{purchases}_i}{\#\,\text{clicks}_i}$$

- Time-localized statistics are defined as event counts within $L$ partitions of the full logging period, capturing drift in item popularity. Price-normalized behaviors are incorporated by scaling these per-item statistics by item price (a sketch of the per-item statistics follows this list).
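A minimal sketch of the global per-item statistics, assuming the event logs are loaded as pandas DataFrames keyed by `itemID`; the additive smoothing constant `alpha` is this sketch's choice, not part of the dataset specification.

```python
import pandas as pd

def item_statistics(impressions, clicks, views, purchases, alpha=1.0):
    """Global per-item CTR, VR, and CVR from event-count DataFrames.
    `impressions` holds one row per presented (query, itemID) pair;
    `alpha` smooths away divisions by zero for rarely shown items."""
    n_imp = impressions.groupby("itemID").size().rename("n_imp")
    n_clk = clicks.groupby("itemID").size().rename("n_clk")
    n_view = views.groupby("itemID").size().rename("n_view")
    n_pur = purchases.groupby("itemID").size().rename("n_pur")
    stats = pd.concat([n_imp, n_clk, n_view, n_pur], axis=1).fillna(0)
    stats["ctr"] = stats["n_clk"] / (stats["n_imp"] + alpha)
    stats["vr"] = stats["n_view"] / (stats["n_imp"] + alpha)
    stats["cvr"] = stats["n_pur"] / (stats["n_clk"] + alpha)
    return stats
```

Time-localized variants follow by filtering each DataFrame to one of the $L$ time partitions before calling the function.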
Query–Item Features leverage both exact token indicators and embedding-based semantic matching:
- Category-based binary token indicators, $b_{c,w} \in \{0, 1\}$, denote the presence of token $w$ in item descriptions within category $c$.
- Cross-token features, $x_{w_q, w_d}$, represent the co-occurrence of a specific query word $w_q$ and an item-description word $w_d$.
- For semantic matching, each word $w$ is embedded as a vector $\mathbf{e}_w$, and the pairwise match score between query $q$ and description $d$ aggregates word-level similarities:

$$s(q, d) = \frac{1}{|q|\,|d|} \sum_{w_q \in q} \sum_{w_d \in d} \cos\!\left(\mathbf{e}_{w_q}, \mathbf{e}_{w_d}\right)$$
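One plausible realization of the match score above, assuming per-token embeddings (e.g., from word2vec trained on the tokenized product descriptions) are available as NumPy arrays:

```python
import numpy as np

def pairwise_match_score(query_vecs, desc_vecs, eps=1e-9):
    """Mean pairwise cosine similarity between query-token embeddings
    (shape [|q|, dim]) and description-token embeddings ([|d|, dim])."""
    q = query_vecs / (np.linalg.norm(query_vecs, axis=1, keepdims=True) + eps)
    d = desc_vecs / (np.linalg.norm(desc_vecs, axis=1, keepdims=True) + eps)
    return float((q @ d.T).mean())   # averages all |q| x |d| cosine scores
```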
Session Features are constructed around user interaction recency and repetition:
- Repeat-item indicators capture in-session re-click/view behaviors.
- Recency windows count event occurrences for item $i$ within a trailing $\tau$-minute window.
- Dwell time is the elapsed time between entering and leaving an item impression (see the sketch below).
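A sketch of the three session features for one (session, item) pair, assuming a session's events arrive as a timestamp-sorted DataFrame with `itemID` and millisecond `timestamp` columns; the 30-minute window default is illustrative.

```python
import pandas as pd

def session_features(events, item_id, now_ms, window_min=30):
    """Repeat, recency, and dwell-time features for one item.
    `events` is one session's log sorted by timestamp (ms)."""
    mask = events["itemID"] == item_id
    repeat = int(mask.sum() > 1)  # in-session re-click/view indicator
    cutoff = now_ms - window_min * 60_000
    recency = int((events.loc[mask, "timestamp"] >= cutoff).sum())
    # Dwell time approximated by the gap to the next event in the session.
    gaps = events["timestamp"].shift(-1) - events["timestamp"]
    dwell = gaps[mask].mean()
    dwell_ms = float(dwell) if pd.notna(dwell) else 0.0
    return {"repeat": repeat, "recency": recency, "dwell_ms": dwell_ms}
```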
3. Model Classes and Associated Losses
Multiple model families are applicable, each exploiting different granularities of supervision:
- Logistic Regression (pointwise): Models the click probability of each (user, query, item) triple as $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x})$, trained with cross-entropy loss:

$$\mathcal{L}_{\mathrm{LR}} = -\sum_{n} \left[\, y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \,\right]$$

- Gradient Boosted Decision Trees (GBDT): Learns additive regression trees $F(\mathbf{x}) = \sum_k f_k(\mathbf{x})$ via incremental function optimization with regularization on leaf weights:

$$\mathcal{L}_{\mathrm{GBDT}} = \sum_{n} \ell\!\left(y_n, F(\mathbf{x}_n)\right) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert \mathbf{w} \rVert^2$$

where $T$ is the number of leaves and $\mathbf{w}$ the leaf weights.
- RankSVM (pairwise): Learns ranking via hinge loss over relevant/irrelevant pairs:

$$\mathcal{L}_{\mathrm{SVM}} = \sum_{(i^{+},\, i^{-})} \max\!\left(0,\, 1 - \mathbf{w}^\top (\mathbf{x}_{i^{+}} - \mathbf{x}_{i^{-}})\right) + \lambda \lVert \mathbf{w} \rVert^2$$

- Deep Match Model (DMM): Utilizes a two-tower embedding architecture producing a query vector $\mathbf{u}_q$ and an item vector $\mathbf{v}_i$; the final score is cosine similarity or a bilinear form with matrix $\mathbf{M}$:

$$s(q, i) = \cos(\mathbf{u}_q, \mathbf{v}_i) \quad \text{or} \quad s(q, i) = \mathbf{u}_q^\top \mathbf{M}\, \mathbf{v}_i$$

trained with a pairwise hinge loss between clicked ($i^{+}$) and non-clicked ($i^{-}$) item pairs:

$$\mathcal{L}_{\mathrm{DMM}} = \sum_{(i^{+},\, i^{-})} \max\!\left(0,\, \epsilon - s(q, i^{+}) + s(q, i^{-})\right)$$

A runnable sketch of the pairwise hinge objective follows below.
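To make the pairwise objectives concrete, here is a NumPy sketch of the hinge loss and its subgradient for a linear scorer (RankSVM-style). Rows of `X_pos` and `X_neg` are feature vectors of clicked and non-clicked items from the same query, aligned so row $i$ forms one pair; `lam` is an illustrative regularization strength.

```python
import numpy as np

def pairwise_hinge_loss(w, X_pos, X_neg, lam=0.01):
    """Hinge loss over aligned clicked/non-clicked feature pairs."""
    margins = 1.0 - (X_pos - X_neg) @ w
    return float(np.maximum(0.0, margins).sum() + lam * (w @ w))

def pairwise_hinge_subgradient(w, X_pos, X_neg, lam=0.01):
    """Subgradient for (stochastic) descent on the loss above."""
    diff = X_pos - X_neg
    violating = (1.0 - diff @ w) > 0     # pairs inside the margin
    return -diff[violating].sum(axis=0) + 2.0 * lam * w
```

The DMM uses the same pairwise hinge form, with $s(q, i)$ produced by the two towers instead of a linear scorer.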
4. Ensemble Framework: Stacking and Out-of-Fold Blending
Model performance is enhanced through a stacking ensemble architecture.
- Base Learners: LR, RankSVM, DMM, and GBDT are trained using the full feature suite. Training queries are split into $K$ folds; each base model is trained on $K-1$ folds, and its predictions on the held-out fold are collected.
- Meta-Learner: The out-of-fold scores for each training instance across all models form a feature vector input to a top-level learner (typically GBDT or LR), which is trained to predict the target.
- Inference Procedure: For each new (user, query, item) triple, base model outputs are computed and concatenated, and the meta-learner $g$ produces the final re-ranking score (a sketch follows below):

$$\hat{s} = g\!\left(s_{\mathrm{LR}},\, s_{\mathrm{SVM}},\, s_{\mathrm{DMM}},\, s_{\mathrm{GBDT}}\right)$$
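A compact sketch of the out-of-fold blending scheme with scikit-learn. Grouping folds by queryID and the particular base/meta learners shown are this sketch's choices; in the full pipeline, RankSVM and DMM scores would join the out-of-fold matrix in the same way.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def stack_oof(base_models, X, y, query_ids, n_folds=5):
    """Out-of-fold stacking: each base model scores only the fold it
    never saw; GroupKFold keeps every query's items in one fold."""
    oof = np.zeros((len(y), len(base_models)))
    for tr, te in GroupKFold(n_splits=n_folds).split(X, y, query_ids):
        for j, proto in enumerate(base_models):
            model = clone(proto).fit(X[tr], y[tr])
            oof[te, j] = model.predict_proba(X[te])[:, 1]
    meta = GradientBoostingClassifier().fit(oof, y)     # meta-learner g
    full = [clone(m).fit(X, y) for m in base_models]    # refit for inference
    return meta, full

def rerank_scores(meta, full, X_new):
    """Concatenate base-model outputs, then apply the meta-learner."""
    feats = np.column_stack([m.predict_proba(X_new)[:, 1] for m in full])
    return meta.predict_proba(feats)[:, 1]

# Usage: base = [LogisticRegression(max_iter=1000), GradientBoostingClassifier()]
#        meta, full = stack_oof(base, X, y, query_ids)
```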
5. Evaluation Methodology and Validation Practices
Assessment within this paradigm uses standard IR and ranking metrics (a reference implementation sketch follows this list):
- Precision@K:

$$P@K = \frac{\left|\{\text{relevant items in top } K\}\right|}{K}$$

- Average Precision (AP) and Mean Average Precision (MAP), where $\mathcal{R}$ is the set of relevant items and $\mathrm{rel}(k) \in \{0, 1\}$ marks relevance at rank $k$:

$$\mathrm{AP} = \frac{1}{|\mathcal{R}|} \sum_{k=1}^{N} P@k \cdot \mathrm{rel}(k), \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}_q$$

- Discounted Cumulative Gain and Normalized DCG@K:

$$\mathrm{DCG}@K = \sum_{k=1}^{K} \frac{2^{\mathrm{rel}_k} - 1}{\log_2(k + 1)}, \qquad \mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K}$$
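A reference sketch of AP and NDCG@K for a single ranked list, with binary relevance (e.g., derived from clicks); MAP then averages AP over queries, as noted at the end of the block.

```python
import numpy as np

def average_precision(relevance):
    """AP over one ranked list; `relevance` is 0/1 in ranked order."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((prec_at_k * rel).sum() / rel.sum())

def ndcg_at_k(relevance, k):
    """NDCG@K with the 2^rel - 1 gain from the definition above."""
    rel = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, k + 2))
    dcg = ((2 ** rel[:k] - 1) / discounts[: len(rel[:k])]).sum()
    ideal = np.sort(rel)[::-1][:k]
    idcg = ((2 ** ideal - 1) / discounts[: len(ideal)]).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0

# MAP is the mean of per-query AP values:
# map_score = np.mean([average_precision(r) for r in per_query_rankings])
```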
Validation employs time-based splits: training on January–April, validating on May, and testing on June logs to prevent temporal leakage. Sessions are kept intact within folds, and per-category/user coverage is monitored to avoid data skew.
6. Practical Recommendations for Large-Scale Personalized Ranking
Feature engineering should combine global and local statistics to account for both persistent and transient patterns. Session-level features (e.g., repeat and recency indicators) and semantic representations via embedding models (DMM) are vital to overcome lexical sparsity and encode user intent. Model blending across diverse algorithms (pointwise, pairwise, trees, deep nets) via out-of-fold stacking is essential to harness heterogeneous strengths while minimizing overfitting risks.
Operational best practices encompass:
- Pre-computing and caching per-item statistics (CTR, CVR)
- Sparse representation and hashing for high-dimensional cross-token spaces (see the hashing sketch after this list)
- Parallelized computation over sessions/users
- Incremental updates of statistics and learned embeddings to track real-time dynamics
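For the hashing recommendation, a sketch with scikit-learn's `FeatureHasher`, which maps the quadratic cross-token space into a fixed-width sparse matrix without an explicit vocabulary; the 2^20 width and the toy token pairs are illustrative.

```python
from sklearn.feature_extraction import FeatureHasher

# Toy (query tokens, description tokens) pairs; in practice these come
# from train-queries.csv joined with products.csv.
pairs = [(["red", "shoe"], ["red", "sneaker", "shoe"]),
         (["phone", "case"], ["leather", "case"])]

# Hash each query-word x description-word combination to a fixed index.
hasher = FeatureHasher(n_features=2**20, input_type="string")
X_cross = hasher.transform(
    [f"{qw}_{dw}" for qw in q for dw in d] for q, d in pairs
)
print(X_cross.shape)  # (2, 1048576) sparse CSR matrix
```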
By adhering to these principles, from foundational data processing and robust feature design through advanced model formulation, carefully engineered ensembles, and rigorous temporal evaluation, researchers can develop personalized re-ranking systems that match Alibaba's operational scale and rigor (Wu et al., 2017).