Alibaba's Personalized Re-ranking Dataset
- Alibaba's Personalized Re-ranking Dataset is a large-scale, session-based resource capturing explicit search queries, browsing, and purchase events for personalized e-commerce ranking evaluation.
- The dataset includes over 570K sessions with detailed interaction logs, including 1.1M click and over 18K purchase records, enabling robust analysis with metrics such as CTR, AP, and MAP.
- Integrating statistical, query-item, and session features, the resource supports advanced ensemble modeling approaches, including stacking base learners and meta-learners to boost ranking accuracy.
Alibaba's Personalized Re-ranking Dataset is a large-scale, session-based resource constructed to facilitate the research and evaluation of personalized search and ranking models in e-commerce environments. Designed to benchmark and develop methods for the CIKM Cup 2016 Personalized E-Commerce Search Challenge, it captures search, browsing, and purchase events at the interaction level, supporting the construction of state-of-the-art ranking systems using advanced machine learning and ensemble strategies (Wu et al., 2017).
1. Dataset Schema and Distributions
The dataset comprises comprehensive logs collected over a five-month window (January 1 – June 1, 2016), systematically segmented into train and test partitions. It aggregates:
- 53,427 query-full logs (explicit search queries)
- 869,700 query-less logs (category browsing without a query)
- 573,935 total sessions
- 130,987 unique presented products (itemID)
- 1,127,764 click records; 1,235,380 view/browse records; 18,025 purchase records
- 232,817 real users (with userID), 333,097 anonymous users
Each session is specified by sessionID and contains sequential user interactions. The core data files and their essential columns include:
| File | Key Columns | Event Type |
|---|---|---|
| train-queries.csv | userID (nullable), sessionID, timestamp, queryID (or categoryID), query tokens, presented itemID list, SERP positions | Query/impression |
| clicks/views/purchases.csv | userID, sessionID, itemID, timestamp, (purchase_amount for purchases) | Actions |
| products.csv | itemID, categoryID, description (tokenized), price | Item catalog |
The dataset is highly sparse: the mean word overlap between queries and product descriptions is only ~1.8%. Activity follows long-tailed distributions—over 80% of items record fewer than five clicks, approximately half of users issue fewer than three queries in the train split, and session lengths have mean ≈5, median 3, with a tail extending to 50+.
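As an illustration of these distributional checks, the sketch below computes the long-tail statistics from the click log with pandas. The file name `train-clicks.csv` and the delimiter are assumptions based on the schema table above; adjust them to match the actual release.

```python
import pandas as pd

# Click log per the schema above; file name and delimiter are assumed
# (pass sep=";" or similar if the release differs).
clicks = pd.read_csv("train-clicks.csv")

# Long-tail check: share of items with fewer than five clicks
# (expected to exceed 80% per the statistics above).
clicks_per_item = clicks.groupby("itemID").size()
print("items with <5 clicks:", (clicks_per_item < 5).mean())

# Session-length proxy from click events alone; full session length
# would also count view and purchase events.
session_len = clicks.groupby("sessionID").size()
print("mean:", session_len.mean(), "median:", session_len.median())
```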
2. Feature Construction
Feature engineering is stratified into three main groups: statistic features, query–item features, and session features.
Statistic Features are derived at both global and local (time-windowed) granularities. Important constructions include:
- Per-item click-through rate (CTR):

$$\mathrm{CTR}_i = \frac{\#\,\text{clicks}_i}{\#\,\text{impressions}_i}$$

- View rate (VR) and conversion rate (CVR):

$$\mathrm{VR}_i = \frac{\#\,\text{views}_i}{\#\,\text{impressions}_i}, \qquad \mathrm{CVR}_i = \frac{\#\,\text{purchases}_i}{\#\,\text{clicks}_i}$$

- Time-localized statistics are defined as event counts within $L$ partitions of the full logging period, capturing drift in item popularity. Price-normalized behaviors are incorporated by scaling these per-item statistics by item price (a sketch of the per-item statistics follows this list).
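A minimal sketch of the global per-item statistics, assuming the event logs are loaded as pandas DataFrames keyed by `itemID`; the additive smoothing constant `alpha` is this sketch's choice, not part of the dataset specification.

```python
import pandas as pd

def item_statistics(impressions, clicks, views, purchases, alpha=1.0):
    """Global per-item CTR, VR, and CVR from event-count DataFrames.
    `impressions` holds one row per presented (query, itemID) pair;
    `alpha` smooths away divisions by zero for rarely shown items."""
    n_imp = impressions.groupby("itemID").size().rename("n_imp")
    n_clk = clicks.groupby("itemID").size().rename("n_clk")
    n_view = views.groupby("itemID").size().rename("n_view")
    n_pur = purchases.groupby("itemID").size().rename("n_pur")
    stats = pd.concat([n_imp, n_clk, n_view, n_pur], axis=1).fillna(0)
    stats["ctr"] = stats["n_clk"] / (stats["n_imp"] + alpha)
    stats["vr"] = stats["n_view"] / (stats["n_imp"] + alpha)
    stats["cvr"] = stats["n_pur"] / (stats["n_clk"] + alpha)
    return stats
```

Time-localized variants follow by filtering each DataFrame to one of the $L$ time partitions before calling the function.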
Query–Item Features leverage both exact token indicators and embedding-based semantic matching:
- Category-based binary token indicators, $b_{c,w} \in \{0, 1\}$, denote the presence of token $w$ in item descriptions within category $c$.
- Cross-token features, $x_{w_q, w_d}$, represent the co-occurrence of a specific query word $w_q$ and an item-description word $w_d$.
- For semantic matching, each word $w$ is embedded as a vector $\mathbf{e}_w$, and the pairwise match score between query $q$ and description $d$ aggregates word-level similarities:

$$s(q, d) = \frac{1}{|q|\,|d|} \sum_{w_q \in q} \sum_{w_d \in d} \cos\!\left(\mathbf{e}_{w_q}, \mathbf{e}_{w_d}\right)$$
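One plausible realization of the match score above, assuming per-token embeddings (e.g., from word2vec trained on the tokenized product descriptions) are available as NumPy arrays:

```python
import numpy as np

def pairwise_match_score(query_vecs, desc_vecs, eps=1e-9):
    """Mean pairwise cosine similarity between query-token embeddings
    (shape [|q|, dim]) and description-token embeddings ([|d|, dim])."""
    q = query_vecs / (np.linalg.norm(query_vecs, axis=1, keepdims=True) + eps)
    d = desc_vecs / (np.linalg.norm(desc_vecs, axis=1, keepdims=True) + eps)
    return float((q @ d.T).mean())   # averages all |q| x |d| cosine scores
```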
Session Features are constructed around user interaction recency and repetition:
- Repeat-item indicators capture in-session re-click/view behaviors.
- Recency windows count event occurrences for item $i$ within a trailing $\tau$-minute window.
- Dwell time is the elapsed time between entering and leaving an item impression (see the sketch below).
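A sketch of the three session features for one (session, item) pair, assuming a session's events arrive as a timestamp-sorted DataFrame with `itemID` and millisecond `timestamp` columns; the 30-minute window default is illustrative.

```python
import pandas as pd

def session_features(events, item_id, now_ms, window_min=30):
    """Repeat, recency, and dwell-time features for one item.
    `events` is one session's log sorted by timestamp (ms)."""
    mask = events["itemID"] == item_id
    repeat = int(mask.sum() > 1)  # in-session re-click/view indicator
    cutoff = now_ms - window_min * 60_000
    recency = int((events.loc[mask, "timestamp"] >= cutoff).sum())
    # Dwell time approximated by the gap to the next event in the session.
    gaps = events["timestamp"].shift(-1) - events["timestamp"]
    dwell = gaps[mask].mean()
    dwell_ms = float(dwell) if pd.notna(dwell) else 0.0
    return {"repeat": repeat, "recency": recency, "dwell_ms": dwell_ms}
```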
3. Model Classes and Associated Losses
Multiple model families are applicable, each exploiting different granularities of supervision:
- Logistic Regression (pointwise): Models the click probability of each (user, query, item) triple as $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x})$, trained with cross-entropy loss:

$$\mathcal{L}_{\mathrm{LR}} = -\sum_{n} \left[\, y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \,\right]$$

- Gradient Boosted Decision Trees (GBDT): Learns additive regression trees $F(\mathbf{x}) = \sum_k f_k(\mathbf{x})$ via incremental function optimization with regularization on leaf weights:

$$\mathcal{L}_{\mathrm{GBDT}} = \sum_{n} \ell\!\left(y_n, F(\mathbf{x}_n)\right) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert \mathbf{w} \rVert^2$$

where $T$ is the number of leaves and $\mathbf{w}$ the leaf weights.
- RankSVM (pairwise): Learns ranking via hinge loss over relevant/irrelevant pairs:

$$\mathcal{L}_{\mathrm{SVM}} = \sum_{(i^{+},\, i^{-})} \max\!\left(0,\, 1 - \mathbf{w}^\top (\mathbf{x}_{i^{+}} - \mathbf{x}_{i^{-}})\right) + \lambda \lVert \mathbf{w} \rVert^2$$

- Deep Match Model (DMM): Utilizes a two-tower embedding architecture producing a query vector $\mathbf{u}_q$ and an item vector $\mathbf{v}_i$; the final score is cosine similarity or a bilinear form with matrix $\mathbf{M}$:

$$s(q, i) = \cos(\mathbf{u}_q, \mathbf{v}_i) \quad \text{or} \quad s(q, i) = \mathbf{u}_q^\top \mathbf{M}\, \mathbf{v}_i$$

trained with a pairwise hinge loss between clicked ($i^{+}$) and non-clicked ($i^{-}$) item pairs:

$$\mathcal{L}_{\mathrm{DMM}} = \sum_{(i^{+},\, i^{-})} \max\!\left(0,\, \epsilon - s(q, i^{+}) + s(q, i^{-})\right)$$

A runnable sketch of the pairwise hinge objective follows below.
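To make the pairwise objectives concrete, here is a NumPy sketch of the hinge loss and its subgradient for a linear scorer (RankSVM-style). Rows of `X_pos` and `X_neg` are feature vectors of clicked and non-clicked items from the same query, aligned so row $i$ forms one pair; `lam` is an illustrative regularization strength.

```python
import numpy as np

def pairwise_hinge_loss(w, X_pos, X_neg, lam=0.01):
    """Hinge loss over aligned clicked/non-clicked feature pairs."""
    margins = 1.0 - (X_pos - X_neg) @ w
    return float(np.maximum(0.0, margins).sum() + lam * (w @ w))

def pairwise_hinge_subgradient(w, X_pos, X_neg, lam=0.01):
    """Subgradient for (stochastic) descent on the loss above."""
    diff = X_pos - X_neg
    violating = (1.0 - diff @ w) > 0     # pairs inside the margin
    return -diff[violating].sum(axis=0) + 2.0 * lam * w
```

The DMM uses the same pairwise hinge form, with $s(q, i)$ produced by the two towers instead of a linear scorer.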
4. Ensemble Framework: Stacking and Out-of-Fold Blending
Model performance is enhanced through a stacking ensemble architecture.
- Base Learners: LR, RankSVM, DMM, and GBDT are trained using the full feature suite. Training queries are split into $K$ folds; each base model is trained on $K-1$ folds, and its predictions on the held-out fold are collected.
- Meta-Learner: The out-of-fold scores for each training instance across all models form a feature vector input to a top-level learner (typically GBDT or LR), which is trained to predict the target.
- Inference Procedure: For each new (user, query, item) triple, base model outputs are computed and concatenated, and the meta-learner $g$ produces the final re-ranking score (a sketch follows below):

$$\hat{s} = g\!\left(s_{\mathrm{LR}},\, s_{\mathrm{SVM}},\, s_{\mathrm{DMM}},\, s_{\mathrm{GBDT}}\right)$$
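A compact sketch of the out-of-fold blending scheme with scikit-learn. Grouping folds by queryID and the particular base/meta learners shown are this sketch's choices; in the full pipeline, RankSVM and DMM scores would join the out-of-fold matrix in the same way.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def stack_oof(base_models, X, y, query_ids, n_folds=5):
    """Out-of-fold stacking: each base model scores only the fold it
    never saw; GroupKFold keeps every query's items in one fold."""
    oof = np.zeros((len(y), len(base_models)))
    for tr, te in GroupKFold(n_splits=n_folds).split(X, y, query_ids):
        for j, proto in enumerate(base_models):
            model = clone(proto).fit(X[tr], y[tr])
            oof[te, j] = model.predict_proba(X[te])[:, 1]
    meta = GradientBoostingClassifier().fit(oof, y)     # meta-learner g
    full = [clone(m).fit(X, y) for m in base_models]    # refit for inference
    return meta, full

def rerank_scores(meta, full, X_new):
    """Concatenate base-model outputs, then apply the meta-learner."""
    feats = np.column_stack([m.predict_proba(X_new)[:, 1] for m in full])
    return meta.predict_proba(feats)[:, 1]

# Usage: base = [LogisticRegression(max_iter=1000), GradientBoostingClassifier()]
#        meta, full = stack_oof(base, X, y, query_ids)
```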
5. Evaluation Methodology and Validation Practices
Assessment within this paradigm uses standard IR and ranking metrics (a reference implementation sketch follows this list):
- Precision@K:

$$P@K = \frac{\left|\{\text{relevant items in top } K\}\right|}{K}$$

- Average Precision (AP) and Mean Average Precision (MAP), where $\mathcal{R}$ is the set of relevant items and $\mathrm{rel}(k) \in \{0, 1\}$ marks relevance at rank $k$:

$$\mathrm{AP} = \frac{1}{|\mathcal{R}|} \sum_{k=1}^{N} P@k \cdot \mathrm{rel}(k), \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}_q$$

- Discounted Cumulative Gain and Normalized DCG@K:

$$\mathrm{DCG}@K = \sum_{k=1}^{K} \frac{2^{\mathrm{rel}_k} - 1}{\log_2(k + 1)}, \qquad \mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K}$$
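A reference sketch of AP and NDCG@K for a single ranked list, with binary relevance (e.g., derived from clicks); MAP then averages AP over queries, as noted at the end of the block.

```python
import numpy as np

def average_precision(relevance):
    """AP over one ranked list; `relevance` is 0/1 in ranked order."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((prec_at_k * rel).sum() / rel.sum())

def ndcg_at_k(relevance, k):
    """NDCG@K with the 2^rel - 1 gain from the definition above."""
    rel = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, k + 2))
    dcg = ((2 ** rel[:k] - 1) / discounts[: len(rel[:k])]).sum()
    ideal = np.sort(rel)[::-1][:k]
    idcg = ((2 ** ideal - 1) / discounts[: len(ideal)]).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0

# MAP is the mean of per-query AP values:
# map_score = np.mean([average_precision(r) for r in per_query_rankings])
```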
Validation employs time-based splits: training on January–April, validating on May, and testing on June logs to prevent temporal leakage. Sessions are kept intact within folds, and per-category/user coverage is monitored to avoid data skew.
6. Practical Recommendations for Large-Scale Personalized Ranking
Feature engineering should combine global and local statistics to account for both persistent and transient patterns. Session-level features (e.g., repeat and recency indicators) and semantic representations via embedding models (DMM) are vital to overcome lexical sparsity and encode user intent. Model blending across diverse algorithms (pointwise, pairwise, trees, deep nets) via out-of-fold stacking is essential to harness heterogeneous strengths while minimizing overfitting risks.
Operational best practices encompass:
- Pre-computing and caching per-item statistics (CTR, CVR)
- Sparse representation and hashing for high-dimensional cross-token spaces (see the hashing sketch after this list)
- Parallelized computation over sessions/users
- Incremental updates of statistics and learned embeddings to track real-time dynamics
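For the hashing recommendation, a sketch with scikit-learn's `FeatureHasher`, which maps the quadratic cross-token space into a fixed-width sparse matrix without an explicit vocabulary; the 2^20 width and the toy token pairs are illustrative.

```python
from sklearn.feature_extraction import FeatureHasher

# Toy (query tokens, description tokens) pairs; in practice these come
# from train-queries.csv joined with products.csv.
pairs = [(["red", "shoe"], ["red", "sneaker", "shoe"]),
         (["phone", "case"], ["leather", "case"])]

# Hash each query-word x description-word combination to a fixed index.
hasher = FeatureHasher(n_features=2**20, input_type="string")
X_cross = hasher.transform(
    [f"{qw}_{dw}" for qw in q for dw in d] for q, d in pairs
)
print(X_cross.shape)  # (2, 1048576) sparse CSR matrix
```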
By adhering to these principles, from foundational data processing and robust feature design through advanced model formulation, carefully engineered ensembles, and rigorous temporal evaluation, researchers can develop personalized re-ranking systems that match Alibaba's operational scale and rigor (Wu et al., 2017).