Session-Based CTR Prediction via BiLSTM
- The paper introduces a session-based CTR prediction model that segments user behaviors into temporally coherent sessions, uses bias-encoded self-attention to capture intra-session interests, and applies a bidirectional LSTM to model inter-session dynamics.
- Its methodology integrates bias-encoded multi-head self-attention and a local activation unit within the BiLSTM framework, achieving notable AUC improvements (e.g., from 0.6343 to 0.6375) on industrial datasets.
- The significance lies in unifying detailed session segmentation with advanced deep learning techniques, making the approach adaptable for applications in both advertising and recommendation systems.
Session-based prediction via Bidirectional LSTM is a framework for modeling user behavior sequences in click-through rate (CTR) prediction. This approach decomposes long-term behavior histories into temporally coherent sessions, extracts session-level interest representations using self-attention with bias encoding, and captures inter-session interest dynamics through bidirectional LSTM. The Deep Session Interest Network (DSIN) exemplifies this methodology by further weighting session contributions to prediction using a local activation unit, resulting in improved performance over previous models in industrial advertising and recommender system applications (Feng et al., 2019).
1. Session Segmentation and Representation
User behavior histories, denoted $\mathbf{S} = [\mathbf{b}_1; \mathbf{b}_2; \ldots; \mathbf{b}_N]$, are partitioned into a sequence of sessions based on temporal proximity. A new session is initiated when the time gap between consecutive behaviors exceeds 30 minutes, i.e., if $t_{i+1} - t_i > 30\,\text{min}$. The $k$-th session is $\mathbf{Q}_k = [\mathbf{b}_1; \ldots; \mathbf{b}_T] \in \mathbb{R}^{T \times d_{model}}$, where up to $T$ user actions are retained per session. This segmentation is grounded on the empirical observation that behaviors are highly homogeneous within sessions but heterogeneous across sessions.
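The segmentation rule can be sketched as follows; this is a minimal illustration, where the function name, the `max_len` cap (standing in for the per-session limit $T$), and the use of behavior indices are assumptions, not the paper's implementation.

```python
def split_into_sessions(timestamps, gap_minutes=30, max_len=10):
    """Partition a user's behavior indices into sessions.

    A new session starts whenever the gap between consecutive behaviors
    exceeds `gap_minutes`; each session keeps at most `max_len` actions
    (a hypothetical cap standing in for T). `timestamps` is a sorted
    sequence of event times in minutes.
    """
    sessions, current = [], [0]
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] > gap_minutes:
            sessions.append(current[:max_len])
            current = [i]
        else:
            current.append(i)
    sessions.append(current[:max_len])
    return sessions

# Behaviors at minutes 0, 1, 3 form one session; minute 100 starts another.
print(split_into_sessions([0, 1, 3, 100, 101]))  # → [[0, 1, 2], [3, 4]]
```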
2. Intra-Session Interest Extraction via Biased Multi-Head Self-Attention
Each session embedding is enhanced by a three-way bias encoding to inject: (a) the session index $k$, (b) the time step $t$ within the session, and (c) the embedding dimension $c$. The bias tensor is $\mathbf{BE} \in \mathbb{R}^{K \times T \times d_{model}}$ with $\mathbf{BE}_{(k,t,c)} = \mathbf{w}^K_k + \mathbf{w}^T_t + \mathbf{w}^C_c$, where $\mathbf{w}^K \in \mathbb{R}^K$, $\mathbf{w}^T \in \mathbb{R}^T$, $\mathbf{w}^C \in \mathbb{R}^{d_{model}}$. This tensor is added to the session inputs as $\mathbf{Q}_k \leftarrow \mathbf{Q}_k + \mathbf{BE}_k$ for all $k$.
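Because the bias tensor is a sum of three vectors indexed along different axes, it can be built by broadcasting. A minimal NumPy sketch, with illustrative shapes and random values in place of learned parameters:

```python
import numpy as np

# Three-way bias encoding: BE[k, t, c] = wK[k] + wT[t] + wC[c].
K, T, d_model = 4, 5, 8                # sessions, steps per session, embedding dim
rng = np.random.default_rng(0)
wK = rng.normal(size=(K, 1, 1))        # session-index bias
wT = rng.normal(size=(1, T, 1))        # within-session position bias
wC = rng.normal(size=(1, 1, d_model))  # embedding-dimension bias

BE = wK + wT + wC                      # broadcasts to (K, T, d_model)
Q = rng.normal(size=(K, T, d_model))   # stand-in for session embeddings
Q_biased = Q + BE                      # bias added to every session's inputs

assert BE.shape == (K, T, d_model)
assert np.allclose(BE[2, 3], wK[2, 0, 0] + wT[0, 3, 0] + wC[0, 0])
```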
Each session then undergoes multi-head self-attention. $\mathbf{Q}_k$ is divided into $H$ heads of size $d_h = d_{model}/H$, and for head $h$,
$$\text{head}_h = \text{Attention}(\mathbf{Q}_{kh}\mathbf{W}^Q, \mathbf{Q}_{kh}\mathbf{W}^K, \mathbf{Q}_{kh}\mathbf{W}^V) = \text{softmax}\!\left(\frac{\mathbf{Q}_{kh}\mathbf{W}^Q(\mathbf{Q}_{kh}\mathbf{W}^K)^\top}{\sqrt{d_h}}\right)\mathbf{Q}_{kh}\mathbf{W}^V.$$
The results across heads are concatenated, projected, and passed through a feed-forward network (FFN) with residual connections and layer normalization, yielding $\mathbf{I}_k^Q = \text{FFN}(\text{Concat}(\text{head}_1, \ldots, \text{head}_H)\mathbf{W}^O)$. The session-level interest embedding is then obtained by average pooling over the session's $T$ steps: $\mathbf{I}_k = \text{Avg}(\mathbf{I}_k^Q)$.
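The intra-session extraction step can be sketched in NumPy. This is a simplified illustration: the projection matrices are random stand-ins for learned parameters, and the FFN/residual/LayerNorm wrapper is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def session_interest(Qk, Wq, Wk, Wv, Wo, H):
    """Multi-head self-attention over one session, then average pooling.

    Qk: (T, d_model) biased session embedding. Per-head projections and
    the output projection Wo are illustrative dense matrices.
    """
    T, d_model = Qk.shape
    d_h = d_model // H
    heads = []
    for h in range(H):
        Qh = Qk[:, h * d_h:(h + 1) * d_h]      # split input into heads
        q, k, v = Qh @ Wq[h], Qh @ Wk[h], Qh @ Wv[h]
        att = softmax(q @ k.T / np.sqrt(d_h))  # scaled dot-product attention
        heads.append(att @ v)
    IkQ = np.concatenate(heads, axis=-1) @ Wo  # (T, d_model)
    return IkQ.mean(axis=0)                    # avg pool → session interest I_k

rng = np.random.default_rng(1)
T, d_model, H = 5, 8, 2
d_h = d_model // H
Qk = rng.normal(size=(T, d_model))
Wq = rng.normal(size=(H, d_h, d_h)); Wk = rng.normal(size=(H, d_h, d_h))
Wv = rng.normal(size=(H, d_h, d_h)); Wo = rng.normal(size=(d_model, d_model))
Ik = session_interest(Qk, Wq, Wk, Wv, Wo, H)
assert Ik.shape == (d_model,)
```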
3. Inter-Session Modeling Using Bidirectional LSTM
Session interest embeddings $\mathbf{I}_1, \ldots, \mathbf{I}_K$ serve as the input sequence to a bidirectional LSTM (Bi-LSTM). For each timestep $t$ (corresponding to session $t$), the cell computations, with peephole connections, proceed as follows:
$$\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_{xi}\mathbf{I}_t + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{W}_{ci}\mathbf{c}_{t-1} + \mathbf{b}_i),\\
\mathbf{f}_t &= \sigma(\mathbf{W}_{xf}\mathbf{I}_t + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{W}_{cf}\mathbf{c}_{t-1} + \mathbf{b}_f),\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_{xc}\mathbf{I}_t + \mathbf{W}_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c),\\
\mathbf{o}_t &= \sigma(\mathbf{W}_{xo}\mathbf{I}_t + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{W}_{co}\mathbf{c}_t + \mathbf{b}_o),\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}$$
Forward and backward passes generate $\overrightarrow{\mathbf{h}}_k$ and $\overleftarrow{\mathbf{h}}_k$, concatenated as $\mathbf{H}_k = \overrightarrow{\mathbf{h}}_k \oplus \overleftarrow{\mathbf{h}}_k$. This encodes both preceding and succeeding session interest dynamics for each session position, capturing the temporal evolution and context of user interests.
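The bidirectional pass can be sketched as two one-direction LSTM runs over the session-interest sequence, with the backward run reversed and re-aligned before concatenation. This is an illustrative NumPy sketch with random stand-in weights; the peephole terms are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(X, W, U, b, d):
    """One-direction LSTM over session interests X: (K, d_in).

    W, U, b pack the input/forget/candidate/output gates into one
    matrix each (a common packing trick); returns all hidden states.
    """
    K = X.shape[0]
    h, c = np.zeros(d), np.zeros(d)
    Hs = np.zeros((K, d))
    for t in range(K):
        z = X[t] @ W + h @ U + b             # all four gates at once, (4d,)
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # cell state update
        h = o * np.tanh(c)                   # hidden state
        Hs[t] = h
    return Hs

rng = np.random.default_rng(2)
K, d_in, d = 4, 8, 6
I = rng.normal(size=(K, d_in))               # session interest sequence
params = lambda: (rng.normal(size=(d_in, 4 * d)) * 0.1,
                  rng.normal(size=(d, 4 * d)) * 0.1,
                  np.zeros(4 * d), d)
Hf = lstm_pass(I, *params())                 # forward pass
Hb = lstm_pass(I[::-1], *params())[::-1]     # backward pass, re-aligned
H = np.concatenate([Hf, Hb], axis=-1)        # H_k = h_fk ⊕ h_bk
assert H.shape == (K, 2 * d)
```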
4. Session Relevance Weighting via Local Activation Unit
For prediction, session contributions are adaptively weighted via a local activation unit (an attention mechanism at the session level). Given the target item embedding $\mathbf{X}^I$, attention scores over both the raw session interests and the context-enriched Bi-LSTM outputs are computed:
$$a_k^I = \frac{\exp(\mathbf{I}_k \mathbf{W}^I \mathbf{X}^I)}{\sum_{j=1}^{K} \exp(\mathbf{I}_j \mathbf{W}^I \mathbf{X}^I)}, \qquad \mathbf{U}^I = \sum_{k=1}^{K} a_k^I \mathbf{I}_k,$$
$$a_k^H = \frac{\exp(\mathbf{H}_k \mathbf{W}^H \mathbf{X}^I)}{\sum_{j=1}^{K} \exp(\mathbf{H}_j \mathbf{W}^H \mathbf{X}^I)}, \qquad \mathbf{U}^H = \sum_{k=1}^{K} a_k^H \mathbf{H}_k,$$
where $\mathbf{W}^I$ and $\mathbf{W}^H$ are learnable parameter matrices. $\mathbf{U}^I$ and $\mathbf{U}^H$ are the final summary vectors aggregating session information, modulated by their relevance to the target item.
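The same activation unit applies to both the raw interests and the Bi-LSTM states, so it can be sketched once as a bilinear softmax weighting. Shapes and weights below are illustrative stand-ins:

```python
import numpy as np

def activation_unit(reps, W, x_target):
    """Target-aware weighting of session representations.

    reps: (K, d) session interests (I) or Bi-LSTM states (H);
    W: (d, d_i) bilinear weight; x_target: (d_i,) target item embedding.
    Returns the relevance-weighted summary vector (U^I or U^H).
    """
    scores = reps @ W @ x_target        # (K,) bilinear relevance scores
    a = np.exp(scores - scores.max())   # numerically stable softmax
    a = a / a.sum()
    return a @ reps                     # attention-weighted sum over sessions

rng = np.random.default_rng(3)
K, d, d_i = 4, 6, 5
I = rng.normal(size=(K, d))             # session interests I_1..I_K
W = rng.normal(size=(d, d_i))
x_t = rng.normal(size=(d_i,))           # target item embedding X^I
U = activation_unit(I, W, x_t)
assert U.shape == (d,)
```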
5. Prediction, Training, and Implementation Details
The final prediction vector is constructed by concatenating the user profile embedding $\mathbf{X}^U$, the item profile embedding $\mathbf{X}^I$, and the two session summary vectors: $[\mathbf{X}^U; \mathbf{X}^I; \mathbf{U}^I; \mathbf{U}^H]$. A multilayer perceptron (MLP) then outputs the click-through probability $p(x)$ via a sigmoid activation. End-to-end training minimizes the binary cross-entropy loss:
$$L = -\frac{1}{N}\sum_{(x, y) \in \mathcal{D}} \left( y \log p(x) + (1 - y)\log(1 - p(x)) \right),$$
where $y \in \{0, 1\}$ is the click label and $N = |\mathcal{D}|$.
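The binary cross-entropy objective can be sketched directly; the clipping epsilon is an assumption added for numerical stability, not part of the paper:

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy averaged over the training set.

    y: click labels in {0, 1}; p: predicted click probabilities.
    Probabilities are clipped away from 0/1 to avoid log(0).
    """
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.1, 0.8])
print(round(bce_loss(y, p), 4))  # → 0.1446
```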
Key implementation details include: an embedding dimension $d_{model}$ shared across all inputs; up to $T$ actions retained per session; the $K$ most recent sessions per user; a single-layer LSTM per direction in the Bi-LSTM; residual connections and layer normalization in the attention/FFN blocks; and standard dropout in the MLPs.
6. Applications and Empirical Performance
Evaluation was conducted on two datasets: the Alimama advertising dataset (26 million ad logs over 1 million users and 800,000 ads) and the Alibaba production recommender dataset (6 billion logs, 100 million users, 70 million items). Models were trained on seven days of data and evaluated on the eighth.
The Area Under the Receiver Operating Characteristic Curve (AUC) was employed as the performance metric. DSIN achieved AUC = 0.6375 on the advertising dataset, surpassing DIEN (0.6343), and AUC = 0.6515 on the recommender dataset. This establishes DSIN as outperforming prior state-of-the-art methods under these benchmarks (Feng et al., 2019).
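AUC, the metric used here, equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which gives a simple pairwise sketch (an illustration of the metric, not the paper's evaluation code):

```python
import numpy as np

def auc(labels, scores):
    """AUC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive scores higher;
    ties count as half a win."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

print(auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]))  # → 1.0 (perfect ranking)
print(auc([1, 0], [0.3, 0.7]))                  # → 0.0 (inverted ranking)
```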
7. Significance and Extensions
Session-based modeling via Bi-LSTM, as implemented by DSIN, unifies fine-grained session segmentation, bias-enhanced intra-session interest extraction, bidirectional inter-session interest modeling, and adaptive session relevance weighting. The empirical findings underscore the value of respecting session boundaries and tracking both homogeneous within-session behavior and heterogeneous cross-session dynamics for CTR prediction. A plausible implication is that such layered modeling architectures may generalize to other temporal user modeling tasks beyond CTR, provided similar behavioral session structures and interest dynamics are present (Feng et al., 2019).