ENCODE: Efficient Two-Stage CTR Modeling
- The paper introduces ENCODE, a two-stage framework that leverages full-sequence clustering and metric learning to extract relevant user interests for accurate CTR prediction.
- Its offline extraction stage uses metric learning-based dimensionality reduction and KMeans clustering to form multi-interest representations from long user behavior sequences.
- The online inference stage uses a unified attention mechanism to rapidly generate target-aligned interest vectors, significantly reducing computational latency.
EfficieNt Clustering based twO-stage interest moDEling (ENCODE) is a two-stage framework designed to efficiently model long-term user behavior sequences for click-through rate (CTR) prediction. ENCODE addresses the two primary challenges in long-term sequence modeling: utilizing the entire user history (R1) and extracting interests highly relevant to the current target item (R2). The pipeline comprises an offline extraction stage, which discovers multi-faceted user interests via clustering and metric learning-based dimensionality reduction, followed by an online inference stage that rapidly computes target-aligned interest representations, with a single attention-based relevance metric shared across both stages.
1. Theoretical Foundations and Motivation
ENCODE is motivated by limitations in prior models that either sample only part of the sequence (resulting in information loss) or employ target-attention over the full sequence (yielding high accuracy but prohibitive inference cost for online serving). Existing retrieval-based methods often break alignment between the extraction of interests and the subsequent relevance calculation with target items, negatively affecting predictive performance (Zhou et al., 19 Aug 2025).
ENCODE’s two guiding requirements are:
- R1: Leverage the entire behavior sequence so that no information is discarded.
- R2: Ensure high relevance between the extracted interests and the current target item by using a consistent relevance metric across both stages.
The method is designed to break the trade-off between full-information modeling and online serving efficiency that constrains previous systems.
2. Offline Extraction: Metric Learning and Clustering
In the offline phase, ENCODE operates on a user behavior sequence of substantial length (often hundreds to thousands of events). User interests are operationalized as sub-interests, assumed to reside in clusters of similar behaviors. The full sequence is encoded as behavior embeddings $\{e_1, \dots, e_L\}$ with $e_i \in \mathbb{R}^d$.
Metric Learning-Based Dimensionality Reduction
Given the high-dimensional nature of typical behavior embeddings, ENCODE reduces clustering overhead by learning a linear projection
$$\tilde{e}_i = W e_i, \qquad W \in \mathbb{R}^{d' \times d},\ d' \ll d \quad \text{[Equation (A)]}$$
A metric learning approach optimizes $W$ to preserve the relative pairwise distances between the projected embeddings, maintaining the semantic structure required for clustering.
For each behavior, positive/negative samples are selected by the distance relationships in the original space, and the following triplet loss is minimized:
$$\mathcal{L}_{\text{tri}} = \sum_{i} \max\bigl(0,\; D(\tilde{e}_i, \tilde{e}_i^{+}) - D(\tilde{e}_i, \tilde{e}_i^{-}) + m_i\bigr) \quad \text{[Equation (B)]}$$
with a dynamic margin
$$m_i = D(e_i, e_i^{-}) - D(e_i, e_i^{+}) \quad \text{[Equation (C)]}$$
where $D(\cdot, \cdot)$ is a cosine or Euclidean distance and $\tilde{e}_i^{+}, \tilde{e}_i^{-}$ are the positive/negative projected samples.
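A minimal PyTorch sketch of this projection learning follows; the dimensions, the distance-based sampling of positives/negatives, and the optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

d, d_proj, L = 64, 16, 1000                      # original dim, reduced dim, sequence length (assumed)
W = torch.randn(d, d_proj, requires_grad=True)   # learned projection matrix
opt = torch.optim.Adam([W], lr=1e-3)

E = F.normalize(torch.randn(L, d), dim=-1)       # stand-in behavior embeddings

for step in range(200):
    # Sample anchors; pick positives/negatives by distance in the ORIGINAL space.
    idx = torch.randint(0, L, (256,))
    anchors = E[idx]
    dists = torch.cdist(anchors, E)                       # (256, L) original-space distances
    pos = E[dists.topk(2, largest=False).indices[:, 1]]   # nearest neighbor (index 0 is the anchor itself)
    neg = E[dists.topk(1, largest=True).indices[:, 0]]    # farthest behavior

    # Triplet loss in projected space, with a dynamic margin taken from the
    # original-space distance gap so relative distances are preserved.
    a_p, p_p, n_p = anchors @ W, pos @ W, neg @ W
    margin = ((anchors - neg).norm(dim=-1) - (anchors - pos).norm(dim=-1)).detach()
    loss = F.relu((a_p - p_p).norm(dim=-1) - (a_p - n_p).norm(dim=-1) + margin).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this sketch the margin grows with how separated the triplet already is in the original space, pushing the projection to respect those relative distances rather than a fixed constant gap.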
Clustering for Multi-Interest Extraction
The reduced representations $\{\tilde{e}_i\}$ are clustered using a standard algorithm such as KMeans to produce $K$ clusters. For cluster $k$ with center $c_k$ and member indices $\mathcal{I}_k$, the cluster's multi-interest representation is computed by a weighted aggregation:
$$v_k = \sum_{i \in \mathcal{I}_k} \frac{\exp\bigl(\mathrm{sim}(\tilde{e}_i, c_k)\bigr)}{\sum_{j \in \mathcal{I}_k} \exp\bigl(\mathrm{sim}(\tilde{e}_j, c_k)\bigr)}\, e_i \quad \text{[Equation (D)]}$$
The similarity function is a scaled dot product:
$$\mathrm{sim}(x, y) = \frac{x^{\top} y}{\sqrt{d'}} \quad \text{[Equation (E)]}$$
This preserves high-order information and yields a relevance-weighted interest representation, aligned with the metric later used for target matching, rather than a simple unweighted cluster centroid.
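Continuing the sketch, here is a hypothetical extraction step using scikit-learn's KMeans; the scaled dot-product similarity is the same assumption as above, standing in for whichever metric Equation (E) specifies.

```python
import numpy as np
from sklearn.cluster import KMeans

K, d, d_proj, L = 8, 64, 16, 1000
E_full = np.random.randn(L, d)       # original behavior embeddings (stand-in data)
E_proj = np.random.randn(L, d_proj)  # projected embeddings from the step above

km = KMeans(n_clusters=K, n_init=10).fit(E_proj)

interests = []
for k in range(K):
    members = np.where(km.labels_ == k)[0]
    center = km.cluster_centers_[k]
    # Softmax over similarity to the cluster center, so behaviors closer
    # to the center contribute more to the cluster's interest vector.
    sims = E_proj[members] @ center / np.sqrt(d_proj)     # assumed scaled dot product
    w = np.exp(sims - sims.max()); w /= w.sum()
    interests.append(w @ E_full[members])                 # aggregate in the ORIGINAL space
interests = np.stack(interests)                           # (K, d) multi-interest set
```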
3. Online Inference: Attention-Based Interest Matching
During online serving, the previously extracted multi-interest set $\{v_1, \dots, v_K\}$ and a target item embedding $e_t$ are available for rapid matching.
Target-Aware Attention Mechanism
ENCODE computes the final user interest vector $u$ as a weighted sum over the multi-interests, with weights derived from the same relevance metric used in the offline stage:
$$u = \sum_{k=1}^{K} \frac{\exp\bigl(\mathrm{sim}(v_k, e_t)\bigr)}{\sum_{k'=1}^{K} \exp\bigl(\mathrm{sim}(v_{k'}, e_t)\bigr)}\, v_k \quad \text{[Equation (F)]}$$
This architecture guarantees that interests discovered offline align with the relevance computation in real-time prediction, directly satisfying requirement R2.
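The online step then reduces to a softmax-weighted sum over the $K$ cached interests, as in this small sketch (same illustrative similarity as above):

```python
import numpy as np

def match_interests(interests: np.ndarray, target: np.ndarray) -> np.ndarray:
    """interests: (K, d) cached offline; target: (d,) target item embedding."""
    scores = interests @ target / np.sqrt(target.shape[0])  # same assumed similarity as offline
    w = np.exp(scores - scores.max()); w /= w.sum()          # softmax attention weights
    return w @ interests                                     # final target-aligned user vector

user_vec = match_interests(np.random.randn(8, 64), np.random.randn(64))
```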
4. Computational Complexity and Efficiency
The method is designed for scalability:
- Offline stage:
  - Dimensionality reduction: $O(L d d')$
  - Clustering: $O(T L K d')$ ($T$ = iterations)
  - Interest extraction: $O(L d)$
- Online stage:
  - For $B$ candidate items, attention computation is $O(B K d)$
By reducing the long sequence of $L$ behaviors into just $K$ attentively-aggregated interest vectors (typically $K \ll L$), ENCODE achieves inference latency dramatically lower than full-sequence attention models, with computation now linear in $K$ rather than $L$, as the quick calculation below illustrates.
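The numbers below are illustrative assumptions, not figures from the paper:

```python
# Illustrative numbers only, not results reported in the paper.
L, K, d, B = 1000, 8, 64, 100        # sequence length, clusters, embedding dim, candidates
full_attention = B * L * d           # target attention over every behavior: O(B L d)
encode_online = B * K * d            # attention over K cached interests: O(B K d)
print(full_attention / encode_online)  # 125.0 -> ~125x fewer multiply-adds per request
```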
5. Empirical Performance and Comparative Analysis
ENCODE was benchmarked against state-of-the-art methods (including SIM (Pi et al., 2020), SDIM (Cao et al., 2022), and others) on both industrial-scale datasets (hundreds of millions of records) and large public datasets (Amazon Books; MovieLens 32M).
- Metrics: CTR AUC, Group AUC (GAUC), online inference latency
- Findings:
- ENCODE matches or slightly trails the “upper bound” established by models with full target-attention over all historical behaviors (e.g., DIN-L), but at a fraction of the computational cost.
- On all datasets tested, ENCODE outperforms retrieval/sampling methods in CTR AUC and GAUC, attributed to its full-sequence utilization and consistent relevance metric alignment.
- Inference latency is significantly reduced relative to full attention models.
This demonstrates that ENCODE breaks the prevailing performance-efficiency trade-off by attaining near-optimal CTR prediction with production-grade latency.
6. Significance and Relation to Prior Art
ENCODE distinguishes itself from prior retrieval-based multi-interest frameworks (e.g., SIM Hard, SIM Soft, SDIM (Cao et al., 2022)) by deploying a unified relevance metric at both extraction and matching stages, ensuring that the interests surfaced offline are directly compatible with target-specific attention during inference.
Other contemporary approaches such as TWIN (Chang et al., 2023) and RimiRec (Pei et al., 2 Feb 2024) share elements of multi-interest extraction or two-stage architectures, but ENCODE’s metric learning plus clustering strategy achieves both maximal information retention and computational tractability. Unlike holistic interest compression methods (e.g., CHIME (Bai et al., 9 Apr 2025)), ENCODE clusters and then attentively aggregates, maintaining interpretability and online efficiency.
7. Practical Considerations and Future Directions
ENCODE is designed for real-world recommender systems requiring both high-fidelity interest modeling and fast response times. The clustering process can adapt to behavioral growth and shifts by periodically retraining on fresh data. Metric learning-based reduction can be further extended with advanced contrastive or self-supervised objectives.
This suggests ENCODE may serve as a blueprint for future hybrid models combining offline multi-interest extraction via clustering and efficient online target-attention fusion, applicable to domains beyond CTR, such as search ranking, personalized advertising, or content recommendation at scale.