Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations (2007.07203v2)

Published 12 Jul 2020 in cs.IR, cs.LG, and stat.ML

Abstract: One of the core problems in large-scale recommendations is to retrieve top relevant candidates accurately and efficiently, preferably in sub-linear time. Previous approaches are mostly based on a two-step procedure: first learn an inner-product model, and then use some approximate nearest neighbor (ANN) search algorithm to find top candidates. In this paper, we present Deep Retrieval (DR), to learn a retrievable structure directly with user-item interaction data (e.g. clicks) without resorting to the Euclidean space assumption in ANN algorithms. DR's structure encodes all candidate items into a discrete latent space. Those latent codes for the candidates are model parameters and learnt together with other neural network parameters to maximize the same objective function. With the model learnt, a beam search over the structure is performed to retrieve the top candidates for reranking. Empirically, we first demonstrate that DR, with sub-linear computational complexity, can achieve almost the same accuracy as the brute-force baseline on two public datasets. Moreover, we show that, in a live production recommendation system, a deployed DR approach significantly outperforms a well-tuned ANN baseline in terms of engagement metrics. To the best of our knowledge, DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.

Citations (18)

Summary

  • The paper introduces Deep Retrieval (DR), which learns a retrievable structure over a discrete latent space for efficient candidate retrieval in recommendation systems.
  • It uses per-layer MLPs and beam search over a structure of D layers with K nodes each, learning from user-item interactions without relying on the Euclidean-space assumptions of conventional ANN methods.
  • Experimental results on MovieLens-20M and Amazon Books show accuracy competitive with brute force, and a live deployment shows significant improvements in user engagement metrics.

The paper introduces Deep Retrieval (DR), a novel method designed for large-scale recommendation systems, addressing the challenge of efficiently retrieving relevant candidates in sub-linear time. DR distinguishes itself from traditional approaches by directly learning a retrievable structure from user-item interaction data, such as clicks, without relying on the Euclidean space assumption inherent in Approximate Nearest Neighbor (ANN) algorithms.

DR encodes all candidate items into a discrete latent space, where the latent codes for these candidates are model parameters learned in conjunction with other neural network parameters. This learning process is geared towards maximizing a singular objective function. Once the model is trained, a beam search is executed over the learned structure to retrieve the top candidates for re-ranking.

Key components and design considerations of DR include:

  • A structure model consisting of $D$ layers, each with $K$ nodes. Each layer uses a multi-layer perceptron (MLP) and a $K$-class softmax to output a distribution over its $K$ nodes.
  • An item-to-path mapping $\pi: \mathcal{V} \to [K]^D$, where $\mathcal{V}$ is the set of all items. A path $c$ picks one of the $K$ node indices at each of the $D$ layers in order, so each path has length $D$ with index values in $\{1, 2, \dots, K\}$; consequently there are $K^D$ possible paths, each representing a cluster of items (one possible in-memory representation is sketched after this list).
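
One possible in-memory representation of the mapping $\pi$ and its inverted index is sketched below. The sizes and the random initialization are placeholders (in practice $\pi$ is learned), and node indices are written $\{0, \dots, K-1\}$ for convenience.

```python
import numpy as np
from collections import defaultdict

# Illustrative sizes, not the paper's production values.
num_items, D, K = 10_000, 3, 100

# pi[i] is the path assigned to item i: a length-D vector of node indices.
# There are K**D possible paths in total; here paths start out random.
rng = np.random.default_rng(0)
pi = rng.integers(0, K, size=(num_items, D))

# Inverted index: path -> items mapped to it, used at serving time to turn
# retrieved paths back into candidate items.
path_to_items = defaultdict(list)
for item, path in enumerate(pi):
    path_to_items[tuple(path)].append(item)
```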

The model learns a probability distribution over the paths given user inputs, concurrently with a mapping from items to paths. During the serving phase, beam search is employed to identify the most probable paths and the items associated with them.

The probability of a path $c$ given a user $x$, denoted $p(c \mid x, \theta)$, is constructed layer by layer:

  • The first layer takes the user embedding ${\rm emb}(x)$ as input and outputs a probability $p(c_1 \mid x, \theta_1)$ over its $K$ nodes, based on parameters $\theta_1$.
  • Each subsequent layer $d$ concatenates the user embedding ${\rm emb}(x)$ with the embeddings of the nodes chosen in the preceding layers, ${\rm emb}(c_1), \dots, {\rm emb}(c_{d-1})$, as input to an MLP, which outputs $p(c_d \mid x, c_1, \dots, c_{d-1}, \theta_d)$ over the $K$ nodes of layer $d$, based on parameters $\theta_d$.
  • The path probability is the product of the per-layer probabilities (a code sketch of this construction follows the list): $p(c \mid x, \theta) = \prod_{d=1}^{D} p(c_d \mid x, c_1, \dots, c_{d-1}, \theta_d)$, where
    • $p(c \mid x, \theta)$ is the probability of path $c$ given user $x$,
    • $D$ is the number of layers,
    • $c_d$ is the node in layer $d$ along path $c$,
    • $x$ is the user, and
    • $\theta_d$ denotes the parameters of layer $d$.
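
A minimal PyTorch sketch of this layer-by-layer construction is shown below; the module names, hidden sizes, and the shared embedding dimension are illustrative assumptions rather than the paper's exact architecture, and layers are 0-indexed.

```python
import torch
import torch.nn as nn

class DRStructureModel(nn.Module):
    """Sketch of the DR structure model: D layers, each an MLP followed by a
    K-class softmax over that layer's nodes."""

    def __init__(self, emb_dim=64, D=3, K=100, hidden=128):
        super().__init__()
        self.D, self.K = D, K
        # One embedding table per layer for its K nodes (fed to later layers).
        self.node_emb = nn.ModuleList(nn.Embedding(K, emb_dim) for _ in range(D))
        # Layer d (0-indexed) consumes [emb(x); emb(c_0); ...; emb(c_{d-1})]
        # and emits K logits.
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(emb_dim * (d + 1), hidden),
                nn.ReLU(),
                nn.Linear(hidden, K),
            )
            for d in range(D)
        )

    def path_log_prob(self, user_emb, path):
        """log p(c | x, theta) for a batch.
        user_emb: (B, emb_dim); path: (B, D) long tensor of node indices."""
        inputs = [user_emb]
        log_p = 0.0
        for d in range(self.D):
            logits = self.mlps[d](torch.cat(inputs, dim=-1))        # (B, K)
            log_probs = torch.log_softmax(logits, dim=-1)
            log_p = log_p + log_probs.gather(1, path[:, d:d + 1]).squeeze(1)
            inputs.append(self.node_emb[d](path[:, d]))             # chosen node's embedding
        return log_p                                                # (B,)
```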

To enhance the model's capacity to express multi-aspect information, DR allows each item $y_i$ to be assigned to $J$ different paths $\{c_{i,1}, \dots, c_{i,J}\}$. The multi-path structure objective is defined as $\mathcal{Q}_{\rm str}(\theta, \pi) = \sum_{i=1}^N \log \left( \sum_{j=1}^J p(c_{i,j} = \pi_j(y_i) \mid x_i, \theta) \right)$, where the probability of belonging to multiple paths is the sum of the probabilities of belonging to the individual paths.
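
Continuing the sketch above, the inner sum over an item's $J$ assigned paths can be computed as a log-sum-exp of per-path log-probabilities; the helper below is an assumed utility, not code from the paper.

```python
import torch

def multi_path_log_prob(model, user_emb, item_paths):
    """log( sum_j p(c_{i,j} | x_i, theta) ) for a batch, scored with the
    DRStructureModel sketched earlier.
    user_emb: (B, emb_dim); item_paths: (B, J, D) paths assigned to each target item."""
    J = item_paths.shape[1]
    per_path = torch.stack(
        [model.path_log_prob(user_emb, item_paths[:, j]) for j in range(J)], dim=1
    )                                         # (B, J) log-probabilities
    return torch.logsumexp(per_path, dim=1)   # (B,)

# Q_str is the sum of this quantity over all training pairs (x_i, y_i).
```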

To prevent the model from collapsing by allocating all items to a single path, a penalized likelihood function is introduced: $\mathcal{Q}_{\rm pen}(\theta, \pi) = \mathcal{Q}_{\rm str}(\theta, \pi) - \alpha \cdot \sum_{c \in [K]^D} f(|c|)$, where $\alpha$ is the penalty factor, $|c|$ denotes the number of items allocated to path $c$, and $f$ is an increasing and convex function, e.g. $f(|c|) = |c|^4/4$.
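
A direct translation of the penalty term, assuming $f(|c|) = |c|^4/4$ as in the example above; the default value of $\alpha$ below is an arbitrary placeholder.

```python
import torch

def path_size_penalty(path_sizes, alpha=3e-6):
    """alpha * sum_c f(|c|) with f(s) = s**4 / 4, where path_sizes lists the
    number of items currently allocated to each non-empty path."""
    sizes = torch.as_tensor(path_sizes, dtype=torch.float32)
    return alpha * (sizes.pow(4) / 4.0).sum()

# Example: three paths currently holding 2, 5 and 1 items.
penalty = path_size_penalty([2, 5, 1])
```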

In the inference stage, beam search is used to retrieve the most probable paths. At each layer, the algorithm keeps the top $B$ nodes among all successors of the nodes selected at the previous layer, and returns the top $B$ paths at the final layer.
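
A sketch of this beam search over the structure model defined earlier, for a single user and written with plain Python loops rather than a batched implementation:

```python
import torch

def beam_search_paths(model, user_emb, beam_size):
    """Return the beam_size most probable paths (and their log-probabilities)
    for one user, expanding the DRStructureModel layer by layer.
    user_emb: (emb_dim,) tensor."""
    # Each beam entry: (path so far, cumulative log-prob, MLP input features).
    beams = [([], 0.0, [user_emb])]
    for d in range(model.D):
        candidates = []
        for path, score, inputs in beams:
            log_probs = torch.log_softmax(model.mlps[d](torch.cat(inputs, dim=-1)), dim=-1)
            top_lp, top_nodes = log_probs.topk(beam_size)
            for lp, node in zip(top_lp.tolist(), top_nodes.tolist()):
                node_emb = model.node_emb[d](torch.tensor([node])).squeeze(0)
                candidates.append((path + [node], score + lp, inputs + [node_emb]))
        # Keep only the best beam_size partial paths for the next layer.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return [(tuple(path), score) for path, score, _ in beams]

# The candidate set is the union of path_to_items[path] over the returned
# paths (using the inverted index sketched earlier), then passed to re-ranking.
```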

The model is trained using an Expectation-Maximization (EM) type algorithm. The EM algorithm involves iteratively optimizing the model parameters $\theta$ for a fixed mapping $\pi$ (E-step) and updating the mapping $\pi$ to maximize the objective function given the updated parameters (M-step).
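
A schematic of this alternation, reusing the helpers above; `reassign_paths` is a hypothetical callback standing in for the paper's path re-assignment step, and the data loader is assumed to yield user embeddings together with the paths currently assigned to the clicked items.

```python
import torch

def train_dr(model, data_loader, reassign_paths, num_em_rounds=5, lr=1e-3):
    """EM-style training sketch: gradient steps on theta with pi fixed,
    followed by a re-assignment of item paths given the updated theta."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_em_rounds):
        # E-step-like phase: pi fixed, maximize Q_str over theta
        # (the penalty term depends only on pi, so it is constant here).
        for user_emb, item_paths in data_loader:
            loss = -multi_path_log_prob(model, user_emb, item_paths).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # M-step-like phase: theta fixed, update pi to increase the penalized
        # objective Q_pen = Q_str - alpha * sum_c f(|c|).
        reassign_paths(model)
```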

To further improve performance, the DR model is jointly trained with a re-ranking model, specifically a softmax model with output size $V$, where $V$ is the total number of items. The final objective function combines the penalized likelihood and the softmax objective: $\mathcal{Q} = \mathcal{Q}_{\rm pen} + \mathcal{Q}_{\rm softmax}$.
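
A sketch of the combined loss, again reusing the helpers above. The re-ranking head here is simply a linear layer with a full softmax over all items that shares the user embedding; that wiring is an assumption for illustration, not necessarily the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, num_items = 64, 10_000                 # illustrative sizes
rerank_head = nn.Linear(emb_dim, num_items)     # softmax re-ranking model over all V items

def joint_loss(model, user_emb, item_paths, target_items):
    """Negative of Q = Q_pen + Q_softmax for one batch; the penalty term is
    omitted because it depends only on the mapping pi, not on theta."""
    q_str = multi_path_log_prob(model, user_emb, item_paths).mean()
    q_softmax = -F.cross_entropy(rerank_head(user_emb), target_items)
    return -(q_str + q_softmax)
```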

The paper includes experiments on two public datasets, MovieLens-20M and Amazon Books, to evaluate the performance of DR. The results demonstrate that DR achieves accuracy comparable to brute-force retrieval while maintaining sub-linear computational complexity. Furthermore, DR was deployed in a live production recommendation system with hundreds of millions of users and items, where it significantly outperformed a well-tuned ANN baseline on engagement metrics such as video finish rate (+3.0%), app view time (+0.87%), and second-day retention (+0.036%).
