- The paper introduces Deep Retrieval (DR), which learns a retrievable structure over a discrete latent space for efficient candidate retrieval in recommendation systems.
- It employs per-layer MLPs with beam search over a learned D-layer, K-node-per-layer structure to model user-item interactions without relying on the assumptions of conventional ANN methods.
- Experimental results on MovieLens-20M and Amazon Books show DR's competitive retrieval accuracy, and live experiments show significant improvements in user engagement metrics.
The paper introduces Deep Retrieval (DR), a novel method designed for large-scale recommendation systems, addressing the challenge of efficiently retrieving relevant candidates in sub-linear time. DR distinguishes itself from traditional approaches by directly learning a retrievable structure from user-item interaction data, such as clicks, without relying on the Euclidean space assumption inherent in Approximate Nearest Neighbor (ANN) algorithms.
DR encodes all candidate items into a discrete latent space, where the latent codes for these candidates are model parameters learned in conjunction with the other neural network parameters so as to maximize a single objective function. Once the model is trained, a beam search is executed over the learned structure to retrieve the top candidates for re-ranking.
Key components and design considerations of DR include:
- A structure model consisting of D layers, each with K nodes. Each layer uses a multi-layer perceptron (MLP) and K-class softmax to output a distribution over its K nodes.
- An item-to-path mapping $\pi: \mathcal{V} \to [K]^D$, where $\mathcal{V}$ is the set of all items. A path $c = (c_1, \ldots, c_D)$ is a forward index traversal over the columns of the $D \times K$ node matrix; each path has length $D$, with index values in $\{1, 2, \ldots, K\}$. Consequently, there are $K^D$ possible paths, each representing a cluster of items (a minimal sketch of this bookkeeping follows the list).
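As referenced above, a minimal sketch of the item-to-path bookkeeping. The sizes are hypothetical toy values, and node indices are 0-based here (the text uses $1, \ldots, K$):

```python
import random

D, K = 3, 4            # hypothetical toy depth and width; production systems use far larger K
num_items = 10         # hypothetical catalogue size

# pi maps each item id to one path: a length-D tuple of node indices (0-based here).
# Paths are initialised randomly; in DR they are learned jointly with the model parameters.
pi = {item: tuple(random.randrange(K) for _ in range(D)) for item in range(num_items)}

print(pi[0])           # e.g. (2, 0, 3) -- the path (cluster) item 0 currently belongs to
print(K ** D)          # 64 possible paths in this toy setting (K^D in general)
```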
The model learns a probability distribution over the paths given user inputs, concurrently with a mapping from items to paths. During the serving phase, beam search is employed to identify the most probable paths and the items associated with them.
The probability of a path $c$ given a user $x$, denoted $p(c \mid x, \theta)$, is constructed layer by layer:
- The initial layer takes the user embedding $\mathrm{emb}(x)$ as input and outputs a probability $p(c_1 \mid x, \theta_1)$ over the $K$ nodes, based on parameters $\theta_1$.
- Each subsequent layer $d$ concatenates the user embedding $\mathrm{emb}(x)$ with the embeddings of the preceding nodes $\mathrm{emb}(c_1), \ldots, \mathrm{emb}(c_{d-1})$ as input to an MLP, which outputs $p(c_d \mid x, c_1, \ldots, c_{d-1}, \theta_d)$ over the $K$ nodes of layer $d$, based on parameters $\theta_d$.
- The path probability is then the product of the per-layer outputs, $p(c \mid x, \theta) = \prod_{d=1}^{D} p(c_d \mid x, c_1, \ldots, c_{d-1}, \theta_d)$, where:
- $p(c \mid x, \theta)$ is the probability of path $c$ given user $x$.
- $D$ is the number of layers.
- $c_d$ is the node in layer $d$ along path $c$.
- $x$ is the user.
- $\theta_d$ are the parameters of layer $d$.
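A minimal PyTorch sketch of this layer-by-layer construction. The class name, embedding sizes, and MLP widths are illustrative assumptions, not the paper's exact architecture; node indices are 0-based:

```python
import torch
import torch.nn as nn

class DRStructureModel(nn.Module):
    """Models p(c | x, theta) as a product of per-layer K-class softmaxes."""

    def __init__(self, D: int = 3, K: int = 100, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.D, self.K = D, K
        # One embedding table per layer for the chosen node c_d (a hypothetical choice).
        self.node_emb = nn.ModuleList(nn.Embedding(K, emb_dim) for _ in range(D))
        # Layer d sees emb(x) concatenated with the d previously chosen node embeddings.
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear((d + 1) * emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, K))
            for d in range(D)
        )

    def log_prob(self, user_emb: torch.Tensor, path: torch.Tensor) -> torch.Tensor:
        """log p(c | x, theta) for a batch: user_emb is [B, emb_dim], path is [B, D] (int64)."""
        inputs, log_p = [user_emb], 0.0
        for d in range(self.D):
            logits = self.mlps[d](torch.cat(inputs, dim=-1))                  # [B, K]
            log_p = log_p + torch.log_softmax(logits, dim=-1).gather(
                -1, path[:, d:d + 1]).squeeze(-1)                             # add log p(c_d | x, c_<d)
            inputs.append(self.node_emb[d](path[:, d]))                       # emb(c_d) feeds layer d+1
        return log_p
```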
To enhance the model's capacity to express multi-aspect information, DR allows each item $y_i$ to be assigned to $J$ different paths $\{c_{i,1}, \ldots, c_{i,J}\}$. The multi-path structure objective is defined as:
$$\mathcal{Q}_{\mathrm{str}}(\theta, \pi) = \sum_{i=1}^{N} \log\left( \sum_{j=1}^{J} p\big(c_{i,j} = \pi_j(y_i) \mid x_i, \theta\big) \right),$$
where $N$ is the number of training instances $(x_i, y_i)$ and the probability of belonging to multiple paths is the sum of the probabilities of belonging to the individual paths.
To prevent the model from collapsing by allocating all items to a single path, a penalized likelihood function is introduced:
$$\mathcal{Q}_{\mathrm{pen}}(\theta, \pi) = \mathcal{Q}_{\mathrm{str}}(\theta, \pi) - \alpha \cdot \sum_{c \in [K]^D} f(|c|),$$
where $\alpha$ is the penalty factor, $|c|$ denotes the number of items allocated to path $c$, and $f$ is an increasing and convex function (e.g., $f(|c|) = |c|^4/4$).
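A sketch of the penalized objective under the notation above. It assumes `path_probs[i, j]` already holds $p(c_{i,j} \mid x_i, \theta)$ for item $y_i$'s $J$ paths (e.g. computed with the structure model sketched earlier), `path_sizes` holds the item count $|c|$ for every path, and the penalty factor value is purely illustrative:

```python
import torch

def penalized_objective(path_probs: torch.Tensor,
                        path_sizes: torch.Tensor,
                        alpha: float = 1e-7) -> torch.Tensor:
    """Q_pen = sum_i log(sum_j p(c_{i,j} | x_i, theta)) - alpha * sum_c |c|^4 / 4.

    path_probs: [N, J] probabilities of each item's J assigned paths.
    path_sizes: [num_paths] item counts |c| for the paths in use.
    """
    q_str = torch.log(path_probs.sum(dim=1)).sum()       # multi-path structure likelihood
    penalty = (path_sizes.float() ** 4 / 4).sum()        # f(|c|) = |c|^4 / 4
    return q_str - alpha * penalty
```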
In the inference stage, beam search is used to retrieve the most probable paths. At each layer, the algorithm keeps the top $B$ nodes among all successors of the $B$ nodes selected at the previous layer, and returns the top $B$ paths at the final layer.
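A minimal beam-search sketch, assuming access to a hypothetical per-layer scoring function `layer_log_probs(user, prefix)` that returns the $K$ log-probabilities of the next node given the nodes chosen so far (in the model sketch above, this is one MLP-plus-softmax call):

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(user,
                layer_log_probs: Callable[[object, Tuple[int, ...]], List[float]],
                D: int, K: int, B: int) -> List[Tuple[float, Tuple[int, ...]]]:
    """Return the top-B length-D paths (by log-probability) for this user."""
    beams = [(0.0, ())]                                   # (cumulative log-prob, path prefix)
    for _ in range(D):
        candidates = []
        for log_p, prefix in beams:
            scores = layer_log_probs(user, prefix)        # K scores for the next node
            for k in range(K):
                candidates.append((log_p + scores[k], prefix + (k,)))
        beams = heapq.nlargest(B, candidates)             # keep only the best B prefixes
    return beams
```

Each layer examines at most $B \cdot K$ successors, so retrieval cost grows with $D$, $B$, and $K$ rather than with the total number of items, which is what makes the retrieval sub-linear.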
The model is trained with an Expectation-Maximization (EM)-style algorithm that alternates between optimizing the model parameters $\theta$ for a fixed mapping $\pi$ (E-step) and updating the mapping $\pi$ to maximize the objective function given the updated parameters (M-step).
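A toy, self-contained illustration of this alternation, reusing the `DRStructureModel` sketch above. All sizes are hypothetical, the user embeddings and clicks are random, and the re-assignment step simply picks each item's $J$ most probable paths by brute-force enumeration, which is a simplification of the paper's penalized, beam-search-based update:

```python
import torch

D, K, J, emb_dim, num_items, num_users = 2, 4, 2, 8, 20, 50   # toy sizes (illustrative only)
model = DRStructureModel(D=D, K=K, emb_dim=emb_dim)
user_emb = torch.randn(num_users, emb_dim)                    # hypothetical frozen user embeddings
clicks = torch.randint(0, num_items, (num_users,))            # one clicked item per user (toy data)
pi = torch.randint(0, K, (num_items, J, D))                   # initial item -> J paths mapping
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

all_paths = torch.cartesian_prod(*[torch.arange(K)] * D)      # K^D paths; enumerable only in a toy

for step in range(10):
    # E-step analogue: fix pi, ascend the multi-path likelihood in theta (penalty omitted for brevity).
    opt.zero_grad()
    log_p = torch.stack([model.log_prob(user_emb, pi[clicks, j]) for j in range(J)], dim=1)
    loss = -torch.logsumexp(log_p, dim=1).sum()               # -sum_i log(sum_j p(c_{i,j} | x_i))
    loss.backward()
    opt.step()

    # M-step analogue: fix theta, re-assign each item to its J best-scoring paths.
    with torch.no_grad():
        for item in range(num_items):
            users = user_emb[clicks == item]
            if len(users) == 0:
                continue
            scores = torch.stack([model.log_prob(users, p.expand(len(users), D)).sum()
                                  for p in all_paths])
            pi[item] = all_paths[scores.topk(J).indices]
```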
To further improve performance, the DR model is jointly trained with a re-ranking model, specifically a softmax model with output size $V$, where $V$ is the total number of items. The final objective is the sum of the penalized likelihood and the softmax objective: $\mathcal{Q} = \mathcal{Q}_{\mathrm{pen}} + \mathcal{Q}_{\mathrm{softmax}}$.
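A sketch of the combined loss, assuming a full-softmax re-ranking head over the $V$ items; the names and sizes are illustrative, and for very large $V$ a sampled softmax would typically be substituted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, emb_dim = 1000, 32                          # hypothetical catalogue and embedding sizes
softmax_head = nn.Linear(emb_dim, V)           # re-ranking model producing logits over all V items

def joint_loss(q_pen: torch.Tensor, user_emb: torch.Tensor, clicked: torch.Tensor) -> torch.Tensor:
    """Negative of Q = Q_pen + Q_softmax, suitable for gradient-descent training."""
    # Q_softmax is the log-likelihood of the clicked items under the softmax re-ranker;
    # cross_entropy returns its negative, summed over the batch.
    q_softmax = -F.cross_entropy(softmax_head(user_emb), clicked, reduction="sum")
    return -(q_pen + q_softmax)
```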
The paper includes experiments conducted on two public datasets, MovieLens-20M and Amazon Books, to evaluate the performance of DR. The results demonstrate that DR achieves performance comparable to brute-force retrieval methods while maintaining sub-linear computational complexity. Furthermore, DR was deployed in a live production recommendation system with hundreds of millions of users and items, where it significantly outperformed a well-tuned ANN baseline in terms of engagement metrics such as video finish rate (+3.0%), app view time (+0.87%), and second-day retention (+0.036%).