- The paper revisits early fusion techniques that integrate the candidate item with the user's action history to extract more relevant signals.
- The study leverages amortized inference to reduce latency by about 30% in transformer-based recommendation models.
- Experiments on public datasets such as MovieLens 20M and Netflix, as well as LinkedIn's internal datasets, show that the cheaper append-based fusion matches the prediction accuracy of per-step concatenation at significantly lower computational cost.
Efficient User History Modeling with Amortized Inference for Deep Learning Recommendation Models
This paper investigates how to optimize user history modeling for deep learning-based recommendation systems, with a focus on transformer encoders. The authors address the central challenge of balancing improved recommendation quality against the added computational overhead, which often necessitates significant infrastructure investment. Their proposed method leverages amortized history inference to mitigate this cost, particularly in Deep Learning Recommendation Models (DLRMs).
Key Contributions
The research revisits early fusion in user history modeling, i.e., integrating the candidate item early in the sequence evaluation so that more relevant historical signals can be extracted. Two early fusion techniques are examined: appending the candidate item to the end of the user's action history and concatenating the candidate with each history step. Of particular interest is the adaptation of the M-FALCON algorithm for amortized history inference, which substantially reduces the inference latency of transformer-based models.
Methodology
- Early Fusion Techniques: The paper compares appending and concatenating candidate items in detail. Appending adds the candidate item as one extra step at the end of the history sequence and relies on cross-attention to relate it to past actions, while concatenating fuses the candidate with each history element along the feature dimension (both variants are sketched after this list).
- Optimization via Amortized Inference: The authors argue that amortized inference significantly alleviates the computational load. Instead of processing each history-candidate pair separately, their approach processes a batch of candidates concurrently against a shared user history, saving computational resources and reducing latency (the second sketch below illustrates this pattern).
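The following is a minimal PyTorch-style sketch contrasting the two fusion variants. The dimensions, module choices, and variable names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; these are assumptions, not values from the paper.
D, H, B = 64, 20, 8                  # embedding dim, history length, batch size

history = torch.randn(B, H, D)       # embedded user action history
candidate = torch.randn(B, 1, D)     # embedded candidate item

encoder = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)

# Early fusion by appending: the candidate becomes one extra sequence step,
# and attention relates it to every history element.
appended = torch.cat([history, candidate], dim=1)      # (B, H + 1, D)
append_repr = encoder(appended)[:, -1, :]              # output at the candidate position

# Early fusion by concatenating: the candidate embedding is fused with every
# history step along the feature dimension, then projected back to width D.
tiled = candidate.expand(-1, H, -1)                    # (B, H, D)
concat = torch.cat([history, tiled], dim=-1)           # (B, H, 2 * D)
proj = nn.Linear(2 * D, D)
concat_repr = encoder(proj(concat)).mean(dim=1)        # pooled history-candidate signal
```

The append variant is what makes amortization possible, because the history itself does not change from one candidate to the next. The second sketch below shows, again under assumed shapes, how one encoded history can be reused for a whole batch of candidates instead of re-running the encoder per candidate; it illustrates the spirit of M-FALCON-style amortization rather than its exact implementation.

```python
import torch
import torch.nn as nn

# Hypothetical setup: N candidates are scored for a single user.
D, H, N = 64, 20, 500                 # embedding dim, history length, candidate count

history = torch.randn(1, H, D)        # one user's embedded action history
candidates = torch.randn(N, 1, D)     # embedded candidate items to score

encoder = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
scorer = nn.Linear(D, 1)

# Amortized inference: the expensive history encoding runs once and is shared
# by every candidate, instead of being recomputed N times.
encoded_history = encoder(history)             # a single pass over the history
shared = encoded_history.expand(N, -1, -1)     # reused for all N candidates

# Each candidate cross-attends to the shared history and is scored.
fused, _ = cross_attn(query=candidates, key=shared, value=shared)
scores = scorer(fused.squeeze(1))              # (N, 1) engagement logits
```

In this sketch the encoder cost is independent of N, so the per-candidate work shrinks to one cross-attention and scoring step, which is consistent with the latency savings the paper reports.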
Experimental Results
The experimental setup uses public datasets such as MovieLens 20M, Amazon Books, Goodbooks, and Netflix, alongside LinkedIn's internal Feed and Ads datasets. Results indicate that both methods (append and concat) achieve similar engagement-prediction accuracy, but with distinctly different attention activation patterns. The cross-attention used in the append method matches the concat method's accuracy while enabling the efficiency benefits of amortized inference.
- The experiments show that appending the candidate to the history sequence with cross-attention achieves performance comparable to concatenation, with noticeable latency reductions in real-world applications. Amortized inference reduces latency by approximately 30%, demonstrating its effectiveness in production.
Implications and Future Directions
The paper has clear implications for the design and deployment of recommendation systems that rely on user history modeling. By reducing latency and computational load, the approach supports more efficient large-scale systems without compromising prediction quality. The attention visualizations suggest that the two fusion variants learn different attention patterns, which could be explored further to maximize information propagation across sequence steps.
This work opens avenues for combining amortized inference with other architectural optimizations to further improve system performance. Future studies might investigate different attention mechanisms, such as multi-query attention (a generic sketch follows), to extend the efficiency gains reported here.
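For readers unfamiliar with multi-query attention, the sketch below shows the core idea: all query heads share a single key and value projection, which shrinks the key/value memory and per-step inference cost. It is a generic, assumed illustration of the technique, not something implemented in the paper.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention: per-head queries, one shared key/value head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)      # separate projection per head
        self.k_proj = nn.Linear(d_model, self.d_head)  # single shared key head
        self.v_proj = nn.Linear(d_model, self.d_head)  # single shared value head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).unsqueeze(1)                # one KV head, broadcast across heads
        v = self.v_proj(x).unsqueeze(1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)

# Example usage with assumed dimensions.
mqa = MultiQueryAttention(d_model=64, n_heads=4)
y = mqa(torch.randn(8, 20, 64))                        # (8, 20, 64)
```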
Conclusion
Overall, this paper contributes to the ongoing conversation around efficient implementation and deployment of recommendation systems. By demonstrating the value of amortized inference, the authors provide a pathway toward more scalable, less resource-intensive recommendation engines. This matters most for systems operating at web scale, where latency directly affects user experience and engagement metrics.