Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding (2408.05676v2)

Published 11 Aug 2024 in cs.IR

Abstract: The past few years have witnessed a growing interest in LLM-based recommender systems (RSs), although their industrial deployment remains in a preliminary stage. Most existing deployments leverage LLMs offline as feature enhancers, generating augmented knowledge for downstream tasks. However, in recommendation scenarios with numerous users and items, even offline knowledge generation with LLMs demands significant time and computational resources. This inefficiency arises from the autoregressive nature of LLMs. A promising solution is speculative decoding, a Draft-Then-Verify approach that increases the number of tokens generated per decoding step. In this work, we first identify recommendation knowledge generation as a highly fitting use case for retrieval-based speculative decoding. Then, we discern its two characteristics: (1) the vast number of items and users in RSs leads to retrieval inefficiency, and (2) RSs exhibit high diversity tolerance for LLM-generated text. Building on these insights, we introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER), which features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens. LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test on a large-scale advertising scenario with lossless downstream recommendation performance. Our code is available at https://github.com/YunjiaXi/LASER
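
The Draft-Then-Verify idea in the abstract can be illustrated with a small toy sketch. This is not the LASER implementation: the function names, the n-gram lookup, the toy "target model", and the choice to model relaxed verification as "accept a draft token if it is among the model's top-k candidates" are all assumptions made for illustration. The abstract itself only states that Relaxed Verification raises the acceptance rate of draft tokens, and the Customized Retrieval Pool is not modeled here beyond a plain list of token sequences.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy target model: it "wants" to produce REFERENCE, and for a
# few positions it also ranks a near-synonym as its second choice.
REFERENCE = "the user likes sci fi movies and space operas".split()
SYNONYMS = {"likes": "enjoys", "movies": "films"}


def toy_ranked_next(context: Sequence[str]) -> List[str]:
    """Return candidate next tokens, best first (depends only on position)."""
    i = len(context)
    best = REFERENCE[i] if i < len(REFERENCE) else "<eos>"
    ranked = [best]
    if best in SYNONYMS:
        ranked.append(SYNONYMS[best])
    return ranked


def retrieve_draft(context: Sequence[str], pool: Sequence[Sequence[str]],
                   max_draft: int = 4, ngram: int = 2) -> List[str]:
    """Draft step: find the last `ngram` context tokens in the pool and
    return the tokens that follow the first match."""
    if len(context) < ngram:
        return []
    key = list(context[-ngram:])
    for doc in pool:
        for i in range(len(doc) - ngram):
            if list(doc[i:i + ngram]) == key:
                return list(doc[i + ngram:i + ngram + max_draft])
    return []


def speculative_decode(prompt: Sequence[str],
                       ranked_next: Callable[[Sequence[str]], List[str]],
                       pool: Sequence[Sequence[str]],
                       max_new: int, top_k: int = 1) -> Tuple[List[str], int]:
    """Return (generated tokens, number of decoding steps).

    top_k=1 is strict verification (draft must equal the model's best token).
    top_k>1 is the assumed relaxed variant (draft may be any top-k candidate).
    A real system verifies the whole draft in one forward pass; the inner
    loop here only simulates that check token by token.
    """
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        steps += 1
        accepted: List[str] = []
        for tok in retrieve_draft(out, pool):
            if tok in ranked_next(out + accepted)[:top_k]:
                accepted.append(tok)
            else:
                break  # first rejected token ends the accepted prefix
        # The target model always contributes one token of its own per step.
        accepted.append(ranked_next(out + accepted)[0])
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new], steps


# Hypothetical retrieval pool of previously generated token sequences.
POOL = [
    "the user enjoys sci fi films and space operas".split(),
    "the item is a book".split(),
]
PROMPT = ["the", "user"]

strict_text, strict_steps = speculative_decode(PROMPT, toy_ranked_next, POOL, max_new=7, top_k=1)
relaxed_text, relaxed_steps = speculative_decode(PROMPT, toy_ranked_next, POOL, max_new=7, top_k=2)
print("strict :", " ".join(strict_text), "| steps:", strict_steps)
print("relaxed:", " ".join(relaxed_text), "| steps:", relaxed_steps)
```

In this toy setup the strict run reproduces the model's own preferred text but rejects most drafts, because the pool phrases things slightly differently. The relaxed run accepts those near-synonym drafts and finishes in fewer decoding steps, at the cost of producing different (but still model-plausible) wording. That trade-off is what the abstract's "high diversity tolerance" observation is meant to justify; the numbers here say nothing about the speedups reported in the paper.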
