
Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation (2509.11524v1)

Published 15 Sep 2025 in cs.IR

Abstract: Fine-tuning LLMs for recommendation in a generative manner has delivered promising results, but encounters significant inference overhead due to autoregressive decoding in the language space. This work explores bypassing language-space decoding by directly matching candidate items with the LLM's internal thought representations in the latent space, eliminating the time-consuming autoregressive process to reduce computational costs. Towards this, we introduce Light Latent-space Decoding (L2D), an effective and efficient latent-space decoding method. L2D represents user-preferred items by using the hidden states of test sequences reflecting the LLM's internal thought, and obtains candidate item representations from the hidden states of training sequences labeled with the corresponding candidate items. It then matches the two types of representations to decode items, achieving latent-space decoding. In this way, it enables efficient decoding without altering the LLM's generative tuning paradigm, thereby preserving performance. Extensive empirical results demonstrate that L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.

Summary

  • The paper introduces L2D, a latent-space decoding framework that reduces inference latency by over 10x while maintaining recommendation accuracy.
  • It outlines a memory construction method and two aggregation strategies—global and local—to create candidate item representations from LLM hidden states.
  • Empirical results validate that L2D outperforms traditional baselines and scales effectively with larger LLMs, enabling plug-and-play recommendation.

Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation

Introduction and Motivation

The paper addresses a critical bottleneck in LLM-based generative recommendation: the substantial inference overhead incurred by autoregressive language-space decoding. While fine-tuning LLMs to generate item recommendations in natural language yields strong performance, the token-by-token generation process is computationally expensive and scales linearly with the recommendation list size. Existing acceleration techniques, such as grounding and speculative decoding, either degrade performance or remain confined to the language space. The central question is whether it is possible to bypass language-space decoding while preserving the generative training paradigm and leveraging the pretrained knowledge of LLMs.

Light Latent-space Decoding (L2D): Framework and Implementation

The proposed Light Latent-space Decoding (L2D) framework enables efficient item decoding by directly matching candidate items with the LLM's internal "thought" representations in the latent space, eliminating the need for autoregressive generation. The approach consists of three main steps:

  1. Memory Construction: After generative fine-tuning, the hidden states from the final LLM layer for each training sample (user history prompt) are paired with their ground-truth items and stored in a memory module. This step is fully pre-computable and does not affect inference latency.
  2. Candidate Item Representation Generation: For each candidate item, its latent representation is constructed by aggregating the hidden states associated with it in the memory. Two aggregation strategies are proposed:
    • Global Aggregation: Averages all hidden states for an item, yielding a comprehensive representation.
    • Local Aggregation: For a given test sample, selects the top-$M$ most similar hidden states (by L2 distance) from the memory and averages those associated with the candidate item, producing a test-specific representation.
  3. Item Decoding: At inference, the hidden state of the test sample is compared (via L2 distance) to the candidate item representations. The top-$K$ items with the highest similarity scores are recommended.

Figure 1: Latent-space decoding bypasses slow language-space decoding by matching candidate items with LLM internal 'thought' items in the latent space, preserving generative tuning and enabling efficient decoding.

Figure 2: L2D framework overview: memory construction, candidate item representation via global/local aggregation, and item decoding by similarity matching.
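
To make the memory-construction step concrete, below is a minimal sketch assuming a Hugging Face-style causal LM and tokenizer; the prompt format, data layout, and storage choices are illustrative assumptions rather than details taken from the paper.

```python
import torch

@torch.no_grad()
def build_memory(model, tokenizer, train_samples, device="cuda"):
    """Build the L2D memory: one (hidden state, ground-truth item) pair per training sample.

    `train_samples` is assumed to be an iterable of (prompt, item_id) tuples,
    where the prompt encodes the user's interaction history.
    """
    model.eval()
    hidden_states, item_ids = [], []
    for prompt, item_id in train_samples:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        out = model(**inputs, output_hidden_states=True)
        # Last hidden state of the final layer, taken at the final prompt token.
        h = out.hidden_states[-1][0, -1, :]
        hidden_states.append(h.half().cpu())
        item_ids.append(item_id)
    # Memory M = {(h_j, v_j)}; fully pre-computable, so it adds no inference latency.
    return torch.stack(hidden_states), item_ids
```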

Implementation Details

  • Hidden State Extraction: For each prompt, extract the last hidden state from the final LLM layer.
  • Memory Storage: Store $(h_j, v_j)$ pairs for all training samples, where $h_j$ is the hidden state and $v_j$ is the ground-truth item.
  • Aggregation:
    • Global: For item $v$, $\bar{h}_v = \frac{1}{|\mathcal{M}(v)|} \sum_{h_j \in \mathcal{M}(v)} h_j$.
    • Local: For test sample $t$, select the top-$M$ hidden states $h_j$ most similar to $h_t$; for item $v$, $\bar{h}_v^t = \frac{1}{|\mathcal{M}_t(v)|} \sum_{h_j \in \mathcal{M}_t(v)} h_j$.
  • Decoding: For test hidden state $h_t$ and candidate item representation $h_v$, compute $S(h_t, h_v) = \frac{1}{\|h_t - h_v\|_2}$; recommend the top-$K$ items.

This approach is model-agnostic and does not require retraining or modification of the LLM architecture, making it plug-and-play for any generatively fine-tuned LLM recommender.
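
A minimal sketch of the aggregation and decoding steps follows, using the inverse-L2 scoring defined above; the item-id handling, the default value of $M$, and the small epsilon for numerical stability are illustrative assumptions.

```python
import torch

def global_item_reps(memory_h, memory_items):
    """Global aggregation: average all stored hidden states for each item."""
    reps = {}
    for j, item in enumerate(memory_items):
        reps.setdefault(item, []).append(memory_h[j].float())
    return {v: torch.stack(hs).mean(dim=0) for v, hs in reps.items()}

def local_item_reps(h_t, memory_h, memory_items, M=256):
    """Local aggregation: average only the top-M memory entries closest (by L2
    distance) to the test hidden state h_t, grouped by their labeled item."""
    dists = torch.cdist(h_t.float().unsqueeze(0), memory_h.float()).squeeze(0)
    top_idx = torch.topk(dists, k=min(M, len(memory_items)), largest=False).indices
    reps = {}
    for j in top_idx.tolist():
        reps.setdefault(memory_items[j], []).append(memory_h[j].float())
    return {v: torch.stack(hs).mean(dim=0) for v, hs in reps.items()}

def decode_top_k(h_t, item_reps, K=50):
    """Score candidates by inverse L2 distance, S = 1 / ||h_t - h_v||_2, and
    return the top-K item ids."""
    items = list(item_reps.keys())
    reps = torch.stack([item_reps[v] for v in items])
    dists = torch.cdist(h_t.float().unsqueeze(0), reps).squeeze(0)
    scores = 1.0 / (dists + 1e-8)
    top = torch.topk(scores, k=min(K, len(items))).indices
    return [items[i] for i in top.tolist()]
```

Because only a single forward pass over the test prompt is needed to obtain $h_t$, the full recommendation list is produced by one similarity lookup rather than token-by-token generation.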

Empirical Results and Performance Analysis

Extensive experiments on Amazon CDs and Games datasets demonstrate that L2D achieves over 10x reduction in inference latency compared to language-space decoding, while maintaining or improving recommendation accuracy. The method outperforms both traditional baselines (SASRec, GRU4Rec) and LLM-based methods (AlphaRec, BIGRec, GPT4Rec, D3) across Recall@K and NDCG@K metrics.

Figure 3: Recall@50 and inference overhead of LLM-based recommenders on two datasets; L2D achieves superior performance with minimal latency.

Aggregation Strategy Trade-offs

  • Global Aggregation is robust in sparse scenarios, where items have few associated hidden states.
  • Local Aggregation excels in dense scenarios, providing more personalized candidate item representations by focusing on test-specific aspects.

Figure 4: Performance of BIGRec, L2D-G, and L2D-L on sparse and dense scenarios; L2D-L is optimal in dense, L2D-G in sparse settings.

Hyperparameter Sensitivity

The local aggregation hyperparameter $M$ (number of top similar hidden states) controls the trade-off between personalization and robustness. Increasing $M$ improves performance up to a point, after which it converges to global aggregation.

Figure 5: Impact of $M$ on Recall for L2D-L; performance saturates as $M$ increases, converging to L2D-G.

Figure 6: Impact of $M$ on NDCG for L2D-L; similar saturation behavior as Recall.

Comparison with ID-based Classifiers

Directly training a classifier head to predict item IDs from hidden states is less effective, especially for sparse items, and requires additional training. L2D, by leveraging generative training and latent matching, achieves higher accuracy and better generalization to sparse scenarios.

Space and Scalability Considerations

  • Memory Overhead: Storing hidden states for $10^9$ samples (1024-dim, float16) requires ~2TB, feasible for large-scale deployments (a quick back-of-the-envelope check appears after this list).
  • Reservoir Sampling: Retaining only 30% of training samples still yields superior performance to baselines, enabling further space optimization.
  • Model Generalizability: L2D scales effectively to larger LLMs (e.g., Llama3.1-8B), with consistent performance gains.
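
As a quick back-of-the-envelope check of the stated memory overhead (assuming 2 bytes per float16 value and the 1024-dimensional hidden states mentioned above):

```python
num_samples = 10**9        # stored hidden states
dim = 1024                 # hidden-state dimensionality
bytes_per_value = 2        # float16
total_bytes = num_samples * dim * bytes_per_value
print(f"{total_bytes / 1e12:.2f} TB")  # ~2.05 TB, consistent with the ~2TB figure
```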

Practical and Theoretical Implications

L2D fundamentally shifts the decoding paradigm for LLM-based recommendation from language-space to latent-space, enabling efficient, scalable, and high-performance inference. The approach is compatible with any generatively fine-tuned LLM and does not require architectural changes or retraining, facilitating rapid deployment in real-world systems. The latent matching mechanism preserves the benefits of generative training, including rich user understanding and preference mining, while eliminating the computational bottleneck of autoregressive decoding.

Theoretically, L2D demonstrates that the internal "thought" representations of LLMs are sufficiently expressive for recommendation tasks, and that direct latent-space matching can fully leverage pretrained knowledge. This opens avenues for further research into latent-space reasoning, memory-efficient representation learning, and plug-and-play personalization modules for LLMs.

Limitations and Future Directions

  • Memory Construction Overhead: Pre-computation of hidden states incurs time cost; more efficient memory construction methods are needed.
  • Cold-start Items: L2D cannot recommend items with zero interaction history; future work may explore interpolation or auxiliary models.
  • Memory Updating: Dynamic updating of memory with new user interactions is an open problem.
  • Integration with Preference Alignment: Combining L2D with causality-aware or difference-aware personalization approaches may further enhance performance.

Conclusion

The L2D framework provides an efficient and effective solution for LLM-based recommendation by decoding in the latent space, bypassing the limitations of autoregressive language-space generation. Empirical results validate its superiority in both accuracy and inference efficiency. The approach is theoretically sound, practically scalable, and extensible to future developments in LLM personalization and recommendation.
