Effectiveness of Combining LVLM Features with Item IDs

Determine whether combining video representations derived from frozen Large Video Language Models with explicit item identifier embeddings yields improved performance in micro-video sequential recommendation.

Background

The paper investigates how to integrate frozen Large Video Language Models (LVLMs) into micro-video recommender systems, focusing on feature extraction and on integration with traditional item ID embeddings. Prior works often replace item IDs with content-only representations generated by LVLMs, implicitly assuming that IDs are unnecessary.

The authors highlight uncertainty regarding whether fusing LVLM-derived video features with item IDs leads to better recommendations, motivating a systematic empirical study of replacement versus fusion strategies. This question sits at the core of balancing collaborative filtering signals (from IDs) and rich multimodal semantics (from LVLMs).
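To make the replacement-versus-fusion distinction concrete, the following is a minimal sketch and not the authors' implementation. It assumes each item has a precomputed feature vector from a frozen LVLM and a learnable ID embedding. The names `item_embedding`, `make_table`, and `linear`, the dimensions, and the two fusion operators (element-wise sum and concatenation) are illustrative assumptions; the source does not specify which fusion operators the study compares.

```python
import random


def make_table(num_rows, dim, seed):
    """Build a small random matrix as a stand-in for embeddings or weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def linear(x, weight):
    """Project vector x with a weight matrix of shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def item_embedding(item_id, id_table, video_feats, proj, mode):
    """Return the item representation under one integration strategy.

    mode is one of (hypothetical labels, chosen for this sketch):
      "id_only"     - classic ID embedding, no video content
      "replace"     - projected LVLM feature only, the ID is dropped
      "fuse_sum"    - element-wise sum of ID embedding and projected feature
      "fuse_concat" - concatenation of ID embedding and projected feature
    """
    id_vec = id_table[item_id]
    content_vec = linear(video_feats[item_id], proj)
    if mode == "id_only":
        return list(id_vec)
    if mode == "replace":
        return content_vec
    if mode == "fuse_sum":
        return [a + b for a, b in zip(id_vec, content_vec)]
    if mode == "fuse_concat":
        return list(id_vec) + content_vec
    raise ValueError(f"unknown mode: {mode}")


# Assumed toy sizes: 5 items, 4-dim ID embeddings, 8-dim frozen LVLM features.
NUM_ITEMS, ID_DIM, FEAT_DIM = 5, 4, 8
id_table = make_table(NUM_ITEMS, ID_DIM, seed=0)       # trainable in practice
video_feats = make_table(NUM_ITEMS, FEAT_DIM, seed=1)  # precomputed, kept fixed
proj = make_table(ID_DIM, FEAT_DIM, seed=2)            # trainable projection

for mode in ("id_only", "replace", "fuse_sum", "fuse_concat"):
    vec = item_embedding(2, id_table, video_feats, proj, mode)
    print(mode, len(vec))
```

In a setup like this, the LVLM features stay fixed while the ID table and the projection would be trained with the sequential recommender, so the fusion variants let collaborative signals and video semantics contribute to the same item vector. Whether that actually improves recommendation quality is the open question stated above.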

References

Second, it is still unclear whether combining LVLM-derived video representations with explicit item identifiers yields better recommendations.

Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion (2512.21863 - Sun et al., 26 Dec 2025) in Section 1 (Introduction)