Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion (2511.02113v1)

Published 3 Nov 2025 in cs.IR

Abstract: Recent advances in multimodal recommendation (MMR) have shown that incorporating rich content sources such as images and text can lead to significant gains in representation quality. However, existing methods often rely on coarse visual features and uncontrolled fusion, leading to redundant or misaligned representations. As a result, visual encoders often fail to capture salient, item-relevant semantics, limiting their contribution to multimodal fusion. From an information-theoretic perspective, effective fusion should balance the unique, shared, and redundant information across modalities, preserving complementary cues while avoiding correlation bias. This paper presents VLIF, a vision-language and information-theoretic fusion framework that enhances multimodal recommendation through two key components. (i) A VLM-based visual enrichment module generates fine-grained, title-guided descriptions to transform product images into semantically aligned representations. (ii) An information-aware fusion module, inspired by Partial Information Decomposition (PID), disentangles redundant and synergistic signals across modalities for controlled integration. Experiments on three Amazon datasets demonstrate that VLIF consistently outperforms recent multimodal baselines and substantially strengthens the contribution of visual features.
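
The abstract does not give implementation details, so the sketch below is only an assumption-laden illustration of the two components, not the paper's actual architecture: title_guided_prompt is a hypothetical prompt for the VLM enrichment step, and InfoAwareFusion is one plausible way to separate shared (redundant) and modality-unique signals before fusing them, loosely in the spirit of PID. All names, dimensions, and the gating scheme are invented for illustration.

```python
# Minimal sketch, assuming a PyTorch setup; NOT the paper's exact method.
import torch
import torch.nn as nn


def title_guided_prompt(title: str) -> str:
    # Hypothetical prompt for the VLM enrichment step; the paper's actual prompt is not shown here.
    return (f"Describe the product in the image, focusing on attributes "
            f"relevant to its title: '{title}'.")


class InfoAwareFusion(nn.Module):
    """Fuse visual and textual item embeddings while separating
    shared (redundant) and modality-unique components (illustrative only)."""

    def __init__(self, vis_dim: int, txt_dim: int, dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)   # project image features
        self.txt_proj = nn.Linear(txt_dim, dim)   # project text features
        # gate estimating how much of the two views is shared/redundant
        self.redundancy_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(3 * dim, dim)        # fuse shared + two unique parts

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        v, t = self.vis_proj(vis), self.txt_proj(txt)
        gate = self.redundancy_gate(torch.cat([v, t], dim=-1))
        shared = gate * (v + t) / 2                # redundant information, kept once
        uniq_v = v - shared                        # visual-only cue
        uniq_t = t - shared                        # textual-only cue
        return self.out(torch.cat([shared, uniq_v, uniq_t], dim=-1))


# Usage with random features standing in for VLM-enriched image/text embeddings.
fusion = InfoAwareFusion(vis_dim=512, txt_dim=768)
item_emb = fusion(torch.randn(4, 512), torch.randn(4, 768))
print(item_emb.shape)  # torch.Size([4, 256])
```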
