End-to-end training of Multimodal Model and ranking Model (2404.06078v1)

Published 9 Apr 2024 in cs.IR

Abstract: Traditional recommender systems heavily rely on ID features, which often encounter challenges related to cold-start and generalization. Modeling pre-extracted content features can mitigate these issues, but is still a suboptimal solution due to the discrepancies between training tasks and model parameters. End-to-end training presents a promising solution to these problems, yet most existing works focus on retrieval models, leaving multimodal techniques under-utilized. In this paper, we propose an industrial multimodal recommendation framework named EM3: End-to-end training of Multimodal Model and ranking Model, which sufficiently utilizes multimodal information and allows personalized ranking tasks to directly train the core modules in the multimodal model to obtain more task-oriented content features, without overburdening resource consumption. First, we propose Fusion-Q-Former, which consists of transformers and a set of trainable queries, to fuse different modalities and generate fixed-length and robust multimodal embeddings. Second, in our sequential modeling of user content interest, we utilize the Low-Rank Adaptation technique to alleviate the conflict between heavy resource consumption and long sequence lengths. Third, we propose a novel Content-ID-Contrastive learning task to complement the advantages of content and ID by aligning them with each other, obtaining more task-oriented content embeddings and more generalized ID embeddings. In experiments, we implement EM3 on different ranking models in two scenarios, achieving significant improvements in both offline evaluation and online A/B tests, verifying the generalizability of our method. Ablation studies and visualization are also performed. Furthermore, we conduct experiments on two public datasets to show that our proposed method outperforms the state-of-the-art methods.
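
The abstract does not spell out the internals of Fusion-Q-Former or the Content-ID-Contrastive objective, so the PyTorch sketch below is only a minimal interpretation under stated assumptions: a Q-Former-style module in which a small set of trainable query vectors attends over concatenated image and text tokens to produce a fixed-length multimodal embedding, and an InfoNCE-style loss that aligns each item's content embedding with its ID embedding using in-batch negatives. All class names, dimensions, and hyperparameters are illustrative and not taken from the paper.

```python
# Minimal sketch (PyTorch), assuming a Q-Former-style fusion module and an
# InfoNCE-style Content-ID-Contrastive loss. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionQFormer(nn.Module):
    """Fuse a variable-length sequence of modality tokens into a fixed-length
    embedding via a small set of trainable queries (assumed Q-Former-style design)."""

    def __init__(self, dim=256, num_queries=8, num_layers=2, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, modality_tokens):
        # modality_tokens: (batch, seq_len, dim), e.g. concatenated image and text tokens
        batch = modality_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        fused = self.decoder(tgt=q, memory=modality_tokens)  # (batch, num_queries, dim)
        return fused.mean(dim=1)  # one pooling option; yields a fixed-length embedding


def content_id_contrastive_loss(content_emb, id_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning an item's content embedding with its
    ID embedding; other items in the batch serve as negatives (assumed formulation)."""
    c = F.normalize(content_emb, dim=-1)
    i = F.normalize(id_emb, dim=-1)
    logits = c @ i.t() / temperature                  # (batch, batch) similarity matrix
    labels = torch.arange(c.size(0), device=c.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    tokens = torch.randn(4, 20, 256)  # dummy image+text tokens for 4 items
    id_emb = torch.randn(4, 256)      # dummy item-ID embeddings
    content = FusionQFormer()(tokens)
    print(content.shape, content_id_contrastive_loss(content, id_emb).item())
```

In this reading, the fixed number of trainable queries is what makes the fused embedding length independent of the raw token sequence length, matching the "fixed-length and robust multimodal embeddings" described in the abstract.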
