IISAN: Efficiently Adapting Multimodal Representation for Sequential Recommendation with Decoupled PEFT (2404.02059v3)
Abstract: Multimodal foundation models are transformative for sequential recommender systems, thanks to their powerful representation learning capabilities. While Parameter-Efficient Fine-Tuning (PEFT) is commonly used to adapt foundation models for recommendation tasks, most research prioritizes parameter efficiency and often overlooks critical factors such as GPU memory usage and training speed. Addressing this gap, our paper introduces IISAN (Intra- and Inter-modal Side Adapted Network for Multimodal Representation), a simple plug-and-play architecture that uses a Decoupled PEFT structure and exploits both intra- and inter-modal adaptation. IISAN matches the performance of full fine-tuning (FFT) and state-of-the-art PEFT. More importantly, it significantly reduces GPU memory usage, from 47 GB to just 3 GB for multimodal sequential recommendation tasks, and cuts training time per epoch from 443 s to 22 s relative to FFT. This is also a notable improvement over Adapter and LoRA, which require 37-39 GB of GPU memory and 350-380 seconds per training epoch. Furthermore, we propose a new composite efficiency metric, TPME (Training-time, Parameter, and GPU Memory Efficiency), to counter the prevalent misconception that "parameter efficiency represents overall efficiency". TPME provides more comprehensive insight into practical efficiency comparisons between methods. In addition, we provide an accessible efficiency analysis of all PEFT and FFT approaches, which demonstrates the superiority of IISAN. We release our code and other materials at https://github.com/GAIR-Lab/IISAN.
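The abstract describes, but does not detail, the decoupled design, so the following is a minimal PyTorch sketch of the general idea rather than the authors' implementation: both pretrained towers stay frozen, and only small intra- and inter-modal side blocks are trained on the towers' hidden states. The class names, layer sizes, mean-based fusion, and four-layer depth are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SideBlock(nn.Module):
    """Residual bottleneck block; the layer sizes here are illustrative."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

class DecoupledSideNetwork(nn.Module):
    """Trainable side network kept outside the frozen text/image towers.

    It consumes per-layer hidden states that the frozen backbones emit
    under torch.no_grad(), so gradients flow only through these small
    blocks and no backbone activations are cached for backprop; that is
    the source of the memory and speed savings the paper reports.
    """
    def __init__(self, dim: int = 768, n_layers: int = 4):
        super().__init__()
        self.intra_text = nn.ModuleList([SideBlock(dim) for _ in range(n_layers)])
        self.intra_image = nn.ModuleList([SideBlock(dim) for _ in range(n_layers)])
        self.inter = nn.ModuleList([SideBlock(dim) for _ in range(n_layers)])

    def forward(self, text_states, image_states):
        fused = torch.zeros_like(text_states[0])
        for t_blk, v_blk, x_blk, ht, hv in zip(
                self.intra_text, self.intra_image, self.inter,
                text_states, image_states):
            # Intra-modal adaptation per tower, then inter-modal fusion.
            fused = fused + x_blk((t_blk(ht) + v_blk(hv)) / 2)
        return fused / len(text_states)

# Usage with stand-in hidden states (real ones would come from frozen
# BERT/ViT encoders, computed once per batch under torch.no_grad()):
with torch.no_grad():
    text_states = [torch.randn(8, 768) for _ in range(4)]
    image_states = [torch.randn(8, 768) for _ in range(4)]
item_repr = DecoupledSideNetwork()(text_states, image_states)  # (8, 768)
```

Decoupling the trainable blocks from the backbones is what distinguishes this family of methods from embedded PEFT such as Adapter or LoRA, whose trainable modules sit inside the backbone and therefore force full activation caching during backpropagation.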
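Similarly, the abstract names TPME but does not give its formula. The sketch below assumes a simple weighted average of resource ratios relative to an FFT baseline (lower is better); treat the weights, the ratio-based normalization, and the parameter counts as placeholders for the paper's actual definition.

```python
def tpme(time_s, params_m, mem_gb, baseline, weights=(1/3, 1/3, 1/3)):
    """Hypothetical TPME-style composite score (lower is better).

    Each resource is expressed as a ratio to a full fine-tuning (FFT)
    baseline and combined with user-chosen weights; the paper's exact
    normalization and weighting may differ.
    """
    ratios = (time_s / baseline[0], params_m / baseline[1], mem_gb / baseline[2])
    return sum(w * r for w, r in zip(weights, ratios))

# Time and memory numbers come from the abstract; the parameter counts
# below are placeholders, since the abstract does not report them.
fft = (443.0, 100.0, 47.0)                    # (seconds/epoch, params in M, GB)
print(tpme(22.0, 2.0, 3.0, baseline=fft))     # IISAN-like profile, ~0.04
print(tpme(365.0, 2.0, 38.0, baseline=fft))   # Adapter/LoRA-like profile, ~0.55
```

Even with placeholder parameter counts, the example shows why a composite metric matters: Adapter and LoRA look excellent on parameters alone, yet their time and memory ratios keep the overall score close to FFT's.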