A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation (2402.13587v2)
Abstract: In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods utilize visual and textual encoders to encode the image and keywords and employ an LLM-based decoder to generate the product description. However, the generated description is often inaccurate and generic, since products in the same category tend to share similar copywriting, and optimizing the overall framework on large-scale data drives models to concentrate on common words while ignoring product-specific features. To alleviate this issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference and utilizes the in-context learning capability of LLMs to produce the description. During training, we keep the visual encoder and LLM frozen, optimizing only the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation capability of LLMs and substantially increases description diversity. To assess the effectiveness of ModICT across various LLM scales and types, we collect data from three distinct product categories within the E-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for enhancing the automatic generation of product descriptions in a wide range of applications. Code is at: https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning
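To make the described setup concrete, below is a minimal PyTorch sketch of the multimodal in-context tuning idea from the abstract: a frozen vision encoder and a frozen LLM, with only a projector and a set of dynamic prompt vectors left trainable, and an input sequence built from a retrieved similar product (image, keywords, description) followed by the target product's image and keywords. The class name, dimensions, and the placeholder encoder/LLM modules are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of ModICT-style multimodal in-context tuning.
# The vision encoder and LLM below are small random stand-ins (placeholders),
# NOT the pretrained models used in the paper.
import torch
import torch.nn as nn


class ModICTSketch(nn.Module):
    def __init__(self, vision_dim=512, llm_dim=768, n_prompt_tokens=8, vocab_size=32000):
        super().__init__()
        # Frozen stand-ins for the pretrained vision encoder and LLM.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)   # placeholder visual encoder
        self.word_embed = nn.Embedding(vocab_size, llm_dim)       # placeholder LLM embeddings
        self.llm = nn.TransformerEncoder(                         # placeholder frozen LLM body
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
        )
        for module in (self.vision_encoder, self.word_embed, self.llm):
            for p in module.parameters():
                p.requires_grad = False

        # Trainable parts: a projector mapping image features into the LLM embedding
        # space, plus a small bank of dynamic prompt vectors.
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.dynamic_prompt = nn.Parameter(torch.randn(n_prompt_tokens, llm_dim))

    def build_in_context_sequence(self, ref_image, ref_keyword_ids, ref_desc_ids,
                                  tgt_image, tgt_keyword_ids):
        """Concatenate [reference image | reference keywords | reference description |
        target image | target keywords | dynamic prompt] into one embedding sequence."""
        ref_img_emb = self.projector(self.vision_encoder(ref_image))
        tgt_img_emb = self.projector(self.vision_encoder(tgt_image))
        batch = ref_image.size(0)
        parts = [
            ref_img_emb,
            self.word_embed(ref_keyword_ids),
            self.word_embed(ref_desc_ids),
            tgt_img_emb,
            self.word_embed(tgt_keyword_ids),
            self.dynamic_prompt.unsqueeze(0).expand(batch, -1, -1),
        ]
        return torch.cat(parts, dim=1)

    def forward(self, *inputs):
        seq = self.build_in_context_sequence(*inputs)
        # A real model would decode the target description from this context;
        # here we just return the contextualized states.
        return self.llm(seq)


if __name__ == "__main__":
    model = ModICTSketch()
    B = 2
    ref_img, tgt_img = torch.randn(B, 1, 512), torch.randn(B, 1, 512)
    ref_kw = torch.randint(0, 32000, (B, 6))
    ref_desc = torch.randint(0, 32000, (B, 30))
    tgt_kw = torch.randint(0, 32000, (B, 6))
    out = model(ref_img, ref_kw, ref_desc, tgt_img, tgt_kw)
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(out.shape, trainable)  # only the projector and dynamic prompt receive gradients
```

The point of the sketch is the parameter split: because the vision encoder and LLM stay frozen, only the lightweight projector and dynamic prompt are updated, which is how the approach preserves the LLM's generation ability while grounding it in the retrieved in-context reference.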
Authors: Yunxin Li, Baotian Hu, Wenhan Luo, Lin Ma, Yuxin Ding, Min Zhang