
Leveraging Large Language Models for Multimodal Search (2404.15790v1)

Published 24 Apr 2024 in cs.CV

Abstract: Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating LLMs to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.
