- The paper introduces a novel multi-modal search architecture that integrates sparse and dense embeddings to enhance search relevance for Adobe Express.
- It employs iterative fine-tuning and AB testing with domain-specific CLIP models to reduce null rates and improve click-through rates.
- The study leverages a multi-modal creative knowledge graph to align user intent with asset retrieval, providing scalable improvements in search performance.
Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
The paper introduces a multi-modal search architecture built for Adobe Express that integrates contextual sparse and dense embeddings to improve search relevance, particularly for multi-modal content such as templates. It shows how combining embeddings with contextual signals addresses the shortcomings of traditional text-based indexing in modern search systems.
Overview of the Multi-Modal Search Architecture
Traditional search systems rely predominantly on textual and metadata annotations for indexed images. Advances in multi-modal embeddings such as CLIP enable direct search over image content using text and image embeddings, but embedding-based approaches struggle to incorporate contextual features such as user locale and content recency. The paper addresses these challenges with a multi-modal search architecture optimized through iterative fine-tuning and AB tests.
The architecture leverages both dense and sparse embeddings alongside contextual features to improve the search experience. Key decisions included choosing appropriate embedding models, determining their roles in matching and ranking, and balancing dense against sparse embeddings to trade off relevance and latency. The experimental design focused on reducing null rates and improving click-through rates (CTR).
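The paper does not publish a scoring formula, but the following sketch illustrates one plausible way the described signals could be blended: a dense embedding similarity, a sparse keyword-match score, and contextual features such as locale match and recency. The weights, feature names, and recency decay are assumptions for illustration only.

```python
import numpy as np

def hybrid_score(dense_sim: float,
                 sparse_score: float,
                 locale_match: bool,
                 age_days: float,
                 w_dense: float = 0.5,
                 w_sparse: float = 0.3,
                 w_context: float = 0.2) -> float:
    """Combine dense, sparse, and contextual signals into one ranking score."""
    recency = np.exp(-age_days / 30.0)                # newer templates score higher
    context = 0.5 * float(locale_match) + 0.5 * recency
    return w_dense * dense_sim + w_sparse * sparse_score + w_context * context

# Example: a fresh, locale-matched template with strong dense similarity.
print(hybrid_score(dense_sim=0.82, sparse_score=0.4, locale_match=True, age_days=3))
```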
Data and Models
Template Data
Adobe Express templates are rich, multi-modal documents that encompass images, text, and metadata. They are enriched with inferred information such as multi-modal embeddings, user intents, and image tags, and behavioral data such as impressions, clicks, edits, and exports is used to refine search relevance.
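A minimal, hypothetical representation of such a template document is sketched below, covering the three groups of fields the paper mentions: authored content, inferred signals, and behavioral statistics. The field names are illustrative, not the actual Adobe Express schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TemplateDoc:
    template_id: str
    title: str
    locale: str
    image_urls: List[str]
    text_layers: List[str]
    metadata: Dict[str, str]
    # Inferred at indexing time
    clip_embedding: List[float] = field(default_factory=list)
    intents: List[str] = field(default_factory=list)      # e.g. "birthday invitation"
    image_tags: List[str] = field(default_factory=list)
    # Behavioral signals aggregated from logs
    impressions: int = 0
    clicks: int = 0
    edits: int = 0
    exports: int = 0

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions if self.impressions else 0.0
```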
Image-Text CLIP Embeddings
The authors employ a domain-specific variant of CLIP trained on Adobe-licensed image-text data to meet requirements such as handling short queries and captions, supporting multiple languages, and performing well on high-quality visual content. A sparsification method converts embeddings into keywords that the existing search infrastructure can index, enabling rapid matching at low latency with little loss of accuracy.
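The paper describes a sparsification method but not its exact form; the sketch below shows one common way a dense embedding can be exposed to a keyword index: keep only the top-k dimensions by magnitude and emit synthetic tokens encoding dimension index and sign. The token format and value of k are assumptions.

```python
import numpy as np

def sparsify_embedding(vec: np.ndarray, k: int = 32) -> list[str]:
    """Turn a dense embedding into pseudo-keywords for an inverted index."""
    top = np.argsort(-np.abs(vec))[:k]                 # k strongest dimensions
    return [f"dim{int(i)}_{'p' if vec[i] > 0 else 'n'}" for i in top]

# Query and document embeddings sparsified the same way can then be matched
# with ordinary term overlap inside the existing search infrastructure.
rng = np.random.default_rng(0)
doc_vec = rng.standard_normal(512)
print(sparsify_embedding(doc_vec)[:5])
```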
Multi-Modal Creative Knowledge Graph (MM-CKG)
The MM-CKG model maps content intent to discrete nodes, enhancing both recall and explainability, and is particularly effective for identifying specific user intents, scenes, and objects. It uses supervised contrastive training to align asset intents with discrete labels in a knowledge graph of over 100,000 nodes, enabling accurate and explainable asset retrieval.
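The paper states only that supervised contrastive training aligns asset intents with discrete CKG labels; the sketch below shows one standard formulation of that idea, in which each asset embedding is pulled toward the embedding of its labeled intent node and pushed away from the other nodes. The loss form and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def intent_alignment_loss(asset_emb: torch.Tensor,   # (B, D) asset embeddings
                          node_emb: torch.Tensor,    # (N, D) CKG node embeddings
                          labels: torch.Tensor,      # (B,) index of each asset's node
                          temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment of assets to their labeled knowledge-graph nodes."""
    asset_emb = F.normalize(asset_emb, dim=-1)
    node_emb = F.normalize(node_emb, dim=-1)
    logits = asset_emb @ node_emb.T / temperature     # (B, N) scaled similarities
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for real asset and node embeddings.
loss = intent_alignment_loss(torch.randn(8, 256), torch.randn(100, 256),
                             torch.randint(0, 100, (8,)))
print(loss.item())
```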
Iterative Experiments
Several experiments were conducted to iteratively enhance the Adobe Express multi-modal search system:
- Reranking with External Image-Text Model: Initial experiments used an external CLIP model for reranking, which improved CTR and export rate but struggled with diverse visual content and content freshness (a reranking sketch follows this list).
- Null and Low Recovery with Symbolic Multi-Modal Intents: By leveraging symbolic intents from the CKG, the system significantly reduced null and low recovery rates, leading to substantial improvements in CTR and user engagement for queries returning few results.
- Ranking with Domain-specific Image-Text Model: Adobe-specific CLIP models were integrated to match the platform's unique content and multilingual requirements, enabling a stable transition from the external model while reducing null rates.
- Recall with Sparse Image-Text Model: Incorporating sparse embeddings into the recall phase allowed for enhanced retrieval of relevant documents while maintaining low latency, drastically reducing null and low recovery rates.
- Long Query Recall and Ranking with Multi-Modal Model: For long queries, the MM-CKG model was employed to align query and template intents, significantly boosting CTR and reducing null rates.
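The reranking step referenced in the first experiment can be sketched as follows: candidates retrieved by the existing keyword pipeline are reordered by cosine similarity between a CLIP text embedding of the query and CLIP image embeddings of the templates. Embeddings are assumed to be precomputed; the production reranker likely combines additional signals.

```python
import numpy as np

def rerank_by_clip(query_emb: np.ndarray,
                   candidate_embs: np.ndarray,
                   candidate_ids: list[str]) -> list[str]:
    """Reorder retrieved candidates by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q                                       # cosine similarity per candidate
    order = np.argsort(-sims)                          # highest similarity first
    return [candidate_ids[i] for i in order]

# Toy usage with random vectors standing in for CLIP embeddings.
rng = np.random.default_rng(1)
print(rerank_by_clip(rng.standard_normal(512),
                     rng.standard_normal((5, 512)),
                     ["t1", "t2", "t3", "t4", "t5"]))
```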
Conclusion
By methodically fine-tuning embeddings and integrating multi-modal technologies, the research demonstrates significant improvements in the relevance and efficiency of multi-modal search. The iterative approach reduced null and low-result rates while improving engagement metrics such as CTR. The paper offers practical guidance for building scalable, robust multi-modal search systems and points toward improved search experiences across diverse multi-modal content domains.
Future Developments
Future research could focus on further refinements in embedding sparsification techniques to enhance accuracy without compromising latency. Additionally, exploring advanced re-ranking algorithms and deeper integration of user behavioral data could yield further improvements in search relevance and user satisfaction. The advancements in this paper lay a foundation for ongoing innovation in multi-modal search capabilities, potentially transforming the landscape of search technology.