- The paper introduces a fine-tuned cross-modal BERT framework that significantly improves matching between app images and search phrases.
- It builds on an LXMERT-based mid-fusion design in which text and images are encoded by separate transformers before a cross-modal encoder, reaching AUC scores of 0.96 and 0.95.
- The approach enhances app discoverability by automating creative image selection, offering practical benefits for developers and advertisers.
Automatic Creative Selection with Cross-Modal Matching: A Summary
The paper "Automatic Creative Selection with Cross-Modal Matching" addresses a pertinent task in the domain of application development and promotion, specifically the process of matching application images with relevant search phrases. This problem is tackled using a novel approach that fine-tunes a pre-trained cross-modal model. By focusing on the integration of search-text and application-image data, the work offers a significant advancement in the automatic selection of creative images for search optimization.
Key Contributions and Methodology
The authors introduce a methodology that fine-tunes a cross-modal BERT framework on a proprietary dataset of labeled search-phrase and ad-image pairs. The motivating challenge is that existing state-of-the-art models perform well on image captioning but transfer poorly to search phrases, which are short and rarely describe an image explicitly.
The model places a classification head, consisting of a linear layer, GELU activation, layer normalization, and a sigmoid output, on top of the LXMERT encoder to predict binary relevance for each image and search-phrase pair. Text is tokenized with a WordPiece tokenizer, images are represented by Faster R-CNN object-detection features, and the two streams interact through a mid-fusion cross-modal encoder, as sketched below.
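A minimal sketch of such a relevance classifier, assuming the Hugging Face `transformers` implementation of LXMERT; the checkpoint name, head dimensions, and dummy region features are illustrative assumptions rather than details taken from the paper:

```python
# Sketch of a binary image/search-phrase relevance model on top of LXMERT.
# Assumptions: the "unc-nlp/lxmert-base-uncased" checkpoint, a 768-dim pooled
# output, and 36 Faster R-CNN regions with 2048-dim features per image.
import torch
import torch.nn as nn
from transformers import LxmertModel, LxmertTokenizer

class ImagePhraseMatcher(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Mid-fusion backbone: separate language and vision encoders plus a
        # cross-modal encoder, pre-trained as LXMERT.
        self.lxmert = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
        # Classification head: linear -> GELU -> LayerNorm -> linear -> sigmoid.
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask, visual_feats, visual_pos):
        # visual_feats / visual_pos are region features and normalized boxes,
        # e.g. shapes (batch, 36, 2048) and (batch, 36, 4).
        out = self.lxmert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_feats=visual_feats,
            visual_pos=visual_pos,
        )
        # Pooled cross-modal representation of the (phrase, image) pair.
        return self.head(out.pooled_output).squeeze(-1)

# Example usage with random region features standing in for detector output.
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
enc = tokenizer("meditation sleep app", return_tensors="pt")
feats, boxes = torch.randn(1, 36, 2048), torch.rand(1, 36, 4)
model = ImagePhraseMatcher()
score = model(enc.input_ids, enc.attention_mask, feats, boxes)  # relevance in [0, 1]
```

In a real pipeline the region features and box coordinates would come from a Faster R-CNN detector, and the head would be fine-tuned with a binary cross-entropy loss on the labeled phrase-image pairs.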
Results and Evaluation
The proposed approach is evaluated against four baseline models: Zero-shot CLIP, Fine-tuned CLIP, XLM-R + ResNet, and XLM-R + CLIP image representation. It achieves AUC scores of 0.96 on the dataset labeled from the perspective of application developers and 0.95 on the dataset labeled by professional annotators, outperforming the baselines by 8%-17%.
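A minimal sketch of how such an AUC comparison can be computed, assuming per-pair relevance probabilities from a model and binary ground-truth labels; the numbers below are illustrative, not the paper's data:

```python
# Compute AUC for predicted image/phrase relevance scores against labels.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                             # human relevance labels
y_score = [0.91, 0.12, 0.80, 0.67, 0.35, 0.08, 0.88, 0.21]    # model probabilities
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```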
Notably, the improvement is attributed to the model's mid-fusion mechanism, in which independent transformers process textual and visual data separately before a cross-modal encoder combines them. This contrasts with the early-fusion baselines and indicates that mid-fusion better captures the cross-modal relationships in this context.
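A conceptual sketch of the two fusion patterns, assuming generic PyTorch transformer modules; the layer sizes, depths, and embeddings are illustrative and not the architectures used in the paper:

```python
# Early fusion vs. LXMERT-style mid fusion, in schematic form.
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randn(1, 12, d)    # phrase token embeddings
image_regions = torch.randn(1, 36, d)  # detected-object region embeddings

# Early fusion: concatenate both modalities and run a single joint encoder.
joint_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
early = joint_encoder(torch.cat([text_tokens, image_regions], dim=1))

# Mid fusion: encode each modality on its own, then let a cross-attention
# block exchange information between the two streams.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
image_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

t, v = text_encoder(text_tokens), image_encoder(image_regions)
fused, _ = cross_attn(query=t, key=v, value=v)  # text attends to image regions
```

The mid-fusion variant keeps modality-specific context intact before the streams exchange information through cross-attention, which is the property the paper credits for the improvement.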
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, it enhances the self-serve capabilities for application developers by recommending optimal images that align with search phrases, thereby potentially improving app discoverability and user engagement. Theoretically, it underscores the importance of advanced cross-modal interaction strategies in model architectures for complex multimodal tasks.
Future research directions may explore the extension of this cross-modal approach to other domains where similar image-text matching issues arise. Additionally, further work could investigate improvements in model interpretability, allowing developers to understand the rationale behind specific image recommendations. The integration of user feedback and behavioral data could also refine the precision of these models for better application in dynamic environments. Overall, the paper contributes to a more nuanced understanding of cross-modal matching techniques and sets a foundation for future advancements in automated content selection.