Enhancing Cross-Modal Contextual Congruence for Crowdfunding Success using Knowledge-infused Learning (2402.03607v2)
Abstract: The digital landscape continually evolves with multimodality, enriching the online experience for users. Creators and marketers aim to weave subtle contextual cues from various modalities into congruent content that engages users with a harmonious message. This interplay of multimodal cues is often a crucial factor in attracting users' attention. However, this richness of multimodality presents a challenge for computational modeling, as the semantic contextual cues spanning modalities must be unified to capture the true holistic meaning of the multimodal content. This contextual meaning is critical to attracting user engagement, as it conveys the intended message of the brand or organization. In this work, we incorporate external commonsense knowledge from knowledge graphs to enhance the representation of multimodal data using compact Visual Language Models (VLMs) and predict the success of multimodal crowdfunding campaigns. Our results show that external commonsense knowledge bridges the semantic gap between text and image modalities, and that the knowledge-infused representations improve predictive performance for campaign success over baselines without knowledge. Our findings highlight the significance of contextual congruence in online multimodal content for engaging and successful crowdfunding campaigns.
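The knowledge-infusion pipeline the abstract describes can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the `embed` function stands in for real encoders (a compact VLM for the text and image, and a graph embedding such as one derived from ConceptNet for commonsense knowledge), fusion is plain concatenation, and the success predictor is a toy logistic head with random weights.

```python
import math
import random

random.seed(0)

def embed(dim: int) -> list[float]:
    # Hypothetical stand-in for a real encoder: in the paper's setting this
    # would be a compact VLM (for text/images) or a knowledge-graph embedding.
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def fuse(text_vec, image_vec, kg_vec):
    # Knowledge-infused fusion by simple concatenation: the KG vector adds
    # commonsense context that bridges the text and image modalities.
    return text_vec + image_vec + kg_vec

def predict_success(fused, weights, bias=0.0) -> float:
    # Toy logistic head: probability that the campaign succeeds.
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))

text_vec = embed(8)    # campaign-description embedding
image_vec = embed(8)   # campaign-image embedding
kg_vec = embed(8)      # commonsense-knowledge embedding

fused = fuse(text_vec, image_vec, kg_vec)
weights = embed(len(fused))  # untrained weights, for illustration only
p = predict_success(fused, weights)
print(f"predicted success probability: {p:.3f}")
```

In the actual system, the fusion weights would be learned from labeled campaign outcomes, and the knowledge vector would be retrieved for the concepts mentioned in the campaign rather than sampled at random.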