Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning (2406.05318v1)

Published 8 Jun 2024 in cs.CV and cs.AI

Abstract: In this paper, we present our solution for the SMART-101 Challenge of the CVPR 2024 Multi-modal Algorithmic Reasoning Task. Unlike traditional visual question answering tasks, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our model is based on two pre-trained models, dedicated to extracting features from text and images respectively. To integrate the features from the two modalities, we employ a fusion layer with an attention mechanism. We explore different text and image pre-trained models and fine-tune the integrated classifier on the SMART-101 dataset. Experimental results show that, under the puzzle-split data setting, our proposed integrated classifier achieves superior performance, verifying the effectiveness of multi-modal pre-trained representations.
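
The abstract does not spell out how the fusion layer is built, so the following is only a minimal sketch of one common way to fuse text and image features with attention. It assumes the text encoder yields a single vector (used as the query) and the image encoder yields a list of patch vectors (used as keys and values). The random features, the dimensions, the five answer options, and the names `attention_fuse` and `classify` are all hypothetical stand-ins, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(text_vec, image_patches):
    """Fuse one text vector with a list of image patch vectors.

    The text vector is the query; each image patch serves as key and value
    (scaled dot-product attention, no learned projections in this sketch).
    Returns the concatenation [text ; attended image] and the attention weights.
    """
    dim = len(text_vec)
    scale = math.sqrt(dim)
    scores = [
        sum(q * k for q, k in zip(text_vec, patch)) / scale
        for patch in image_patches
    ]
    weights = softmax(scores)
    attended = [
        sum(w * patch[i] for w, patch in zip(weights, image_patches))
        for i in range(dim)
    ]
    return text_vec + attended, weights


def classify(fused, weight_rows, biases):
    """Linear head: one logit per answer option, followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weight_rows, biases)
    ]
    return softmax(logits)


# Hypothetical stand-ins for pre-trained encoder outputs (assumed shapes).
rng = random.Random(0)
DIM, NUM_PATCHES, NUM_OPTIONS = 4, 3, 5
text_vec = [rng.uniform(-1, 1) for _ in range(DIM)]
image_patches = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
head_w = [[rng.uniform(-1, 1) for _ in range(2 * DIM)] for _ in range(NUM_OPTIONS)]
head_b = [0.0] * NUM_OPTIONS

fused, attn = attention_fuse(text_vec, image_patches)
probs = classify(fused, head_w, head_b)
```

In a real system the encoders would be pre-trained text and image models, and the query, key, and value projections as well as the classifier head would be learned during fine-tuning; the sketch only shows the data flow from two feature sets to a distribution over answer options.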


Authors (2)

Zijian Zhang
Wei Liu
