Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model (2106.15332v1)
Abstract: TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. In this challenge, we use the generative model T5 for the TextVQA task. Starting from the pre-trained T5-3B checkpoint from the HuggingFace repository, two additional pre-training tasks, masked language modeling (MLM) and relative position prediction (RPP), are designed to better align object features and scene text. During pre-training, the encoder handles the fusion of multiple modalities: question text, object text labels, scene text labels, object visual features, and scene visual features. The decoder then generates the answer text sequence step by step, trained with the standard cross-entropy loss by default. We use a large-scale scene text dataset for pre-training and then fine-tune T5-3B on the TextVQA dataset only.
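To make the encoder-decoder setup concrete, below is a minimal sketch of generative TextVQA with a HuggingFace T5 checkpoint: the question, OCR (scene text) tokens, and object labels are flattened into a single text input, and the decoder is trained with cross-entropy to emit the answer. This is an illustration only, not the authors' code; the paper additionally fuses object and scene visual features into the encoder and uses T5-3B (t5-small and the example strings here are placeholder assumptions).

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# t5-small stands in for the T5-3B checkpoint used in the paper (assumption for brevity)
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical example inputs: question, OCR tokens, and detected object labels
question = "what is the brand of this camera?"
scene_text = ["nikon", "d5300"]        # scene text tokens read by an OCR system
object_labels = ["camera", "lens"]     # object detector class labels

# Flatten the text modalities into one source sequence
# (object/scene visual features are omitted in this text-only sketch)
source = (
    "question: " + question
    + " ocr: " + " ".join(scene_text)
    + " objects: " + " ".join(object_labels)
)
target = "nikon"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Teacher-forced forward pass; the decoder is trained with cross-entropy over answer tokens
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

# Inference: the decoder generates the answer step by step
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```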
- Yixuan Qiao (10 papers)
- Hao Chen (1006 papers)
- Jun Wang (991 papers)
- Yihao Chen (40 papers)
- Xianbin Ye (6 papers)
- Ziliang Li (8 papers)
- Xianbiao Qi (38 papers)
- Peng Gao (402 papers)
- Guotong Xie (31 papers)