
UniT: Multimodal Multitask Learning with a Unified Transformer (2102.10772v3)

Published 22 Feb 2021 in cs.CV and cs.CL

Abstract: We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.

Overview of UniT: Multimodal Multitask Learning with a Unified Transformer

The paper "UniT: Multimodal Multitask Learning with a Unified Transformer" proposes an innovative approach to unify the handling of various tasks across different domains using a single model architecture, termed UniT. UniT is positioned as a significant stride towards achieving general intelligence by employing a unified model to simultaneously manage tasks traditionally considered distinct. This model leverages a transformer-based encoder-decoder architecture to address tasks in vision, language, and multimodal reasoning domains.

Model Architecture and Training

UniT comprises separate transformer encoders for each input modality, namely visual and textual inputs, and a single transformer decoder shared across both unimodal and multimodal tasks. Unlike conventional multi-task setups that fine-tune task-specific copies of a model, and therefore grow in parameter count with every added task, UniT shares the same parameters across tasks and domains. Task-specific output heads on top of the shared decoder adapt the predictions to each task's output format, and the encoders, decoder, and heads together form an end-to-end trainable framework that learns all tasks jointly across multiple datasets.
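To make the architecture concrete, the snippet below is a minimal, illustrative PyTorch rendering of the idea: modality-specific encoders, one decoder shared across tasks and queried with task-specific embeddings, and per-task output heads. Layer sizes, module choices, and names such as `UniTSketch` are assumptions for illustration, not the authors' actual MMF implementation.

```python
# Minimal sketch of a UniT-style model (illustrative, not the official code).
import torch
import torch.nn as nn


class UniTSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes_per_task=None):
        super().__init__()
        num_classes_per_task = num_classes_per_task or {0: 2, 1: 2}
        # Modality-specific encoders (stand-ins for the image backbone +
        # transformer encoder and the BERT-style text encoder).
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        # A single decoder shared across all tasks, driven by per-task queries.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.task_queries = nn.ParameterDict({
            str(t): nn.Parameter(torch.randn(num_queries, d_model))
            for t in num_classes_per_task})
        # Task-specific output heads (detection, VQA, GLUE classifiers, ...);
        # plain linear layers here for brevity.
        self.heads = nn.ModuleDict({
            str(t): nn.Linear(d_model, c) for t, c in num_classes_per_task.items()})

    def forward(self, task_id, image_feats=None, text_feats=None):
        # Encode whichever modalities the task provides, then concatenate.
        encoded = []
        if image_feats is not None:
            encoded.append(self.image_encoder(image_feats))
        if text_feats is not None:
            encoded.append(self.text_encoder(text_feats))
        memory = torch.cat(encoded, dim=1)
        queries = self.task_queries[str(task_id)].unsqueeze(0).expand(memory.size(0), -1, -1)
        decoded = self.decoder(queries, memory)
        return self.heads[str(task_id)](decoded)
```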

Experimental Evaluation and Results

The model is evaluated across seven distinct tasks using eight different datasets, which include COCO for object detection, Visual Genome for object and attribute detection, SNLI-VE for visual entailment, and VQAv2 for visual question answering, alongside language understanding tasks from the GLUE benchmark (QNLI, QQP, MNLI, SST-2). This impressive breadth of evaluation underscores the versatility of the UniT architecture. The results demonstrate that UniT achieves competitive performance across most tasks compared to specialized models such as DETR for vision tasks and VisualBERT for multimodal tasks, showcasing the potential efficacy of a unified model approach.
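One way such joint training over many datasets can proceed is to sample one task per iteration and backpropagate that task's loss through the shared model, as in the hedged sketch below. The uniform task sampling, the `train_jointly` helper, and the batch/loss interfaces are illustrative assumptions; the paper's actual sampling schedule and loss definitions are in the MMF codebase.

```python
# Hedged sketch of joint multi-task training with per-iteration task sampling.
import random


def train_jointly(model, optimizer, task_loaders, task_losses, num_iterations=10000):
    """task_loaders / task_losses: dicts mapping task_id -> DataLoader / loss fn."""
    iterators = {t: iter(dl) for t, dl in task_loaders.items()}
    task_ids = list(task_loaders.keys())
    for _ in range(num_iterations):
        # Pick a task for this step; could also sample proportionally to dataset size.
        task_id = random.choice(task_ids)
        try:
            batch = next(iterators[task_id])
        except StopIteration:
            iterators[task_id] = iter(task_loaders[task_id])
            batch = next(iterators[task_id])
        outputs = model(task_id,
                        image_feats=batch.get("image"),
                        text_feats=batch.get("text"))
        loss = task_losses[task_id](outputs, batch["target"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```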

Key Insights and Implications

One of the fundamental insights from the experimental results is that while UniT achieves strong performance across a wide array of tasks, it particularly excels in multimodal tasks like visual question answering and visual entailment. This suggests that the shared architecture is beneficial in scenarios where tasks require integrating information from multiple modalities. Moreover, the paper highlights how UniT's ability to use fewer parameters provides a practical advantage in terms of scalability and memory efficiency, which is typically a concern with large models.

However, on pure vision or pure language tasks, the model's performance, while respectable, does not surpass that of traditional specialized models. This points to the trade-off inherent in a unified model: parameter efficiency and cross-task generality improve, but performance on individual tasks may fall short of the best domain-specific architectures.

Future Directions

The introduction of UniT lays a foundation for further research into unified transformer models that can handle more diverse and complex sets of tasks with greater efficiency. Future work could refine the balance between task compatibility and per-task performance, explore hierarchical learning techniques, and investigate automated task sampling strategies to close the remaining gaps to specialized models on certain domain-specific tasks.

In conclusion, UniT presents a forward-thinking approach to handling multiple tasks across domains with transformers, providing a framework that, with continued refinement and exploration, could robustly contribute to advancing the capabilities and applications of artificial general intelligence systems.

Authors (2)
  1. Ronghang Hu (26 papers)
  2. Amanpreet Singh (36 papers)
Citations (270)