Introduction
The burgeoning field of Visual Document Understanding (VDU) calls for robust models capable of handling a diversity of document-related tasks. Recent research has therefore concentrated on improving models' ability to interpret the intricate relationships between textual and visual objects within documents. Despite this focus, building a universal model that transfers knowledge effectively across document types, formats, and tasks remains a significant challenge: most visual instruction tuning datasets and models are limited in scope, focusing primarily on scene images or lacking the breadth to cover a wide array of VDU tasks. To bridge this gap, the approach presented here pairs human-written instructions with visual documents to drive model generalization to previously unseen VDU tasks.
InstructDoc Dataset and Model Advancement
The paper introduces InstructDoc, a pioneering dataset designed to foster zero-shot generalization in VDU tasks through instructions. InstructDoc covers 12 tasks drawn from 30 diverse datasets, all formulated within a uniform instruction schema. This schema demands a complex set of competencies from models, such as grasping document layouts and interpreting visual representations of text and objects. Building on this dataset, the authors develop a model termed InstructDr, which connects document images, image encoders, and LLMs via a trainable bridging module called the Document-former. This module transforms documents into representations digestible by LLMs, enhancing zero-shot performance across VDU tasks when instructions are supplied.
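To make the idea of a uniform instruction schema concrete, the snippet below sketches what a single training instance might look like once a task is recast into an instruction format. The field names and values are illustrative assumptions for exposition, not the dataset's actual keys.

```python
# A hypothetical instance under a unified instruction schema. Keys such as
# "instruction", "document_images", "ocr_words", "query", and "answer" are
# illustrative assumptions, not the dataset's real field names.
example_instance = {
    "task": "document_vqa",
    "instruction": "Read the document image and answer the question using its text and layout.",
    "document_images": ["invoice_page_1.png"],            # one or more page images
    "ocr_words": [("Total", [412, 890, 470, 910]),        # (word, bounding box) pairs
                  ("$128.00", [480, 890, 560, 910])],
    "query": "What is the total amount due?",
    "answer": "$128.00",
}
```

Framing heterogeneous tasks this way lets a single model consume question answering, information extraction, and classification examples through one interface, which is what enables instruction-driven zero-shot transfer.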
Architectural Innovations and Empirical Evaluations
InstructDr, through its Document-former, maps visual and textual document features into a space interpretable by an LLM. Experimental results show that InstructDr significantly surpasses the zero-shot performance of existing multimodal LLMs and outperforms ChatGPT on numerous VDU tasks when aided by instructions, underscoring the efficacy of instructions in improving model generalization and robustness. The architecture also supports multi-page document comprehension by encoding multiple document images in parallel, thereby enabling reasoning across pages.
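The paper's code is not reproduced here, so the following is a minimal, hypothetical sketch of a Q-Former-style bridging module of the kind described: learnable query tokens cross-attend to frozen image-encoder features, the outputs are projected into the LLM's embedding space, and multiple pages are encoded independently and concatenated. All class names, dimensions, and layer counts are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DocumentFormerSketch(nn.Module):
    """Illustrative bridging module: query tokens read page features and
    produce soft prompts in the LLM's embedding space."""
    def __init__(self, vis_dim=1024, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that "read" the document features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)    # encoder features -> hidden
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        # Cross-attention: queries attend to the projected visual features.
        self.cross_attn = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)    # hidden -> LLM embedding space

    def forward(self, page_features):
        # page_features: list of tensors, one per page, each (batch, patches, vis_dim)
        page_tokens = []
        for feats in page_features:                       # encode pages independently
            mem = self.vis_proj(feats)
            q = self.queries.expand(feats.size(0), -1, -1)
            out = self.cross_attn(tgt=q, memory=mem)      # (batch, num_queries, hidden)
            page_tokens.append(self.llm_proj(out))
        # Concatenate per-page soft prompts before feeding them to the LLM.
        return torch.cat(page_tokens, dim=1)              # (batch, pages*num_queries, llm_dim)

# Example: two pages, 196 patch features each from a frozen image encoder.
pages = [torch.randn(1, 196, 1024) for _ in range(2)]
soft_prompt = DocumentFormerSketch()(pages)
print(soft_prompt.shape)  # torch.Size([1, 64, 4096])
```

Because each page is encoded with the same shared queries and then concatenated, the LLM receives a fixed number of tokens per page regardless of page resolution, which is one plausible way to keep multi-page inputs tractable.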
Critical Reflections and Future Prospects
Despite the merits of InstructDr, the research acknowledges limitations, including a dependency on OCR quality and an inability to account for correlations among multiple document-text pairs. Furthermore, enriching the dataset through automated instruction generation and augmentation remains unexplored. In summary, the advent of InstructDoc and the development of InstructDr mark a significant stride toward general-purpose VDU models that comprehend and execute tasks guided by natural language instructions. This research is a valuable contribution to the evolution of document AI, arguably setting a new benchmark for subsequent work in the discipline.