Modular Visual Question Answering via Code Generation
The paper "Modular Visual Question Answering via Code Generation" presents CodeVQA, a framework that tackles Visual Question Answering (VQA) by generating code over modular visual primitives. This contrasts with traditional modular approaches such as differentiable neural module networks, which require substantial retraining whenever modules are added or modified. The work instead composes pre-trained large language models (LLMs) and visual models without any additional training, yielding a system that answers complex visual questions by generating and executing programs.
Methodology
The framework uses a pre-trained code LLM, Codex, to generate Python programs. Each program orchestrates pre-defined visual primitives, backed by pre-trained vision-language models (VLMs), to process and analyze the image or images. The generated program decomposes the question into visual sub-tasks and can combine their results with arithmetic or conditional logic, effectively recasting VQA as program synthesis.
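As a rough illustration of this synthesis step, the sketch below assembles a few-shot prompt from exemplar question/program pairs and appends the new question; the exemplar and the `build_prompt` helper are assumptions for exposition, not the paper's actual prompt.

```python
# A minimal sketch of the program-synthesis prompt: exemplar (question, program)
# pairs followed by the new question. The exemplar below is illustrative, not
# taken from the paper's prompt.

EXEMPLARS = [
    (
        "Are there both cats and dogs in the picture?",
        "has_cats = query(image, 'Are there any cats?') == 'yes'\n"
        "has_dogs = query(image, 'Are there any dogs?') == 'yes'\n"
        "answer = 'yes' if has_cats and has_dogs else 'no'",
    ),
]

def build_prompt(question: str) -> str:
    """Concatenate exemplar programs, then append the new question for completion."""
    parts = []
    for q, program in EXEMPLARS:
        parts.append(f"# Question: {q}\n{program}\n")
    parts.append(f"# Question: {question}\n")
    return "\n".join(parts)

# The prompt would be sent to a code LLM (Codex in the paper); the returned
# completion is itself a Python program.
print(build_prompt("How many pillows are on the couch?"))
```

The completion returned for such a prompt is then executed against the image(s) to produce the final answer.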
The researchers introduce three visual primitives (a code sketch follows this list):
- query(image, question): Answers a question about an image by captioning question-relevant image patches and passing the captions to an LLM for question answering.
- get_pos(image, text): Localizes the object described by the text and returns its position within the image.
- find_matching_image(images, text): Selects, from a set of images, the one most relevant to the given text using image-text similarity scores.
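To make the interface concrete, the following sketch stubs out the three primitives and shows the kind of program the code LLM might emit for a spatial question; the stub bodies and the `vlm_*` helpers are assumptions for exposition, not the paper's implementations.

```python
# Illustrative stubs for the three primitives. The `vlm_*` helpers are
# hypothetical placeholders for the underlying vision-language models.

def query(image, question: str) -> str:
    """Answer a natural-language question about the image."""
    return vlm_answer(image, question)  # hypothetical captioning + LLM helper

def get_pos(image, text: str) -> tuple:
    """Return an (x, y) position for the object described by `text`."""
    return vlm_localize(image, text)  # hypothetical localization helper

def find_matching_image(images, text: str):
    """Return the image most relevant to `text` by image-text similarity."""
    return max(images, key=lambda img: vlm_similarity(img, text))  # hypothetical scorer

# For "Is the dog to the left of the bench?", the code LLM might emit:
#
#   dog_x, _ = get_pos(image, "dog")
#   bench_x, _ = get_pos(image, "bench")
#   answer = "yes" if dog_x < bench_x else "no"
```

Because the primitives are ordinary Python functions, a stronger captioning or localization model can be swapped in without retraining the rest of the system.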
Results
On the COVR and GQA benchmarks, the framework improves accuracy by about 3% and roughly 2%, respectively, over comparable few-shot baselines. The gains are largest on questions involving spatial relationships or multiple conditions, underscoring the value of modularity for compositional reasoning.
Implications
The implications of this research are twofold. Practically, the modular system is versatile and easily adapted to a wide range of VQA settings, benefiting from the latest advances in vision and language models without any model retraining. Theoretically, it demonstrates the value of program synthesis as a way of composing the capabilities of pre-trained models for multi-modal reasoning.
Future Directions
This work paves the way for future research on more sophisticated primitives that handle broader classes of reasoning or incorporate external libraries for additional functionality. Extending the framework to non-English settings is another prospective direction. Finally, addressing the computational cost of deploying LLMs in real-world scenarios will be crucial to realizing the full potential of this approach.