
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models (2406.00977v2)

Published 3 Jun 2024 in cs.CV and cs.AI

Abstract: Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical model, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and state-of-the-art results across the majority of image captioning tasks. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
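
The abstract's token-budget remedy is straightforward to illustrate. The sketch below is ours, not the authors' code, and the crop count, patch count, and hidden size are assumed values chosen for illustration: mean-pooling collapses each sub-crop's patch tokens into a single summary token, so the sequence handed to the language model scales with the number of crops rather than the number of patches.

```python
import torch

# Assumed shapes for illustration: 24 sub-crops, each yielding 576 patch
# tokens of width 1024 from the vision encoder (not the paper's settings).
tokens = torch.randn(24, 576, 1024)   # (crops, patches_per_crop, dim)

# Mean-pooling aggregation over tokens: one summary token per sub-crop.
pooled = tokens.mean(dim=1)           # shape (24, 1024)
print(pooled.shape)                   # 24 visual tokens instead of 13,824
```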

Summary

  • The paper introduces Dragonfly’s multi-resolution encoding strategy that captures both abstract and detailed visual features across three scales.
  • It employs a zoom-in patch selection mechanism to focus on semantically relevant image regions, reducing noise and redundant information.
  • Experimental results show state-of-the-art accuracy on biomedical benchmarks, including 91.6% on SLAKE and a 67.1% token F1 score on Path-VQA, underscoring the model’s fine-grained visual reasoning capabilities.

An Expert Review of the Paper "Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models"

The paper "Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-LLM" introduces a Large Multimodal Model (LMM) architecture named Dragonfly, designed to enhance fine-grained visual understanding and reasoning about image regions. This paper addresses a significant limitation in existing LMMs, which often downsample high-resolution images, leading to the loss of critical visual information necessary for tasks such as visual commonsense reasoning and biomedical image analysis.

Technical Contributions

The authors highlight two key strategies in the Dragonfly architecture:

  1. Multi-Resolution Visual Encoding: This method resizes the original input image into three distinct resolutions (low, medium, and high), allowing the model to capture both abstract and detailed visual information. Each resolution is encoded into visual tokens by a shared vision encoder and then projected into the LLM's latent space.
  2. Zoom-In Patch Selection: This selective mechanism focuses on the high-resolution image patches that are semantically relevant to the query or task at hand, eliminating redundant patches and emphasizing critical regions of the image, thereby maintaining model efficiency and reducing noise. A minimal sketch of both strategies follows this list.
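
Taken together, the two strategies amount to a few tensor operations. The PyTorch sketch below is a minimal illustration under stated assumptions: the 336-pixel crop size, the 1x/2x/4x scales, the `vision_encoder` and `projector` interfaces, and the cosine-similarity relevance score are all placeholders rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

CROP = 336  # per-crop input size expected by the vision encoder (assumed)

def encode_multi_resolution(image, vision_encoder, projector):
    """Encode one image at three scales; mean-pool each sub-crop's tokens.

    `vision_encoder` (crops -> (num_crops, patches, d)) and `projector`
    (a linear map into the LLM's latent space) are assumed interfaces,
    and the 1x/2x/4x scales are illustrative, not the paper's settings.
    """
    visual_tokens = []
    for scale in (1, 2, 4):  # low, medium, high resolution
        resized = F.interpolate(image, size=(CROP * scale, CROP * scale),
                                mode="bilinear", align_corners=False)
        # Tile into non-overlapping CROP x CROP sub-crops: 1, 4, 16 crops.
        crops = (resized.unfold(2, CROP, CROP)
                        .unfold(3, CROP, CROP)
                        .permute(0, 2, 3, 1, 4, 5)
                        .reshape(-1, image.shape[1], CROP, CROP))
        tokens = vision_encoder(crops)       # (num_crops, patches, d_vision)
        pooled = tokens.mean(dim=1)          # one token per crop (mean-pool)
        visual_tokens.append(projector(pooled))
    return torch.cat(visual_tokens, dim=0)   # tokens handed to the LLM

def select_zoom_in_crops(crop_tokens, query_embedding, keep_ratio=0.25):
    """Keep only the high-resolution crops most relevant to the query.

    Cosine similarity against a query embedding is a stand-in for the
    paper's relevance score, and `keep_ratio` for its selection ratio.
    """
    scores = F.cosine_similarity(crop_tokens, query_embedding.unsqueeze(0),
                                 dim=-1)
    k = max(1, int(keep_ratio * crop_tokens.shape[0]))
    return crop_tokens[scores.topk(k).indices]
```

In the full model, the selected high-resolution crop tokens would be concatenated with the lower-resolution tokens before reaching the language model; tuning `keep_ratio` corresponds to the selection-ratio trade-off the authors flag as future work.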

Experimental Results

The authors validate the efficacy of Dragonfly through experiments on ten general-domain benchmarks and a suite of biomedical tasks. Notable numerical results include 91.6% accuracy on SLAKE (versus 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (versus 62.7% for Med-PaLM M), and state-of-the-art results on most biomedical image captioning tasks. Among models in the 7-8B parameter range, Dragonfly also scores at or near the top on general-domain benchmarks such as AI2D and ScienceQA, demonstrating strong fine-grained visual reasoning capabilities.

Implications and Future Developments

The practical implications of Dragonfly's architecture are substantial. In the biomedical domain, the model's adeptness at understanding fine-grained visual details promises advancements in diagnostic tools and medical data interpretation. On the theoretical side, the multi-resolution and selective patch strategies could influence future research on visual instruction alignment and vision-language model efficiency.

The paper also opens avenues for future AI research, particularly in improving selection strategies during vision-language pretraining. Further research could explore more sophisticated visual encoders and the application of Dragonfly's selective techniques to broader AI tasks. Additionally, optimizing the selection ratio to balance capturing fine details and maintaining image context remains an intriguing challenge.

Conclusion

In conclusion, the paper presents a well-founded and technically effective solution to overcoming the limitations of existing LMMs in processing high-resolution images. With its significant performance improvements on various benchmarks, particularly in fine-grained visual tasks, Dragonfly sets a new standard for LMM architectures. The practical applications in the biomedical field and the potential generalization to other domains underscore the model's versatility and impact.

The codebase and model are available, providing a valuable resource for the research community to build upon this innovative work.