Visually Descriptive Language Model for Vector Graphics Reasoning (2404.06479v4)
Abstract: Despite significant advancements, large multimodal models (LMMs) still struggle to bridge the gap between low-level visual perception -- focusing on shapes, sizes, and layouts -- and high-level language reasoning, such as semantics and logic. This limitation is evident in tasks that require precise visual perception, such as comparing geometric properties or solving visual reasoning problems. To study this failure mode, we focus on vector graphics -- images composed of 2D objects and shapes -- which are prevalent in LMM-based tasks across web, design, and OS environments. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To capture fine visual details, we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes. However, SVGs are not readily interpretable by LMMs in a zero-shot manner. To tackle this, we propose the Visually Descriptive Language Model (VDLM), which introduces a Primal Visual Description (PVD) as an intermediate textual representation. PVD translates SVGs into a text-based abstraction consisting of primitive attributes (e.g., shape, position, measurement) and their corresponding values. PVD can be learned from task-agnostic synthesized data and represents visual primitives that are universal across vector graphics. This abstraction is more structured than raw SVG code, allowing foundation models to interpret it directly for zero-shot generalization. Empirical results show that, without any human-annotated data, VDLM significantly improves state-of-the-art LMMs such as GPT-4o on various multimodal perception and reasoning tasks. Extensive analyses of VDLM show improved interpretability, owing to its disentangled perception and reasoning stages. We also demonstrate a positive correlation between PVD quality and downstream task performance. Project page: https://mikewangwzhl.github.io/VDLM/
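To make the PVD idea concrete, here is a minimal, self-contained Python sketch of the kind of SVG-to-text translation the abstraction performs: each vector primitive is mapped to a record of its primitive attributes (shape, position, measurement) and their values. The schema below (field names such as `shape`, `center`, `radius`) and the hand-written parsing rules are illustrative assumptions, not the paper's actual PVD format; in VDLM itself this mapping is learned from task-agnostic synthesized data rather than hard-coded.

```python
import json
import xml.etree.ElementTree as ET

# A toy vector graphic: one circle and one line segment.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="40" cy="40" r="15" fill="blue"/>
  <line x1="10" y1="90" x2="90" y2="90" stroke="black"/>
</svg>"""

NS = "{http://www.w3.org/2000/svg}"  # SVG XML namespace prefix

def svg_to_pvd(svg_text: str) -> list[dict]:
    """Map each SVG primitive to a text record of shape type,
    position, and measurements (hypothetical PVD-like schema)."""
    root = ET.fromstring(svg_text)
    records = []
    for elem in root:
        if elem.tag == NS + "circle":
            records.append({
                "shape": "circle",
                "center": [float(elem.get("cx")), float(elem.get("cy"))],
                "radius": float(elem.get("r")),
                "color": elem.get("fill"),
            })
        elif elem.tag == NS + "line":
            records.append({
                "shape": "line_segment",
                "endpoints": [
                    [float(elem.get("x1")), float(elem.get("y1"))],
                    [float(elem.get("x2")), float(elem.get("y2"))],
                ],
                "color": elem.get("stroke"),
            })
    return records

print(json.dumps(svg_to_pvd(SVG), indent=2))
```

The printed JSON (e.g., {"shape": "circle", "center": [40.0, 40.0], "radius": 15.0, ...}) is plain text, so a text-only foundation model can consume it directly to answer perception questions such as "which shape is larger?" without ever seeing pixels -- the disentangled perception-then-reasoning setup the abstract describes.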
Authors: Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji