Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models (2312.03052v2)

Published 5 Dec 2023 in cs.CV and cs.CL

Abstract: Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.

Introduction

In the field of vision-language models (VLMs), advancements have been significant, yet the complexity of certain visual tasks still presents a challenge. These tasks require not just object identification but also spatial understanding and the retrieval of contextual knowledge. Although LLMs have shown aptitude in generating executable code to tackle sophisticated tasks, the programs they produce are prone to errors and inefficiencies, often missing crucial steps or including unnecessary ones. To overcome these obstacles and minimize computational costs, a novel framework called Visual Program Distillation (VPD) has been proposed.

Program Generation and Verification

VPD begins by generating multiple candidate programs using an LLM to solve a given task. These programs are then executed using specialized vision modules. A verification process follows to identify the correct program. For tasks with available labeled data, programs are filtered based on their output's correctness. Execution traces of the programs are recorded, detailing the usage of various vision tools during the process.
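
To make the pipeline concrete, here is a minimal sketch of the sampling-and-verification loop, assuming a ViperGPT-style setup in which the LLM emits executable Python and the vision tools sit behind callables. The names (`sample_program`, `execute`, `ProgramResult`) are illustrative, not the paper's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProgramResult:
    program: str        # the LLM-generated program
    answer: str         # final answer produced by executing the program
    trace: list[str]    # per-step record of vision-tool calls and their outputs


def find_verified_program(
    question: str,
    label: Optional[str],
    sample_program: Callable[[str], str],     # LLM that writes a program for the question
    execute: Callable[[str], ProgramResult],  # interpreter that runs it with the vision modules
    num_samples: int = 5,
) -> Optional[ProgramResult]:
    """Sample several candidate programs and keep the first whose executed
    answer matches the label. For unlabeled data this simplified sketch only
    checks that execution succeeds; the paper applies stricter filtering."""
    for _ in range(num_samples):
        program = sample_program(question)
        try:
            result = execute(program)
        except Exception:
            continue  # discard candidates that crash during execution
        if label is None or result.answer.strip().lower() == label.strip().lower():
            return result
    return None  # no candidate survived verification
```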

Distilling Step-by-Step

After identifying the correct program for a task, the next phase involves translating the program's execution trace into a natural language description of the reasoning steps, often referred to as a chain-of-thought (CoT). This CoT is then distilled into the VLM, with the aim of imbuing it with the same programmatic reasoning capabilities. This distillation process is crucial for improving the VLM’s abilities to count, decipher spatial relationships, and perform compositional reasoning.
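
Continuing the sketch above (and reusing the hypothetical `ProgramResult`), the trace-to-CoT conversion and the resulting instruction-tuning record might look like the following. In the paper this rewriting is performed by an LLM; a plain template is used here only to show the shape of the data.

```python
def trace_to_cot(result: ProgramResult) -> str:
    """Turn an execution trace into a numbered chain-of-thought rationale."""
    steps = [f"Step {i + 1}: {step}" for i, step in enumerate(result.trace)]
    steps.append(f"Therefore, the answer is {result.answer}.")
    return "\n".join(steps)


def build_training_example(question: str, result: ProgramResult) -> dict[str, str]:
    """Pack the question and rationale into an instruction-tuning record for
    the VLM. Field names are illustrative, not the paper's exact schema."""
    return {
        "prompt": f"Answer the question and explain your reasoning.\nQuestion: {question}",
        "target": trace_to_cot(result),
    }
```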

Empirical Evidence of Efficiency

The VPD-trained model, referred to as PaLI-X-VPD, demonstrates state-of-the-art performance across several complex vision tasks, surpassing previous VLMs. It achieves this while also providing human-readable reasoning steps. Human annotators confirm that VPD enhances the factuality and consistency of model responses. Separate experiments in content moderation indicate the versatility of VPD, showcasing its utility in real-world applications even with limited data availability. The framework's ability to verify generated programs and distill their reasoning into a single VLM makes it a practical route to models that are both more accurate and more interpretable.

Authors
  1. Yushi Hu
  2. Otilia Stretcu
  3. Chun-Ta Lu
  4. Krishnamurthy Viswanathan
  5. Kenji Hata
  6. Enming Luo
  7. Ranjay Krishna
  8. Ariel Fuxman