
De-fine: Decomposing and Refining Visual Programs with Auto-Feedback (2311.12890v3)

Published 21 Nov 2023 in cs.CV

Abstract: Visual programming is a modular, generalizable paradigm that composes different modules and Python operators to solve diverse vision-language tasks. Unlike end-to-end models that require task-specific data, it performs visual processing and reasoning without supervision. However, current visual programming methods generate programs in a single pass per task and lack the ability to evaluate and optimize them based on feedback, which limits their effectiveness on complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach improves logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more robust programs. Moreover, viewing each feedback module as an independent agent opens fresh prospects for agent research.
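
The sketch below illustrates the decompose-then-refine control flow the abstract describes: decompose a task into subtasks, generate a Python program, execute it, and iteratively refine it from auto-feedback. It is a minimal sketch, not the paper's actual implementation; the `llm` callable, the `execute` harness, and all prompt strings are illustrative assumptions supplied by the reader.

```python
# Minimal sketch of a De-fine-style decompose-and-refine loop.
# ASSUMPTIONS: `llm` is any text-in/text-out callable (e.g. an LLM API
# wrapper) and `execute` runs a program and returns (ok, feedback).
# Prompts and parsing are illustrative, not the paper's prompts.
from typing import Callable

LLM = Callable[[str], str]

def decompose(llm: LLM, task: str) -> list[str]:
    """Ask the model to split a complex visual task into simpler subtasks.

    Assumes the reply is a numbered list like "1. Locate the dog".
    """
    reply = llm(f"Decompose this visual task into numbered subtasks:\n{task}")
    return [line.split(".", 1)[1].strip()
            for line in reply.splitlines() if "." in line]

def generate_program(llm: LLM, subtasks: list[str]) -> str:
    """Compose a Python visual program from the subtask descriptions."""
    steps = "\n".join(f"# step: {s}" for s in subtasks)
    return llm(f"Write a Python visual program implementing these steps:\n{steps}")

def refine(llm: LLM, program: str, feedback: str) -> str:
    """Rewrite the program guided by auto-feedback from execution."""
    return llm("Revise the program below to address the feedback.\n"
               f"Program:\n{program}\nFeedback:\n{feedback}")

def define_loop(llm: LLM, task: str,
                execute: Callable[[str], tuple[bool, str]],
                max_rounds: int = 3) -> str:
    """Decompose once, then iterate generate -> execute -> feedback -> refine.

    Each feedback source (syntax errors, runtime traces, output checks)
    can be viewed as an independent agent, as the abstract suggests.
    """
    program = generate_program(llm, decompose(llm, task))
    for _ in range(max_rounds):
        ok, feedback = execute(program)
        if ok:
            break
        program = refine(llm, program, feedback)
    return program
```

Because the loop only touches `llm` through a plain string interface, the sketch is model-agnostic in the sense the abstract claims: different models can serve the decomposition, generation, and refinement calls.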
