TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks (2401.12869v1)

Published 23 Jan 2024 in cs.AI

Abstract: Language models (LMs) can solve tasks such as answering questions about tables or images by writing programs. However, using primitive functions often leads to verbose and error-prone programs, and higher-level functions require expert design. To enable better solutions without human labor, we ask code LMs to curate reusable high-level functions, and use them to write solutions. We present TroVE, a training-free method of inducing a verifiable and efficient toolbox of functions, by generating via using, growing, and periodically trimming the toolbox. On 11 datasets from math, table question answering, and image reasoning tasks, TroVE consistently yields simpler solutions with higher accuracy than baselines using CodeLlama and previous methods using GPT, while using 79-98% smaller toolboxes. TroVE further enables 31% faster and 13% more accurate human verification than baselines. With the same pipeline, it creates diverse functions for varied tasks and datasets, providing insights into their individual characteristics.

Summary

  • The paper introduces a training-free approach that iteratively builds a dynamic toolbox of high-level functions without extra supervision.
  • The methodology combines execution-agreement selection with periodic trimming, enabling 31% faster and 13% more accurate human verification than baselines.
  • TroVE consistently yields simpler, more accurate solutions across diverse tasks while using 79-98% smaller toolboxes, demonstrating broad applicability.

Introduction

Language models (LMs) are increasingly used for code generation, with applications ranging from answering questions about structured data to image reasoning, all built on their ability to compose programs in languages like Python. A persistent challenge remains: solutions written with only low-level, "primitive" functions tend to be verbose and error-prone, while high-level, abstract functions normally require expert design. Both options carry costs, in solution complexity and in the human effort needed to verify the generated code. The paper "TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks" proposes a method to alleviate these issues without human labor.
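
To make the contrast concrete, consider a question like "what was the average revenue in 2020?" over a small table. The snippet below is an illustrative sketch, not from the paper: the toy data and the `column_average` helper are hypothetical, standing in for the kind of high-level function the method aims to induce.

```python
# A toy table as a list of dicts (hypothetical data for illustration).
table = [
    {"year": 2019, "revenue": "100"},
    {"year": 2020, "revenue": "120"},
    {"year": 2020, "revenue": "80"},
]

# Primitive-only solution: every step spelled out, easy to get subtly wrong.
matching = [row for row in table if row["year"] == 2020]
answer = sum(float(r["revenue"]) for r in matching) / len(matching)  # 100.0

# With an induced high-level helper, the solution collapses to one call.
# `column_average` is a hypothetical example of a function a model might induce.
def column_average(table, column, where):
    rows = [r for r in table if all(r[k] == v for k, v in where.items())]
    return sum(float(r[column]) for r in rows) / len(rows)

answer = column_average(table, column="revenue", where={"year": 2020})  # 100.0
```

The one-line call is both shorter to write and faster for a human to verify, which is exactly the efficiency the paper targets.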

Methodology

TroVE is a training-free method that constructs a dynamic toolbox of high-level functions by generating, using, growing, and periodically trimming them over time. It stands apart in requiring no additional training or supervision: the toolbox is built iteratively while solving a stream of questions. The approach rests on three core components: (1) iterative use and expansion of the toolbox across examples, (2) an execution-agreement criterion for selecting among candidate solutions, and (3) periodic trimming to discard low-utility functions; a schematic sketch of the loop follows below. The method was tested on 11 datasets spanning mathematical problem solving, table question answering, and visual reasoning, where TroVE consistently attained higher accuracy and produced lower-complexity solutions with substantially smaller toolboxes than baselines using GPT models and CodeLlama.
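
The loop can be condensed into the following minimal sketch. The plumbing is assumed rather than taken from the paper: `sample_solutions` stands in for the LM sampling candidate programs (with or without new helper definitions), `execute` for the execution harness, and the usage-count trimming rule is a simplification of the paper's utility criterion; only the three-part structure mirrors the description above.

```python
from collections import Counter, defaultdict

def trove_loop(questions, sample_solutions, execute, trim_every=100, min_uses=1):
    """Schematic TroVE-style induction loop (assumed plumbing, see lead-in)."""
    toolbox = {}                   # function name -> source code
    usage = defaultdict(int)       # how often each function was selected
    answers = []

    for step, question in enumerate(questions, start=1):
        # (1) Sample candidate programs: some reuse toolbox functions,
        #     others define new helpers alongside the solution.
        candidates = sample_solutions(question, toolbox)  # [(program, new_funcs)]

        # (2) Execute every candidate; keep the answer with the highest
        #     execution agreement across samples.
        outcomes = [(execute(program, toolbox), program, new_funcs)
                    for program, new_funcs in candidates]
        best_answer, _ = Counter(o[0] for o in outcomes).most_common(1)[0]
        _, program, new_funcs = next(o for o in outcomes if o[0] == best_answer)

        # Grow the toolbox with helpers from the selected solution, and
        # record which existing functions that solution used.
        toolbox.update(new_funcs)
        for name in toolbox:
            if name in program:
                usage[name] += 1
        answers.append(best_answer)

        # (3) Periodically trim low-utility functions.
        if step % trim_every == 0:
            toolbox = {n: src for n, src in toolbox.items() if usage[n] >= min_uses}

    return answers, toolbox
```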

Results

TroVE's results are compelling. Compared with baseline methods, it significantly simplifies the verification process, making it 31% faster and 13% more accurate for human validators. It was also consistently better at generating simpler, more accurate solutions while maintaining a leaner function library. Its ability to cultivate specialized functions transfers across tasks and datasets, suggesting broad applicability. Notably, TroVE remained robust to the ordering of examples, and its periodic toolbox trimming curtailed the proliferation of redundant functions; a sketch of such a trimming step appears below.
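
For concreteness, the trimming step credited here can be pictured as a simple utility filter. The usage-count criterion below is an assumption for illustration, not the paper's exact scoring rule.

```python
def trim_toolbox(toolbox, usage_counts, min_uses=2):
    """Keep only functions used often enough in selected solutions.

    toolbox:      dict of function name -> source code
    usage_counts: dict of function name -> times used (hypothetical statistic)
    """
    kept = {name: src for name, src in toolbox.items()
            if usage_counts.get(name, 0) >= min_uses}
    removed = sorted(set(toolbox) - set(kept))
    return kept, removed

# Example: two helpers earned their keep, one was never used.
toolbox = {"column_average": "...", "parse_date": "...", "unused_helper": "..."}
kept, removed = trim_toolbox(toolbox, {"column_average": 9, "parse_date": 4})
# kept == {"column_average": "...", "parse_date": "..."}; removed == ["unused_helper"]
```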

Implications and Conclusion

The TroVE framework represents a substantial advance in automating function curation for LMs in code generation. It streamlines the creation of expressive, high-level functions without requiring human intervention, and it balances the trade-offs among model performance, solution complexity, and library size. It also promotes a more efficient human verification process, which matters as LMs increasingly become collaborators in coding workflows. The research paves a path toward LMs as autonomous programming agents, able to induce, apply, and manage sophisticated abstractions, and holds considerable promise for streamlining programmatic problem solving with AI assistance.
