Soft Self-Consistency Improves Language Model Agents (2402.13212v2)
Abstract: Generations from LLMs can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (SOFT-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. SOFT-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, SOFT-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that SOFT-SC can be applied to both open-source and black-box models.
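To make the selection step concrete, the sketch below contrasts the two scoring rules the abstract describes: SC's majority vote over exact-match answers versus a SOFT-SC-style continuous score aggregated from per-token model log-probabilities. This is a minimal illustration, not the paper's implementation; the function names, the mean-log-probability aggregation, and the dummy data are all assumptions for the example.

```python
from collections import Counter
from typing import List, Tuple

# A sampled candidate: the decoded action string plus the per-token
# log-probabilities the model assigned while generating it.
Candidate = Tuple[str, List[float]]


def majority_vote_select(candidates: List[Candidate]) -> str:
    """SC-style selection: pick the most frequent exact-match action.

    When many distinct actions are valid, votes spread thin and the
    "winner" may hold only a single vote (an arbitrary tie-break).
    """
    counts = Counter(action for action, _ in candidates)
    return counts.most_common(1)[0][0]


def soft_sc_select(candidates: List[Candidate]) -> str:
    """SOFT-SC-style selection: score each sample with a continuous
    likelihood aggregate (here, mean token log-probability, an assumed
    aggregation) and return the highest-scoring action.
    """
    def score(token_logprobs: List[float]) -> float:
        return sum(token_logprobs) / max(len(token_logprobs), 1)

    return max(candidates, key=lambda c: score(c[1]))[0]


if __name__ == "__main__":
    # Dummy samples: three distinct but plausible bash commands, so
    # majority voting has no real signal, while the likelihood score
    # still ranks them.
    samples = [
        ("ls -la /tmp", [-0.2, -0.4, -0.1]),
        ("ls /tmp -la", [-0.9, -1.2, -0.7]),
        ("ls -la /tmp/", [-0.3, -0.5, -0.6]),
    ]
    print("SC pick:     ", majority_vote_select(samples))
    print("SOFT-SC pick:", soft_sc_select(samples))
```

In the all-tied case above, majority voting has nothing to prefer, whereas the continuous score lets selection fall back on the model's own confidence in each sample, which is what allows SOFT-SC to choose an answer even when samples are sparsely distributed over many distinct valid actions.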
Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal