
Soft Self-Consistency Improves Language Model Agents (2402.13212v2)

Published 20 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Generations from LLMs can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (SOFT-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. SOFT-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, SOFT-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that SOFT-SC can be applied to both open-source and black-box models.

Authors (4)
  1. Han Wang (420 papers)
  2. Archiki Prasad (18 papers)
  3. Elias Stengel-Eskin (49 papers)
  4. Mohit Bansal (304 papers)
Citations (4)

Summary

Enhancement of LLM Agents via Soft Self-Consistency

Introduction to Soft Self-Consistency (Soft-SC)

LLM agents tasked with interactive, multi-step operations commonly face challenges that significantly affect their performance and efficiency. Traditional methods like self-consistency (SC) address these by sampling multiple solutions and selecting a final answer via majority voting. However, SC's effectiveness drops in scenarios with many diverse valid solutions, because votes can only be tallied over identical actions. This paper introduces Soft Self-Consistency (Soft-SC), which moves beyond exact-match scoring by using a continuous scoring mechanism. The method improves both performance and efficiency, particularly in domains with sparse action spaces, and a notable result is that Soft-SC attains better performance than SC with fewer samples across a range of tests.

Methodological Innovations

Soft-SC's Core Concept

Soft-SC diverges from SC's reliance on exact matches for scoring, instead utilizing a continuous score calculated from model likelihoods. This approach enables effective action selection among sparsely distributed options, showcasing its utility in interactive tasks with multiple valid answers per step.
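The idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact implementation: it assumes each sampled action comes with the per-token log-probabilities the model assigned to it, and uses the mean log-probability as the aggregator (the paper explores likelihood-based aggregation; the candidate data here is purely hypothetical).

```python
def soft_sc_score(token_logprobs):
    """Aggregate per-token log-probabilities into one continuous score.

    Mean log-probability is one plausible aggregator; alternatives
    (e.g., sum or minimum) could be substituted here.
    """
    return sum(token_logprobs) / len(token_logprobs)

def select_action(candidates):
    """Pick the sampled action with the highest aggregated likelihood.

    `candidates` maps each action string to the log-probabilities of
    its tokens (an illustrative data shape, not the paper's API).
    """
    return max(candidates, key=lambda a: soft_sc_score(candidates[a]))

# Toy example: three distinct sampled actions. Under majority voting
# none would win (each appears once), but Soft-SC can still rank them.
candidates = {
    "ls -la":   [-0.1, -0.3],        # mean -0.2
    "ls --all": [-0.4, -0.9, -0.8],  # mean -0.7
    "dir":      [-2.0],              # mean -2.0
}
best = select_action(candidates)  # "ls -la"
```

Note how this sidesteps SC's failure mode: even when every sample is distinct, the continuous score still induces a ranking, so no votes need to coincide.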

Adaptive Sampling

Soft-SC incorporates an adaptive sampling strategy that dynamically adjusts the number of samples based on the convergence of scores towards a threshold. This refinement not only enhances sample efficiency but also contributes to superior task performance with a smaller sampling footprint.

Empirical Evaluations

The paper's experimental analysis reveals several key findings:

  • Soft-SC consistently outperforms SC and greedy decoding baselines across diverse interactive tasks, demonstrating substantial improvements in success rates with fewer samples.
  • Importantly, Soft-SC's benefits scale with increased model size, suggesting that larger models can further leverage this method for performance gains.
  • Additionally, Soft-SC is adaptable to both open-source and proprietary black-box models, broadening its applicability.

Practical and Theoretical Implications

Soft-SC presents a meaningful advancement in the field of LM agents, particularly for applications involving complex sequences of actions. This method's ability to efficiently handle diversity in valid actions and improve upon existing selection methodologies points to significant potential for enhancing interactive AI systems. Theoretically, Soft-SC's approach to scoring adds a new dimension to understanding how LLMs can be optimized for varied and nuanced tasks, promoting further research into continuous scoring mechanisms.

Future Directions and Considerations

The introduction of Soft-SC opens avenues for future exploration, including its integration with other AI optimization techniques and the extension to more diverse tasks beyond the ones tested. Additionally, considering its performance improvements and efficiency gains, subsequent studies could investigate Soft-SC's applicability in real-world scenarios, where LLM agents are tasked with navigating complex environments or performing intricate sequences of actions.

Conclusion

In summary, Soft Self-Consistency offers a robust and efficient method for improving the performance of LLM agents across a range of interactive tasks. By addressing the limitations inherent in traditional majority voting approaches, Soft-SC provides a compelling solution that enhances both the accuracy and efficiency of LLM agents, setting a new benchmark for future developments in the field.
