
Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation (2407.02056v1)

Published 2 Jul 2024 in cs.CL and cs.AI

Abstract: Self-consistency (SC), leveraging multiple samples from LLMs, shows significant gains on various reasoning tasks but struggles with free-form generation due to the difficulty of aggregating answers. Its variants, UCS and USC, rely on sample selection or voting mechanisms to improve output quality. These methods, however, face limitations due to their inability to fully utilize the nuanced consensus knowledge present within multiple candidate samples, often resulting in suboptimal outputs. We propose Fine-Grained Self-Consistency (FSC) to address these limitations by extracting and integrating segment-level commonalities from candidate samples, enhancing the performance of LLMs in both open-ended and reasoning tasks. Building on this, we present two additional strategies: candidate filtering, which improves overall quality by identifying highly similar candidate sets, and merging, which reduces input token requirements by combining similar samples. The effectiveness of FSC is demonstrated through extensive experiments on various tasks, including summarization, code generation, and mathematical reasoning, using GPT-3.5-turbo and GPT-4. The results show significant improvements over baseline methods, demonstrating the potential of FSC to optimize output quality by effectively synthesizing fine-grained consensus knowledge from multiple samples.
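
The abstract describes the FSC pipeline only at a high level (sample multiple candidates, optionally filter and merge them, then integrate segment-level commonalities into one output). The sketch below illustrates how such a pipeline could be wired together; it is not the paper's implementation. The `call_llm` helper, the lexical similarity metric, the thresholds, and the fusion prompt are all assumptions made for illustration.

```python
# Minimal sketch of a Fine-Grained Self-Consistency (FSC)-style pipeline.
# `call_llm`, the similarity metric, and the prompts are assumptions, not the
# paper's actual implementation or prompts.
from difflib import SequenceMatcher


def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical chat-completion wrapper (e.g., around GPT-3.5-turbo / GPT-4)."""
    raise NotImplementedError("Plug in your preferred LLM client here.")


def pairwise_similarity(a: str, b: str) -> float:
    # Cheap lexical similarity; the paper does not prescribe this metric.
    return SequenceMatcher(None, a, b).ratio()


def filter_candidates(samples: list[str], keep: int) -> list[str]:
    """Candidate filtering: keep the samples most similar to the rest,
    on the intuition that high mutual similarity signals consensus."""
    scores = [
        sum(pairwise_similarity(s, t) for t in samples if t is not s)
        for s in samples
    ]
    ranked = sorted(zip(scores, samples), key=lambda x: x[0], reverse=True)
    return [s for _, s in ranked[:keep]]


def merge_near_duplicates(samples: list[str], threshold: float = 0.9) -> list[str]:
    """Merging: drop near-duplicate samples to reduce input tokens for fusion."""
    merged: list[str] = []
    for s in samples:
        if all(pairwise_similarity(s, m) < threshold for m in merged):
            merged.append(s)
    return merged


def fine_grained_self_consistency(task_prompt: str, n_samples: int = 5) -> str:
    # 1) Sample multiple candidate responses.
    candidates = [call_llm(task_prompt, temperature=0.7) for _ in range(n_samples)]
    # 2) Optional strategies: candidate filtering, then merging of similar samples.
    candidates = merge_near_duplicates(filter_candidates(candidates, keep=n_samples))
    # 3) Fusion: integrate segment-level commonalities instead of
    #    selecting or voting for a single candidate.
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    fusion_prompt = (
        f"Task:\n{task_prompt}\n\n"
        f"Here are several candidate responses:\n{numbered}\n\n"
        "Identify the segments the candidates agree on, integrate that common "
        "content, discard inconsistent or low-quality parts, and produce one "
        "final, self-contained response."
    )
    return call_llm(fusion_prompt, temperature=0.0)
```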

Authors (8)
  1. Xinglin Wang (22 papers)
  2. Yiwei Li (107 papers)
  3. Shaoxiong Feng (32 papers)
  4. Peiwen Yuan (20 papers)
  5. Boyuan Pan (30 papers)
  6. Heda Wang (12 papers)
  7. Yao Hu (106 papers)
  8. Kan Li (54 papers)
Citations (6)
