SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation (2405.16552v1)

Published 26 May 2024 in cs.CL and cs.AI

Abstract: Existing LLMs generate text through unidirectional autoregressive decoding to respond to user queries. Because these methods select tokens in a simple sequential manner, they easily fall into suboptimal choices at uncertain tokens, referred to in this work as chaotic points. Many chaotic points exist in text generated by LLMs, and they often significantly degrade the quality of subsequently generated tokens, interfering with the models' generation. This paper proposes Self-Evaluation Decoding (SED), a decoding method that enhances model generation. Analogous to the human decision-making process, SED integrates speculation and evaluation steps into decoding, allowing LLMs to make more careful decisions and thus optimize token selection at chaotic points. Experimental results across various tasks and different LLMs demonstrate SED's effectiveness.
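The abstract's speculate-then-evaluate loop can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the chaotic-point test (top-2 probability gap), the threshold value, the candidate count `k`, and the external `evaluate` scorer are all assumptions made for the example.

```python
def is_chaotic(probs, gap_threshold=0.2):
    """Treat a position as a 'chaotic point' when the probability gap
    between the top two candidate tokens is small, i.e. the model is
    uncertain which token to pick (assumed heuristic, not the paper's)."""
    top = sorted(probs.values(), reverse=True)
    return len(top) > 1 and (top[0] - top[1]) < gap_threshold

def sed_step(probs, evaluate, k=3, gap_threshold=0.2):
    """One decoding step: ordinary greedy pick at confident positions;
    at chaotic points, speculate the top-k candidates and let an
    evaluation function score each one, then keep the best-scored."""
    if not is_chaotic(probs, gap_threshold):
        return max(probs, key=probs.get)  # confident: plain greedy decoding
    candidates = sorted(probs, key=probs.get, reverse=True)[:k]
    return max(candidates, key=evaluate)  # uncertain: evaluation-guided pick

# Toy usage: the model is nearly torn between "cat" and "dog",
# so the (stand-in) evaluator breaks the tie instead of greedy argmax.
probs = {"cat": 0.41, "dog": 0.39, "the": 0.20}
score = {"cat": 0.2, "dog": 0.9, "the": 0.1}.get  # hypothetical evaluator
print(sed_step(probs, score))  # → dog
```

In the paper, the evaluation step is performed by the LLM itself (self-evaluation over speculated continuations); the dictionary-backed `score` above merely stands in for that component.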

Authors (9)
  1. Ziqin Luo
  2. Haixia Han
  3. Haokun Zhao
  4. Guochao Jiang
  5. Chengyu Du
  6. Tingyun Li
  7. Jiaqing Liang
  8. Deqing Yang
  9. Yanghua Xiao
Citations (2)