
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution (2405.19325v2)

Published 29 May 2024 in cs.CL

Abstract: LLMs often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.

Overview of "Nearest Neighbor Speculative Decoding"

This paper introduces Nearest Neighbor Speculative Decoding (NEST), a semi-parametric language modeling approach that combines a base LM with retrieval augmentation to generate more accurate and reliably attributed content. NEST grounds the LM's predictions in a non-parametric data store, addressing the well-documented problem of hallucination in LMs.

Key Contributions

  1. Novel Architecture: NEST extends the kNN-LM framework with a series of enhancements: a two-stage retrieval process for more efficient and accurate token prediction, a confidence-based interpolation mechanism, dynamic span selection, and a relaxed speculative decoding procedure.
  2. Two-Stage Retrieval: An initial passage retrieval step narrows the search space, followed by k-nearest neighbor (k-NN) token retrieval within the retrieved passages. This balances accuracy and efficiency, reducing generation latency while requiring less storage and computation than maintaining a full token-level key-value store.
  3. Confidence-Based Interpolation: NEST introduces Relative Retrieval Confidence (RRC) to dynamically adjust the mixture weight between the LM's parametric distribution and the retrieval-augmented distribution. This adaptability yields better performance across diverse downstream tasks.
  4. Dynamic Span Selection: Inspired by the Copy Generator (CoG) approach, NEST can select not just the next token but an entire span of tokens when retrieval confidence is sufficiently high. This mechanism enhances coherence and attribution in the generated text while also improving efficiency.
  5. Relaxed Speculative Decoding: Building on speculative decoding principles, NEST evaluates each proposed span and accepts only the prefix that clears a confidence threshold, thereby maintaining fluency and factuality in the output.
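The mechanisms above can be sketched in miniature. The toy Python sketch below mimics one NEST-style decoding step over a tiny corpus: it mixes an LM distribution with a kNN retrieval distribution, proposes the span that continues the best-matching corpus position, and accepts a prefix of that span token by token. All names, the fixed mixture weight, and the acceptance rule are illustrative stand-ins (the paper sets the weight dynamically via RRC and uses an approximate speculative acceptance test), not the authors' implementation.

```python
# Toy vocabulary and corpus standing in for the LM's vocabulary and the
# non-parametric data store.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]
CORPUS = ["the", "cat", "sat", "on", "the", "mat", "."]

def lm_distribution(prefix):
    """Stand-in for the base LM's next-token distribution (uniform here)."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def knn_distribution(prefix, k=3):
    """Stand-in for token-level kNN retrieval: find corpus positions whose
    token matches the last prefix token, and return next-token
    probabilities plus the best-matching position (for span extension)."""
    matches = [i for i in range(len(CORPUS) - 1)
               if prefix and CORPUS[i] == prefix[-1]]
    if not matches:
        return {}, None
    counts = {}
    for i in matches[:k]:
        nxt = CORPUS[i + 1]
        counts[nxt] = counts.get(nxt, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}, matches[0]

def nest_step(prefix, alpha=0.5, span_len=3, accept_threshold=0.3):
    """One NEST-style step: mix the LM and kNN distributions, propose the
    span continuing the retrieved position, and accept a prefix of it."""
    p_lm = lm_distribution(prefix)
    p_knn, pos = knn_distribution(prefix)
    # Semi-parametric mixture; the fixed alpha stands in for the paper's
    # Relative Retrieval Confidence, which sets this weight dynamically.
    keys = list(p_lm) + [t for t in p_knn if t not in p_lm]
    mix = {t: (1 - alpha) * p_lm.get(t, 0.0) + alpha * p_knn.get(t, 0.0)
           for t in keys}
    if pos is None:
        return [max(mix, key=mix.get)]
    # Propose the retrieved span, then keep only the prefix whose tokens
    # clear the mixture threshold (a loose stand-in for the paper's
    # relaxed speculative acceptance test).
    span = CORPUS[pos + 1 : pos + 1 + span_len]
    accepted = []
    for tok in span:
        if mix.get(tok, 0.0) >= accept_threshold:
            accepted.append(tok)
        else:
            break
    return accepted or [max(mix, key=mix.get)]

print(nest_step(["the"]))  # → ['cat']
```

Note how span acceptance and single-token fallback live in one step: when the retrieved continuation agrees with the mixture distribution, several tokens can be emitted per model call, which is where the speedup comes from.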

Experimental Validation

The authors conducted extensive evaluations using various free-form generation tasks, such as question answering and text completion, employing different-sized Llama-2-Chat models under zero-shot settings. The paper highlights several key results:

  • Performance Gains: With Llama-2-Chat 70B as the base model, NEST achieves a 42.3% improvement in ROUGE-1 on WikiText-103 and a 21.6% improvement in FActScore on the Biography dataset.
  • Efficiency: The speculative decoding technique accelerates the generation process, achieving a 1.8x speedup in inference time for long-form generation without compromising textual quality or factual attribution.
  • Attribution: The incorporation of spans from verified sources ensures that a large proportion of the generated text can be directly attributed back to the corpus, enhancing the credibility of the output.
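The reported 1.8x speedup is consistent with the usual speculative-decoding counting argument: if each step proposes a span of gamma tokens and each token is accepted independently with probability a, the expected number of tokens emitted per verification step is (1 - a^(gamma+1)) / (1 - a). The sketch below is a simplified version of that standard analysis, not the paper's own model (NEST's relaxed acceptance differs in detail), and the numbers are illustrative.

```python
def expected_tokens_per_step(gamma, a):
    """Expected tokens emitted per verification step when a gamma-token
    span is proposed and each token is accepted i.i.d. with probability
    a (0 <= a < 1); at least one token is always produced."""
    return (1 - a ** (gamma + 1)) / (1 - a)

# Illustrative: 5-token spans with a 40% per-token acceptance rate.
print(round(expected_tokens_per_step(5, 0.4), 3))  # → 1.66
```

If each verification step costs about one forward pass of the large model, a value above 1 translates roughly into that factor of wall-clock speedup.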

Implications and Future Directions

The implications of NEST are multifaceted, spanning both practical applications and theoretical advances:

  • Practical Implications: NEST shows clear potential in applications that demand factual accuracy and reliable attribution, such as automated journalism, content summarization in legal and medical domains, and educational tools. Sourcing content directly from a non-parametric store mitigates hallucination and lets users trace the provenance of generated information.
  • Theoretical Implications: The two-stage retrieval mechanism and confidence-based interpolation open new avenues for efficiently blending parametric and non-parametric methods. Future research might explore more sophisticated interpolation techniques and the scalability of such hybrid models to larger corpora and more complex retrieval tasks.
  • Efficiency: By proposing multi-token spans and verifying them in parallel, NEST balances the retrieval and generation stages, an optimization that matters for real-time applications.

Conclusion

NEST represents a significant stride in semi-parametric language modeling, addressing key limitations of existing retrieval-augmented methods. Its validated gains in both generation quality and inference efficiency underscore its potential for broad applicability and further work on enhancing LM capabilities. This work sets the stage for more robust and accountable text generation systems that remain reliable and factual across practical contexts.

Authors (7)
  1. Minghan Li
  2. Xilun Chen
  3. Ari Holtzman
  4. Beidi Chen
  5. Jimmy Lin
  6. Wen-tau Yih
  7. Xi Victoria Lin