CLLMs: Consistency Large Language Models
Abstract: Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference, as they break the sequential nature of the LLM decoding process and transform it into parallelizable computation. In practice, however, Jacobi decoding achieves little speedup over traditional autoregressive (AR) decoding, primarily because it seldom predicts more than one token accurately in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state on a Jacobi trajectory to the fixed point. This is accomplished by refining the target LLM to consistently predict the fixed point given any such state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4$\times$ to 3.4$\times$ improvements in generation speed while preserving generation quality on both domain-specific and open-domain benchmarks.
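To make the fixed-point view of Jacobi decoding concrete, the sketch below shows greedy Jacobi decoding of one token block: all positions are guessed at once and refined in parallel until the guess stops changing, at which point it matches what greedy autoregressive decoding would have produced. This is a minimal illustration assuming a HuggingFace-style causal LM and token-id tensors; the function and variable names are illustrative, not the authors' implementation.

```python
# Minimal sketch of greedy Jacobi decoding for a single n-token block.
# Assumes `model(input_ids)` returns an object with `.logits` of shape
# [batch, seq_len, vocab], as in HuggingFace causal LMs (an assumption).
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, n_tokens, max_iters=50):
    """Decode `n_tokens` after `prefix_ids` by fixed-point iteration."""
    # Initialize the block guess, e.g. by repeating the last prefix token.
    guess = prefix_ids[:, -1:].repeat(1, n_tokens)
    prefix_len = prefix_ids.shape[1]
    for _ in range(max_iters):
        inputs = torch.cat([prefix_ids, guess], dim=1)
        logits = model(inputs).logits
        # Position i of the new guess is the greedy token predicted from
        # prefix + guess[:i]; all n predictions come from one forward pass.
        new_guess = logits[:, prefix_len - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
    return guess
```

A CLLM is obtained by fine-tuning the model so that intermediate guesses on this trajectory already map to the fixed point, so the loop above terminates in far fewer iterations than the block length.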