Dodo: Dynamic Contextual Compression for Decoder-only LMs (2310.02409v2)
Abstract: Transformer-based LMs are inefficient in long contexts. We propose Dodo, a solution for context compression. Instead of one vector per token as in a standard transformer model, Dodo represents text with a dynamic number of hidden states at each layer, reducing the cost of self-attention to a fraction of the typical time and space. Moreover, off-the-shelf models such as LLaMA can be adapted to Dodo by efficient parameter tuning methods such as LoRA. In use, Dodo can act as either an autoregressive LM or a context compressor for downstream tasks. We demonstrate through experiments in language modeling, question answering, and summarization that Dodo retains its capabilities in these tasks while drastically reducing the overhead during decoding. For example, in the autoencoding task, Dodo shrinks the context at a 20x compression ratio with a BLEU score of 98% for reconstruction, achieving nearly lossless encoding.
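To make the compression step concrete, below is a minimal sketch (not the authors' implementation) of selecting a dynamic subset of per-token hidden states to stand in for the full context, in the spirit of the Nugget-style scoring the paper builds on. The `StateCompressor` class, its `score_proj` scorer, and the fixed `ratio` of 20 are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class StateCompressor(nn.Module):
    """Toy sketch: score per-token hidden states and keep roughly 1/ratio of them."""

    def __init__(self, hidden_dim: int, ratio: int = 20):
        super().__init__()
        self.score_proj = nn.Linear(hidden_dim, 1)  # learned importance scorer (assumed form)
        self.ratio = ratio                          # e.g. 20 -> ~20x compression

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (seq_len, hidden_dim) states from one layer of the decoder
        scores = self.score_proj(hidden).squeeze(-1)      # (seq_len,) importance scores
        k = max(1, hidden.size(0) // self.ratio)          # compressed budget for this sequence
        idx = scores.topk(k).indices.sort().values        # pick top-k positions, keep original order
        return hidden[idx]                                # (k, hidden_dim) compressed context

# Usage: compress 1,000 hidden states of width 4096 down to ~50.
states = torch.randn(1000, 4096)
compressed = StateCompressor(4096)(states)
print(compressed.shape)  # torch.Size([50, 4096])
```

In Dodo itself the selection is learned jointly with the LM and applied to the hidden states at each layer; the sketch only illustrates how a much smaller, dynamically sized set of states can replace the full token sequence during decoding.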
- CoLT5: Faster Long-Range Transformers with Conditional Computation, 2023.
- Claude 2, 2023. URL https://www.anthropic.com/index/claude-2.
- Longformer: The Long-Document Transformer, 2020.
- Unlimiformer: Long-Range Transformers with Unlimited Length Input, 2023.
- Recurrent Memory Transformer. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
- Adapting Language Models to Compress Contexts, 2023.
- Rethinking Attention with Performers. In International Conference on Learning Representations (ICLR), 2021.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. In Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
- LongNet: Scaling Transformers to 1,000,000,000 Tokens, 2023.
- PyTorch Lightning, 2019.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020.
- In-context Autoencoder for Context Compression in a Large Language Model, 2023.
- REALM: Retrieval-Augmented Language Model Pre-Training. In International Conference on Machine Learning (ICML), 2020.
- LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022.
- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Annual Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021.
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In International Conference on Machine Learning (ICML), 2020.
- ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths. In International Conference on Learning Representations (ICLR), 2023.
- Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations (ICLR), 2020.
- Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
- Datasets: A Community Library for Natural Language Processing. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
- How Long Can Open-Source LLMs Truly Promise on Context Length?, 2023. URL https://lmsys.org/blog/2023-06-29-longchat.
- Lost in the Middle: How Language Models Use Long Contexts, 2023.
- SGDR: Stochastic Gradient Descent with Warm Restarts. In International Conference on Learning Representations (ICLR), 2017.
- PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods. https://github.com/huggingface/peft, 2022.
- Pointer Sentinel Mixture Models, 2016.
- Learning to Compress Prompts with Gist Tokens, 2023.
- BLEU: A Method for Automatic Evaluation of Machine Translation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2022.
- Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
- Nugget: Neural Agglomerative Embeddings of Text. In International Conference on Machine Learning (ICML), 2023.
- The NLP Task Effectiveness of Long-Range Transformers. In Annual Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023.
- Compressive Transformers for Long-Range Sequence Modelling. In International Conference on Learning Representations (ICLR), 2020.
- SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
- Anatomy of Catastrophic Forgetting: Hidden Representation and Task Semantics. In International Conference on Learning Representations (ICLR), 2021.
- DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
- Get to the Point: Summarization with Pointer-Generator Networks. In Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
- RoFormer: Enhanced Transformer with Rotary Position Embedding, 2022.
- Efficient Transformers: A Survey. ACM Computing Surveys, 55(6):1–28, 2022.
- LLaMA: Open and Efficient Foundation Language Models, 2023.
- Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023.
- Attention Is All You Need. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
- Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
- XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Big Bird: Transformers for Longer Sequences. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
- GLM-130B: An Open Bilingual Pre-trained Model, 2022.
- Why Does ChatGPT Fall Short in Providing Truthful Answers?, 2023.