NF4 Isn't Information Theoretically Optimal (and that's Good) (2306.06965v2)
Published 12 Jun 2023 in cs.LG
Abstract: This note shares some simple calculations and experiments related to absmax-based blockwise quantization, as used in Dettmers et al., 2023. Their proposed NF4 data type is said to be information-theoretically optimal for representing normally distributed weights. I show that this can't quite be the case, as the distribution of the values to be quantized depends on the block size. I attempt to apply these insights to derive an improved code based on minimizing the expected L1 reconstruction error, rather than the quantile-based method. This leads to improved performance for larger quantization block sizes, while both codes perform similarly at smaller block sizes.
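As a rough illustration of the setup the abstract describes, the Python/NumPy sketch below (not the author's code) quantizes normally distributed weights blockwise with a per-block absmax scale, and fits a 16-level code by minimizing empirical L1 reconstruction error via 1-D k-medians on the block-normalized values. The function names, the quantile-based initialization, and the codebook are assumptions for illustration only; the exact NF4 code values from Dettmers et al., 2023 are not reproduced here.

```python
# Minimal sketch (assumed names and initialization, not the paper's code) of
# absmax blockwise 4-bit quantization and of fitting a code by minimizing
# empirical L1 reconstruction error.
import numpy as np

def blockwise_quantize(w, code, block_size=64):
    """Quantize a flat weight vector blockwise: scale each block by its absmax,
    then snap every scaled value to the nearest codebook entry in [-1, 1]."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)          # per-block absmax
    scaled = w / scales                                     # values now in [-1, 1]
    idx = np.abs(scaled[..., None] - code).argmin(axis=-1)  # nearest code point
    return idx, scales

def blockwise_dequantize(idx, scales, code):
    """Reconstruct weights from code indices and per-block scales."""
    return (code[idx] * scales).reshape(-1)

def fit_l1_code(samples, n_levels=16, iters=50):
    """Fit a code that minimizes empirical L1 reconstruction error on `samples`
    (1-D k-medians: the L1-optimal representative of each cell is its median)."""
    code = np.quantile(samples, np.linspace(0.0, 1.0, n_levels))  # quantile init
    for _ in range(iters):
        assign = np.abs(samples[:, None] - code).argmin(axis=1)
        for k in range(n_levels):
            members = samples[assign == k]
            if members.size:
                code[k] = np.median(members)
    return np.sort(code)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block_size = 64

    # Normally distributed "weights", as assumed for NF4.
    w = rng.normal(size=(4096 * block_size,))

    # Samples of the quantity actually quantized: weights divided by the
    # per-block absmax -- a distribution that depends on the block size.
    blocks = w.reshape(-1, block_size)
    scaled = (blocks / np.abs(blocks).max(axis=1, keepdims=True)).reshape(-1)

    code = fit_l1_code(scaled, n_levels=16)
    idx, scales = blockwise_quantize(w, code, block_size)
    w_hat = blockwise_dequantize(idx, scales, code)
    print("mean |w - w_hat|:", np.abs(w - w_hat).mean())
```

The k-medians refinement here is only a simple empirical stand-in for the L1-minimizing code the note derives; the key point it mirrors is that the code is fit to the block-normalized values, whose distribution changes with the block size, rather than to fixed quantiles of the standard normal.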
- JAX: composable transformations of Python+NumPy programs.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- Flax: A neural network library and ecosystem for JAX.
- Haiku: Sonnet for JAX.
- Pointer sentinel mixture models.
- The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534.
- Language models are unsupervised multitask learners.
- Compressive transformers for long-range sequence modelling. arXiv preprint.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.