Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration (2405.17146v1)
Abstract: This study investigates whether Compressed-Language Models (CLMs), i.e., language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its ubiquity and its coverage of key concepts in compression, such as entropy coding and run-length encoding. We test whether CLMs understand the JPEG format by probing their capabilities along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Our findings demonstrate that CLMs can effectively perform these tasks. These results suggest that CLMs can understand the semantics of compressed data when operating directly on the byte streams of files produced by CFFs. The ability to operate directly on raw compressed files promises to leverage some of their remarkable characteristics, such as their ubiquity, compactness, multi-modality, and segmented nature.
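To make "operating on raw byte streams" concrete, the sketch below reads a JPEG file and maps each raw byte to a token id in [0, 255], the form in which a standard byte-level language model could consume it. This is a minimal illustration under assumed choices (the file path, the identity byte-to-token mapping, and the PyTorch tensor format are ours), not the paper's exact pipeline.

```python
# Minimal sketch: treat a JPEG file as a sequence of byte tokens
# for a byte-level language model (vocabulary size 256).
# Assumed setup, not the paper's exact preprocessing pipeline.

from pathlib import Path

import torch


def jpeg_to_tokens(path: str) -> torch.Tensor:
    """Read a JPEG file and map each raw byte to a token id in [0, 255]."""
    raw = Path(path).read_bytes()
    # Valid JPEG streams start with the SOI marker 0xFF 0xD8
    # and end with the EOI marker 0xFF 0xD9.
    assert raw[:2] == b"\xff\xd8", "not a JPEG byte stream"
    return torch.tensor(list(raw), dtype=torch.long)


tokens = jpeg_to_tokens("example.jpg")  # hypothetical input file
print(tokens.shape, tokens[:4])  # first two tokens are 255, 216 (the SOI marker)
```

Because no decoding is performed, the model sees the entropy-coded and run-length-encoded payload exactly as it sits on disk, which is what the three probing axes above (property recognition, anomaly handling, generation) are testing.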