Duncode Characters Shorter (2307.05414v1)

Published 11 Jul 2023 in cs.CL, cs.DB, and cs.IR

Abstract: This paper investigates encoders that transform text by converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode a specific set of characters into short byte sequences, and universal encoders such as UTF-8 and UTF-16, which can encode the complete Unicode set at the cost of more space and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. Duncode is introduced as an encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It can potentially compress multiple characters of a string into a single Duncode unit using fewer bytes. Although it carries less self-synchronizing identification information, Duncode surpasses UTF-8 in space efficiency. The application is available at https://github.com/laohur/duncode. Additionally, we have developed a benchmark for evaluating character encoders across different languages. It encompasses 179 languages and can be accessed at https://github.com/laohur/wiki2txt.
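
The trade-off described in the abstract can be illustrated with Python's standard codecs. The sketch below is not Duncode and does not reproduce the paper's benchmark; the helper names `encoded_sizes` and `utf8_char_start` are hypothetical and introduced only for illustration. It compares the byte counts produced by a local encoder (GB-2312) and two universal encoders (UTF-8, UTF-16), and shows what self-synchronization means in UTF-8: from any byte offset, the start of the current character can be found by skipping continuation bytes.

```python
def encoded_sizes(text, codecs=("utf-8", "utf-16-le", "gb2312")):
    """Return the encoded byte length of `text` under each codec.

    A codec that cannot represent the text (for example a local
    encoder given characters outside its repertoire) maps to None.
    """
    sizes = {}
    for name in codecs:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


def utf8_char_start(data, offset):
    """Step back from `offset` to the first byte of the UTF-8 character.

    UTF-8 continuation bytes always match the bit pattern 10xxxxxx,
    so a decoder can resynchronize from an arbitrary position.
    """
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


if __name__ == "__main__":
    # Chinese text: the local encoder needs 2 bytes per character,
    # UTF-8 needs 3 bytes per character.
    print(encoded_sizes("你好"))
    # Thai text lies outside the GB-2312 repertoire.
    print(encoded_sizes("สวัสดี"))
    # Offset 4 falls inside the second character, which starts at byte 3.
    print(utf8_char_start("你好".encode("utf-8"), 4))
```

The local encoder is shorter for the script it targets but fails outside it, while UTF-8 covers everything at a higher byte cost per character; this is the gap the abstract says Duncode aims to close.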
