Fractal Patterns May Illuminate the Success of Next-Token Prediction (2402.01825v2)
Abstract: We study the fractal structure of language, aiming to provide a precise formalism for quantifying properties that may have been previously suspected but not formally shown. We establish that language is: (1) self-similar, exhibiting complexities at all levels of granularity, with no particular characteristic context length, and (2) long-range dependent (LRD), with a Hurst parameter of approximately H=0.7. Based on these findings, we argue that short-term patterns/dependencies in language, such as in paragraphs, mirror the patterns/dependencies over larger scopes, like entire documents. This may shed some light on how next-token prediction can capture the structure of text across multiple levels of granularity, from words and clauses to broader contexts and intents. In addition, we carry out an extensive analysis across different domains and architectures, showing that fractal parameters are robust. Finally, we demonstrate that the tiny variations in fractal parameters seen across LLMs improve upon perplexity-based bits-per-byte (BPB) in predicting their downstream performance. We hope these findings offer a fresh perspective on language and the mechanisms underlying the success of LLMs.
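To make the Hurst parameter concrete: one classical estimator is rescaled-range (R/S) analysis, in which the range of cumulative deviations within a window, divided by the window's standard deviation, grows roughly as the window length raised to the power H. The sketch below is a minimal illustration under stated assumptions, not the procedure used in the paper (the abstract does not specify one). The function name `hurst_rs`, the `min_window` parameter, and the synthetic inputs are all hypothetical. For independent noise the estimate should land near 0.5, while values above 0.5 (such as the reported H of about 0.7) indicate long-range dependence.

```python
import numpy as np


def hurst_rs(x, min_window=16):
    """Estimate the Hurst exponent of a 1-D series via rescaled-range analysis.

    Hypothetical helper for illustration; window sizes double from
    `min_window` up to half the series length.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)

    # Build the list of window sizes: min_window, 2*min_window, ...
    sizes = []
    w = min_window
    while w <= n // 2:
        sizes.append(w)
        w *= 2
    if len(sizes) < 2:
        raise ValueError("series too short for the chosen min_window")

    log_w, log_rs = [], []
    for w in sizes:
        k = n // w
        chunks = x[: k * w].reshape(k, w)
        # Cumulative deviations from each window's own mean.
        dev = chunks - chunks.mean(axis=1, keepdims=True)
        z = np.cumsum(dev, axis=1)
        r = z.max(axis=1) - z.min(axis=1)  # range of the cumulative deviations
        s = chunks.std(axis=1)             # per-window standard deviation
        valid = s > 0                      # skip constant windows
        if valid.any():
            log_w.append(np.log(w))
            log_rs.append(np.log(np.mean(r[valid] / s[valid])))

    # The slope of log(R/S) against log(window size) is the estimate of H.
    slope, _ = np.polyfit(log_w, log_rs, 1)
    return float(slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(4096)  # independent increments
    walk = np.cumsum(noise)            # strongly persistent series
    print("white noise:", hurst_rs(noise))
    print("random walk:", hurst_rs(walk))
```

R/S estimates are known to be biased upward on short series, so the white-noise value will typically sit somewhat above 0.5. To apply the same idea to text, one would first need to map it to a numeric series (for example, per-token information content under a language model), which is a separate modelling choice.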