
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs (2511.01202v1)

Published 3 Nov 2025 in cs.IT, cs.AI, and math.IT

Abstract: LLMs have demonstrated remarkable capabilities in numerous real-world applications. While the vast majority of research conducted from an experimental perspective is progressing rapidly, it demands substantial computational power, data, and other resources. Therefore, how to open the black box of LLMs from a theoretical standpoint has become a critical challenge. This paper takes the theory of the rate-distortion function, directed information, and Granger causality as its starting point to investigate the information-theoretic principles behind LLMs, leading to the development of semantic information theory for LLMs, where the fundamental unit is the token, rather than the bit, which lacks any semantic meaning. By defining the probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures, such as the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in the inference phase. This paper also delves deeply into the theory of token-level semantic embedding and the information-theoretically optimal vectorization method. Thereafter, we propose a general definition of autoregression LLMs, from which the Transformer architecture and its performance characteristics, such as the ELBO, generalization error bound, memory capacity, and semantic information measures, can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed in our framework. Consequently, this paper provides a theoretical framework for understanding LLMs from the perspective of semantic information theory, which also offers the necessary theoretical tools for further in-depth research.

Summary

  • The paper’s main contribution is a novel semantic framework that shifts focus from bits to tokens, offering an information-theoretic perspective for LLMs.
  • It introduces a comprehensive probabilistic model employing directed rate-distortion functions and token-level semantic embeddings to guide prediction.
  • The study leverages variational inference to derive performance and generalization bounds, setting a principled direction for future LLM development.

Semantic Information Theory for LLMs

"Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs" introduces a ground-breaking theoretical framework for understanding LLMs through the lens of semantic information theory. This framework shifts focus from bits to tokens, considering the latter as fundamental units of information, aiming to explain LLM principles and operations in a coherent information-theoretic context.

Probabilistic Model and Directed Information

The paper proposes a comprehensive probabilistic model for LLMs, defining the semantic-driven transition probabilities that govern token prediction. Rather than relying on classical bit-level measures such as channel capacity, the framework employs a directed rate-distortion function to model information transfer during pre-training, emphasizing semantic relevance rather than exact bitwise recovery. This measure is argued to keep generated outputs aligned with the intended semantic outcomes, extending Shannon's classical theory toward semantic interpretation.
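
For orientation, the two classical quantities this construction builds on can be stated compactly. The directed rate-distortion function discussed in the paper replaces the mutual information in the classical rate-distortion problem with a directed (causal) information term over token sequences; the expressions below are the standard textbook definitions, not the paper's exact token-level formulation.

```latex
% Classical rate-distortion function: the minimum rate needed to represent X
% within expected distortion D under a distortion measure d.
R(D) = \min_{p(\hat{x} \mid x)\,:\, \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})

% Massey's directed information from X^n to Y^n: the causally conditioned
% information flow, which underlies the "directed" measures used in the paper.
I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i ; Y_i \mid Y^{i-1})
```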

Token-Level Semantic Embedding

Token-level semantic embedding is formalized within a probabilistic semantic vector space, representing tokens as points on a high-dimensional unit sphere. The work acknowledges computational constraints and advocates dimensionality reduction that preserves semantic coherence, using tools such as the Gromov-Wasserstein metric and the Johnson-Lindenstrauss lemma to obtain optimized vector representations. The framework elucidates how LLMs can efficiently encode semantic information, drawing parallels to established methods in signal processing.
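
As a concrete illustration of the dimensionality-reduction step, the sketch below applies a Johnson-Lindenstrauss-style random projection to token embeddings on the unit sphere and checks how well pairwise similarities are preserved. The dimensions, the Gaussian projection matrix, and the renormalization step are illustrative assumptions, not the paper's information-theoretically optimal vectorization method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_high, d_low = 1000, 4096, 256  # illustrative sizes

# High-dimensional token embeddings, normalized onto the unit sphere.
E = rng.standard_normal((n_tokens, d_high))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Johnson-Lindenstrauss projection: a Gaussian random matrix scaled by
# 1/sqrt(d_low) approximately preserves pairwise distances with high probability.
P = rng.standard_normal((d_high, d_low)) / np.sqrt(d_low)
E_low = E @ P
E_low /= np.linalg.norm(E_low, axis=1, keepdims=True)  # back onto the sphere

# Compare cosine similarities before and after projection for a few random pairs.
pairs = rng.integers(0, n_tokens, size=(5, 2))
for i, j in pairs:
    print(f"original: {E[i] @ E[j]:+.3f}   projected: {E_low[i] @ E_low[j]:+.3f}")
```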

Autoregression LLMs (AR-LLMs)

The paper proposes a general definition for autoregression LLMs (AR-LLMs) based on time-varying vector autoregression (TV-VAR) processes. The Transformer architecture is treated as a special case of AR-LLM, with the attention mechanism modeled theoretically as a bilinear form governing semantic relevance. This explicit decomposition into weights and bilinear forms demystifies Transformer dynamics and embeds the architecture within a broader class of AR-LLM structures.
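
The bilinear-form view of attention can be made concrete with a few lines of linear algebra. The sketch below shows that the standard scaled dot-product score between two token embeddings equals a single bilinear form x_i^T M x_j with M = W_Q W_K^T / sqrt(d_k); the single-head setting, shapes, and variable names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 8, 32, 16

X = rng.standard_normal((seq_len, d_model))      # token embeddings
W_Q = rng.standard_normal((d_model, d_k))        # query projection
W_K = rng.standard_normal((d_model, d_k))        # key projection

# Standard formulation: scaled dot products of projected queries and keys.
scores_standard = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)

# Bilinear-form formulation: a single matrix M acting on the raw embeddings.
M = W_Q @ W_K.T / np.sqrt(d_k)
scores_bilinear = X @ M @ X.T

assert np.allclose(scores_standard, scores_bilinear)

# Causal (autoregressive) masking and row-wise softmax give attention weights.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores_bilinear)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
```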

Performance and Generalization

Variational inference is leveraged to derive the evidence lower bound (ELBO) for the Transformer, offering a rigorous lens for analyzing its pre-training and generalization behavior. The derived ELBO demonstrates the equivalence between the pre-training objective and variational optimization. The paper further discusses generalization error bounds drawn from statistical learning theory, emphasizing the importance of minimizing the impact of logit quantization during model deployment.
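
For reference, the generic variational identity underlying this analysis takes the standard form below; the paper's contribution is to instantiate it for the Transformer's autoregressive structure, which is not reproduced here.

```latex
% Generic evidence lower bound (ELBO): for any variational distribution
% q_\phi(z \mid x), the log-evidence is lower-bounded as
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\!\big(q_\phi(z \mid x)\,\big\|\,p_\theta(z)\big)
```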

Implications and Future Research

This theoretical foundation holds significant implications for LLM development, promoting a paradigm shift from empirical trial-and-error approaches to principled information-theoretic methodologies. With an emphasis on semantic alignment and optimal information flow, the framework not only provides explanatory power for existing LLM capabilities but also sets the stage for exploring advanced reasoning capabilities beyond Granger causality.

Conclusion

In conclusion, this research extends information theory into semantically rich domains, offering a cohesive framework for understanding and advancing the state-of-the-art in LLMs. By bridging semantic interpretation with technical formalisms, it provides both a robust theoretical foundation and practical insights for optimizing and innovating future artificial intelligence language systems.
