L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling (2503.04725v1)

Published 6 Mar 2025 in cs.CL, cs.AI, cs.IT, cs.LG, math.IT, and physics.data-an

Abstract: We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of LLMs toward longer context lengths.
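
A rough sketch of the two objects the abstract names, offered as an illustration rather than the paper's formal statement: assume the bipartite mutual information between two adjacent length-$L$ blocks of text follows a power law, and read the L$^2$M condition as a matching growth requirement on the model's memory. The symbols $I_{\mathrm{bp}}$, $\beta$, and $m(L)$ below are illustrative notation, not taken from the paper.

$$
I_{\mathrm{bp}}(L) \;=\; I\big(X_{1:L};\, X_{L+1:2L}\big) \;\sim\; L^{\beta}, \qquad \beta > 0,
$$

$$
\text{L}^2\text{M condition (sketch):}\qquad m(L) \;\gtrsim\; I_{\mathrm{bp}}(L) \;\sim\; L^{\beta},
$$

where $m(L)$ is the size of the latent state a model uses to store past information at context length $L$. On this reading, an architecture whose memory grows with context (e.g. a transformer's key-value cache) can keep pace with $I_{\mathrm{bp}}(L)$, while a fixed-size recurrent state eventually cannot.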

Authors (5)
  1. Zhuo Chen (319 papers)
  2. Oriol Mayné i Comas (1 paper)
  3. Zhuotao Jin (1 paper)
  4. Di Luo (63 papers)
  5. Marin Soljačić (141 papers)