
Scaling Laws for State Dynamics in Large Language Models

Published 20 May 2025 in cs.CL and cs.AI (arXiv 2505.14892v1)

Abstract: LLMs are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.

Summary

Insights into Scaling Laws for State Dynamics in Large Language Models

The paper "Scaling Laws for State Dynamics in Large Language Models" examines how transformer-based architectures model sequential state dynamics. It builds on the view of Large Language Models (LLMs) as world models, focusing on their capacity to predict state transitions from learned latent representations. The work is motivated by the need to understand the mechanistic interpretability and scaling behavior of LLMs when capturing dynamic state information, particularly for decision-making tasks in complex environments.

Methodological Approach

The authors investigate three domains: Box Tracking, Abstract DFA (Deterministic Finite Automaton) Sequences, and Complex Text Games. These tasks range from simple state updates to combinatorial reasoning challenges. Each domain is constructed to test the ability of the transformer's residual stream to maintain and update internal state representations as constraints and complexity scale. By varying the number of states and the transition density, the authors systematically study the models' capability to track and update state information.
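To make the DFA setup concrete, the sketch below generates a random sparse deterministic transition table and rolls out an (state, action) trajectory whose final state is the next-state label a model must predict. This is an illustrative reconstruction of the task family, not the paper's released generator; function names and the choice of starting state are assumptions.

```python
import random

def make_dfa(n_states, n_transitions, n_actions=5, seed=0):
    """Sample a random deterministic transition table.

    Only `n_transitions` (state, action) pairs are defined, modeling
    the sparse-transition regime studied in the paper.
    """
    rng = random.Random(seed)
    pairs = [(s, a) for s in range(n_states) for a in range(n_actions)]
    table = {}
    for s, a in rng.sample(pairs, min(n_transitions, len(pairs))):
        table[(s, a)] = rng.randrange(n_states)
    return table

def rollout(table, length, seed=0):
    """Generate a trajectory of defined (state, action) steps plus the next-state label."""
    rng = random.Random(seed)
    state, seq = 0, []                      # starting state 0 is an assumption
    for _ in range(length):
        defined = [a for (s, a) in table if s == state]
        if not defined:                     # dead end: no outgoing transitions
            break
        action = rng.choice(defined)
        seq.append((state, action))
        state = table[(state, action)]
    return seq, state                       # the model must predict `state` from `seq`
```

Shrinking `n_transitions` relative to `n_states * n_actions` reproduces the large-state, low-transition regime where the paper reports accuracy collapsing.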

Transformer models at varying parameter scales are evaluated, including TinyStories, GPT-2, and Pythia. The experiments measure representational capacity, attribute state-tracking behavior to specific components via activation patching, and probe the mechanisms that enable state tracking, with particular focus on attention-head patterns.
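The causal logic of activation patching can be shown with a deliberately tiny stand-in "model": run a clean input and a corrupted input, cache one component's activation from the clean run, overwrite that activation during the corrupted run, and check whether the clean prediction is restored. This is a schematic sketch of the method's logic, not the paper's implementation; in practice the same swap is performed inside a real transformer forward pass with hooks.

```python
import numpy as np

N_STATES = 4

def toy_model(state_token, patch=None):
    """Stand-in 'model' whose prediction flows through a single head activation.

    The head's activation is a one-hot read of the state token, and the
    output logits simply copy that activation downstream.
    """
    head_act = np.eye(N_STATES)[state_token]  # what the head writes to the residual stream
    if patch is not None:
        head_act = patch                      # the patch: overwrite with a cached activation
    logits = head_act                         # downstream computation reads the head's output
    return head_act, int(logits.argmax())

clean_act, clean_pred = toy_model(2)              # clean run: correct state
_, corrupt_pred = toy_model(3)                    # corrupted run: wrong state
_, patched_pred = toy_model(3, patch=clean_act)   # patch clean activation into corrupted run
# If patching restores the clean prediction, the head causally carries state information.
```

This restored-prediction criterion is how heads such as GPT-2 XL Layer 22 Head 20 are singled out as carriers of state information.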

Key Findings

Box Tracking

The research demonstrates that larger models achieve higher accuracy on Box Tracking, reflecting improved capacity for entity state tracking. Within the GPT-2 family, accuracy rises with model size, although performance still degrades as task complexity grows. Activation patching identifies specific layers and attention heads as pivotal for correctly identifying and updating object states, offering deeper insight into how transformers manage dynamic state representations.
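A box-tracking instance can be sketched as a prompt generator that places items in boxes, applies a sequence of moves, and asks for one item's final location. The template wording, item list, and query format below are illustrative assumptions; the paper's exact prompts may differ.

```python
import random

ITEMS = ["apple", "key", "coin", "book", "ring", "cup", "pen"]

def box_tracking_example(n_boxes, n_moves, seed=0):
    """Build a box-tracking prompt and its gold answer.

    Items start in fixed boxes, a sequence of moves relocates them, and
    the model must report one item's final box as the next token.
    """
    rng = random.Random(seed)
    loc = {ITEMS[i]: i for i in range(n_boxes)}       # item -> box index
    lines = [f"The {item} is in Box {b}." for item, b in loc.items()]
    for _ in range(n_moves):
        item = rng.choice(list(loc))
        dest = rng.randrange(n_boxes)
        lines.append(f"Move the {item} to Box {dest}.")
        loc[item] = dest                               # ground-truth state update
    query = rng.choice(list(loc))
    prompt = " ".join(lines) + f" Which box holds the {query}? Box"
    return prompt, loc[query]                          # gold next token: the box index
```

Increasing `n_boxes` past 5 is the regime in which the summary reports GPT-2 XL falling below 30% accuracy.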

Abstract DFA Sequences

In contrast to Box Tracking, the Abstract DFA task exposes limitations in state-action reasoning as the state space grows while transition density stays low. The accuracy patterns imply that larger models scale more stably and are more resilient, but they still struggle in large-state, low-transition regimes. Sequence-patching results point to the mechanisms underlying state-action tracking, identifying particular attention heads as crucial for propagating state information across a sequence.

Complex Text Games

For the most linguistically and conceptually demanding domain, Complex Text Games, accuracy declines markedly, indicating that current LLMs struggle with intricate combinatorial state updates. Although larger models perform better, linguistic variability and contextual complexity make this domain especially difficult, underscoring the need for further research. The mechanistic analysis attributes state-tracking successes to certain attention heads and residual-stream patches, though these contributions are inconsistent across task complexities.

Implications and Future Directions

This paper provides valuable insight into the limits and capabilities of LLMs in tracking state dynamics, highlighting the architectural aspects that most need improvement. The identification of specific attention heads responsible for state propagation suggests avenues for targeted interventions and optimization. Future research may focus on strengthening Name-Mover Heads, improving path-patching techniques, and developing more robust benchmarks to better understand and augment these models' decision-making abilities.

Overall, these findings point toward moving autoregressive transformers beyond mere pattern matching, toward concept reasoning that incorporates causal state dynamics. The practical implications span AI applications requiring dynamic state awareness, including planning, simulation, and autonomous-agent modeling, while the theoretical contributions open the way for deeper work on model interpretability and architectural innovation.
