
Token-Level Markov Decision Process

Updated 2 October 2025
  • Token-Level MDP is a sequential decision model where discrete tokens serve as observations to construct state representations for predictive control.
  • The approach utilizes a minimum description length principle to balance complexity and reward predictiveness, automatically selecting optimal token aggregations.
  • Extensions to dynamic Bayesian networks enable the model to capture structured token dependencies in applications like NLP, robotics, and recommendation systems.

A Token-Level Markov Decision Process (MDP) is a sequential decision model where the agent's observations are discrete tokens, such as words, characters, sensor readings, or user actions, which arrive in a stream and potentially possess intrinsic structure and dependencies. The central challenge is to construct or learn a state representation derived from these tokens such that the resulting process approximates an MDP; that is, the future can be predicted from the current state and action with sufficient accuracy for reinforcement learning and optimal control. This forms the basis for a principled approach to feature extraction and state abstraction in environments with fine-grained, tokenized observations.

1. Formal Objective Criterion for State Representation

A foundational principle in deriving token-level state abstractions is the minimization of coding (or description) length, as articulated in "Feature Markov Decision Processes" (0812.4580). Each candidate state mapping $\phi: h \to s$ summarizes the history $h$ (composed of token, action, and reward sequences) into a state $s$ such that the resulting state–action–reward sequence is both succinct and predictive. The formal cost function is:

$$\operatorname{Cost}(\phi \mid h) = CL(S_{1:n} \mid a_{1:n}) + CL(r_{1:n} \mid S_{1:n}, a_{1:n})$$

where $CL(\cdot)$ denotes code length and $S_{1:n}$, $a_{1:n}$, $r_{1:n}$ are the sequences of states, actions, and rewards, respectively. The optimal mapping

$$\phi^* = \arg\min_\phi \operatorname{Cost}(\phi \mid h)$$

balances compactness against predictive fidelity: minimizing excessive state-space growth while capturing reward-relevant token dependencies.

For i.i.d. sequences over tokens, the code length is:

$$CL(x_{1:n}) = n H(\hat{\theta}) + \sum_{i=1}^{m'} \frac{1}{2} \log n$$

where $H(\hat{\theta})$ is the empirical entropy of the observed token frequencies and $m'$ is the number of observed symbols.
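As an illustrative calculation (with logarithms taken base 2 and the counts invented purely for this example), a binary stream of $n = 100$ tokens containing 70 zeros and 30 ones gives

$$CL(x_{1:100}) \approx 100 \cdot H(0.3) + 2 \cdot \tfrac{1}{2} \log_2 100 \approx 88.1 + 6.6 \approx 94.8 \text{ bits},$$

so at this length the parameter penalty is small relative to the entropy term.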

Applied to token-level streams, candidate $\phi$ functions extract features such as the last $k$ tokens (n-grams), recent sensor readings, or structured aggregations, with the cost function used to systematically select the optimal token grouping that most effectively “explains” the environment's reward structure.
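To make the selection concrete, the sketch below scores candidate mappings that remember the last $k$ tokens. It is a minimal illustration under simplifying assumptions, not the paper's exact coding scheme: each conditional code length is approximated by summing the two-part i.i.d. code over contexts, and the function names (`code_length`, `phi_last_k`, `phi_cost`) are invented for this example.

```python
import math
from collections import Counter, defaultdict

def code_length(symbols):
    """Two-part code length (bits) of an i.i.d. symbol sequence:
    n * empirical entropy + (1/2) * log2(n) per observed symbol."""
    n = len(symbols)
    counts = Counter(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy + 0.5 * len(counts) * math.log2(n)

def phi_last_k(tokens, k):
    """Candidate phi: the state at step t is the tuple of the last k tokens."""
    return [tuple(tokens[max(0, t - k + 1): t + 1]) for t in range(len(tokens))]

def phi_cost(tokens, actions, rewards, k):
    """Cost(phi | h) = CL(S | a) + CL(r | S, a), approximated per context."""
    states = phi_last_k(tokens, k)
    # CL(S_{1:n} | a_{1:n}): code the next state given (previous state, action).
    trans_ctx = defaultdict(list)
    for t in range(1, len(states)):
        trans_ctx[(states[t - 1], actions[t - 1])].append(states[t])
    # CL(r_{1:n} | S_{1:n}, a_{1:n}): code the reward given (state, action).
    rew_ctx = defaultdict(list)
    for t in range(len(states)):
        rew_ctx[(states[t], actions[t])].append(rewards[t])
    return (sum(code_length(v) for v in trans_ctx.values())
            + sum(code_length(v) for v in rew_ctx.values()))

# Usage: pick the context length with the smallest description length.
# tokens, actions, rewards = ...  # aligned sequences observed by the agent
# best_k = min(range(4), key=lambda k: phi_cost(tokens, actions, rewards, k))
```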

2. Integration Into Learning Algorithms

This objective criterion is directly integrated into the learning process rather than used merely for post hoc evaluation. In the $\phi$MDP-Agent algorithm, the workflow operates as follows:

  • The agent collects histories of tokens, actions, and rewards.
  • For each candidate $\phi$, states are derived and transition/reward models are estimated via empirical frequency.
  • These models are used with Bellman backups (e.g., $Q^*(s,a) = \sum_{s'} T(s' \mid s,a)\,[R(s') + \gamma V(s')]$) to compute optimal actions.
  • Stochastic local search (e.g., state splitting/merging) guided by $\operatorname{Cost}(\phi \mid h)$ explores the space of $\phi$ mappings.
  • Exploration is balanced with exploitation so that all token-derived states are eventually sufficiently sampled.

For token-level MDPs, this framework enables automatic learning of the optimal token aggregation and state representation, leveraging the minimum description length objective to adaptively refine both the state abstraction and the control policy throughout learning.
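A compact sketch of the inner loop under one fixed candidate $\phi$ follows: models are estimated by empirical frequency and actions are obtained with the Bellman backup stated above. The outer stochastic search over $\phi$ mappings is omitted, and the function names, fixed sweep count, and epsilon-greedy exploration scheme are assumptions of this illustration rather than the paper's exact procedure.

```python
import random
from collections import defaultdict

def estimate_models(states, actions, rewards):
    """Empirical estimates of T(s'|s,a) and of R(s), the mean reward observed in s."""
    trans = defaultdict(lambda: defaultdict(int))
    rew_sum, rew_n = defaultdict(float), defaultdict(int)
    for t in range(len(states) - 1):
        trans[(states[t], actions[t])][states[t + 1]] += 1
    for s, r in zip(states, rewards):
        rew_sum[s] += r
        rew_n[s] += 1
    T = {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
         for sa, nxt in trans.items()}
    R = {s: rew_sum[s] / rew_n[s] for s in rew_n}
    return T, R

def value_iteration(T, R, action_set, gamma=0.95, sweeps=200):
    """Bellman backups: Q(s,a) = sum_s' T(s'|s,a) * (R(s') + gamma * max_a' Q(s',a'))."""
    state_set = {sa[0] for sa in T} | {s2 for nxt in T.values() for s2 in nxt}
    Q = {(s, a): 0.0 for s in state_set for a in action_set}
    for _ in range(sweeps):
        V = {s: max(Q[(s, a)] for a in action_set) for s in state_set}
        for (s, a), nxt in T.items():
            Q[(s, a)] = sum(p * (R.get(s2, 0.0) + gamma * V[s2]) for s2, p in nxt.items())
    return Q

def act(Q, state, action_set, epsilon=0.1):
    """Epsilon-greedy choice so that token-derived states keep being sampled."""
    if random.random() < epsilon:
        return random.choice(list(action_set))
    return max(action_set, key=lambda a: Q.get((state, a), 0.0))
```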

3. Extensions to Structured Models: Dynamic Bayesian Networks

In environments where tokens have strong inter-token dependencies (such as language with syntactic or semantic structure), the basic MDP framework can be extended to Dynamic Bayesian Networks (DBNs). Rather than encoding states as unstructured labels, DBNs use collections of variables with explicit graphical dependencies. The selection of DBN structure is also guided by code-length minimization, now applied to the network. This leads to more nuanced state representations capable of capturing, for example, multi-token patterns, part-of-speech sequences, or latent syntactic constructs—enabling succinct yet expressive modeling for prediction and control.
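As a concrete, highly simplified sketch of structure selection by description length, the snippet below scores candidate factorizations of a token-derived feature vector: each feature is coded given the joint value of its chosen parents. Restricting parents to the same time step, the per-group two-part i.i.d. code, and the names used (`dbn_structure_cost`, the example features) are assumptions of this illustration, not the DBN machinery of the source.

```python
import math
from collections import Counter

def iid_code_length(symbols):
    """n * empirical entropy + (1/2) * log2(n) per observed symbol (two-part code)."""
    n = len(symbols)
    counts = Counter(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy + 0.5 * len(counts) * math.log2(n)

def dbn_structure_cost(feature_rows, parents):
    """Description length of a factored state sequence under a candidate structure.
    feature_rows: list of feature tuples, one per time step.
    parents: dict mapping feature index -> tuple of parent feature indices."""
    total = 0.0
    for i in range(len(feature_rows[0])):
        groups = {}
        for row in feature_rows:
            key = tuple(row[p] for p in parents.get(i, ()))
            groups.setdefault(key, []).append(row[i])
        total += sum(iid_code_length(g) for g in groups.values())
    return total

# Usage: prefer the structure with the smaller total code length, e.g. fully
# independent features versus feature 1 depending on feature 0.
# rows = [(pos_tag, is_capitalized), ...]   # hypothetical per-token features
# cost_independent = dbn_structure_cost(rows, parents={})
# cost_chain = dbn_structure_cost(rows, parents={1: (0,)})
```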

4. Key Models and Computational Formulations

Central formulas for token-level MDP abstraction include:

  • $CL(x_{1:n})$: code length for an i.i.d. token sequence
  • $CL(S_{1:n} \mid a_{1:n})$: state-transition coding cost
  • $CL(r_{1:n} \mid S_{1:n}, a_{1:n})$: reward-sequence coding cost
  • $\operatorname{Cost}(\phi \mid h)$: overall description-length objective
  • $Q^*(s,a)$ (Bellman equation): optimal action value in the MDP

Alternate formulations, such as coding only the rewards, are also introduced:

$$\operatorname{ICost}(\phi \mid h) = -\log P_U(r_{1:n} \mid a_{1:n}) + M \log n$$

where $U$ models joint transition–reward probabilities and $M$ denotes model complexity. Transition and reward models are learned for each $\phi$ using transition counts and estimated conditional probabilities.
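Under the simplifying assumption, made only for this illustration, that the state sequence induced by $\phi$ on the observed history is treated as given, the reward-likelihood term factorizes and can be estimated by plug-in conditional frequencies:

$$-\log P_U(r_{1:n} \mid a_{1:n}) \approx -\sum_{t=1}^{n-1} \log \hat{P}(r_{t+1} \mid s_t, a_t),$$

with $M$ interpreted here as the number of free parameters of the estimated transition–reward model.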

5. Representative Applications and Practical Scenarios

Token-level MDP abstraction methods apply broadly:

  • NLP: When tokens are words, automatic selection of $\phi$ among unigram, bigram, or higher-order n-gram models can reveal the optimal context length for predicting downstream reward (e.g., translation accuracy, task success).
  • Robotics and Control: Sensor tokens (e.g., discretized readings) are grouped into state contexts using a $\phi$ that captures sufficient environmental patterning for reliable reward prediction (e.g., navigation or manipulation success).
  • Online Recommendation Systems: User tokens (actions/clicks) are aggregated into states that most accurately forecast reward signals (e.g., purchases, engagement metrics).

A practical example presented in the foundational work involves binary observations and quaternary rewards; defining candidate mappings $\phi$ as functions of the previous 0, 1, or 2 tokens shows, via cost calculations, that retaining a memory of two tokens yields the superior predictive encoding of rewards.
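The sketch below reproduces the flavor of that comparison on synthetic data rather than the paper's actual example: the binary token stream, the noisy reward rule, and the single-action simplification are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict

def code_length(symbols):
    """Two-part i.i.d. code length in bits."""
    n = len(symbols)
    counts = Counter(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy + 0.5 * len(counts) * math.log2(n)

def reward_cost(tokens, rewards, k):
    """CL(r | S) when phi keeps the last k tokens (single-action setting)."""
    groups = defaultdict(list)
    for t in range(len(tokens)):
        state = tuple(tokens[max(0, t - k + 1): t + 1])
        groups[state].append(rewards[t])
    return sum(code_length(g) for g in groups.values())

# Synthetic stream: a quaternary reward that is a noisy function of the last
# two binary tokens (the generator itself is an assumption of this example).
random.seed(0)
tokens = [random.randint(0, 1) for _ in range(5000)]
rewards = []
for t in range(len(tokens)):
    prev = tokens[t - 1] if t > 0 else 0
    clean = 2 * prev + tokens[t]
    rewards.append(clean if random.random() < 0.9 else random.randint(0, 3))

for k in (0, 1, 2, 3):
    print(k, round(reward_cost(tokens, rewards, k), 1))
# The description length drops sharply at k = 2 and rises slightly at k = 3,
# since the extra context adds parameters without improving reward prediction.
```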

6. Implications and Theoretical Context

The framework leverages minimum description length principles to mechanize feature extraction in reinforcement learning for sequential decision environments with fine-grained token observations. The stochastic, cost-driven search over token-to-state mapping functions delivers a systematic methodology for solving what was previously handled via domain expertise and manual feature engineering. The extension to DBNs reinforces applicability to structured environments, with the potential for further elaboration in hierarchical or relational settings.

A plausible implication is that the methodology provides a scalable route to data-driven abstraction, robust to the exponential growth of state spaces typical in token-level domains. By explicitly quantifying the tradeoff between representational cost and reward predictiveness, it drives the automatic selection of abstractions most informative for sequential decision-making.

7. Summary of Key Contributions

The token-level MDP formalism, anchored in cost-minimization for state representation, constitutes an objective paradigm for extracting, evaluating, and refining abstractions from streams of discrete tokens. The machinery—state coding costs, reward predictiveness, adaptive stochastic search, and extension to graphical models—collectively supports principled construction of MDP models in domains where observations are tokenized and potentially strongly interdependent. This unifies feature representation with optimal control theory, advancing the automation of sequential decision-making under uncertainty in high-dimensional, token-based environments.

References (1)

  • Feature Markov Decision Processes (arXiv:0812.4580)
