State-Action Occupancy Measure
- SAOM is defined as the expected (discounted) visitation frequency over state-action pairs under a policy, encapsulating long-term behavior in MDPs and RL.
- It provides a convex analytic framework for policy evaluation and optimization across classical, non-Markovian, and general-utility reinforcement learning settings.
- SAOM underpins advanced methods such as gradient estimation, function approximation, and exploration diagnostics via metrics such as the Effort of Sequential Learning (ESL) and the Optimal Movement Ratio (OMR).
The state-action occupancy measure (SAOM) is a foundational object in the analysis and algorithmic design of Markov decision processes (MDPs) and reinforcement learning (RL). Defined as the expected (possibly discounted) visitation measure over state-action pairs under a policy, the SAOM enables convex-analytic, geometric, and probabilistic analyses of long-term behavior, policy evaluation, and policy optimization. These tools apply across classical MDPs, general-utility RL frameworks, non-Markovian policy settings, and recent advances in RL for preference optimization and large-scale function approximation.
1. Formal Definition and Characterizations
The SAOM quantifies the (discounted or undiscounted) expected visitation frequency of each state-action pair under a policy. In a $\gamma$-discounted MDP with stationary policy $\pi$ and initial distribution $\rho$, the occupancy measure is

$$\lambda^{\pi}(s,a) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{P}^{\pi}_{\rho}(s_t = s,\ a_t = a),$$

or more generally,

$$\mu^{\pi}(\Gamma) \;=\; \mathbb{E}^{\pi}_{\nu}\!\left[\sum_{t=0}^{T-1} \mathbf{1}\{(s_t, a_t) \in \Gamma\}\right]$$

for absorbing MDPs, where $T$ is the absorption time and $\Gamma$ is any measurable set of state-action pairs (Dufour et al., 2023).
The measure satisfies a system of linear characteristic (or "flow-balance") equations. For absorbing MDPs, the balance equation is

$$\mu(B \times A) \;=\; \nu(B) \;+\; \int_{S \times A} Q(B \mid s, a)\,\mu(ds, da) \qquad \text{for measurable } B \subseteq S \text{ outside the absorbing set},$$

where $Q$ is the transition kernel and $\nu$ the initial distribution (Dufour et al., 2023). For the standard discounted setting,

$$\sum_{a \in A} \lambda^{\pi}(s, a) \;=\; \rho(s) \;+\; \gamma \sum_{s', a'} P(s \mid s', a')\,\lambda^{\pi}(s', a')$$

(Barakat et al., 2023, Barakat et al., 2024).
For non-Markovian (history-dependent) policies $\pi$, the occupancy measure $\lambda^{\pi}$ remains well-defined on all Borel subsets, and any such policy has a Markovian equivalent; i.e., for any history-dependent policy $\pi$, there exists a Markov policy $\pi'$ such that $\lambda^{\pi'} = \lambda^{\pi}$ (Laroche et al., 2022). In finite spaces this equivalence is classical; the cited work extends it to continuous state-action spaces.
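To make these definitions concrete, the following minimal sketch (a made-up three-state MDP; all names and sizes illustrative) computes the discounted SAOM of a fixed policy by solving the flow-balance system directly, then verifies the balance identity and the total-mass identity $\sum_{s,a}\lambda^{\pi}(s,a) = 1/(1-\gamma)$.

```python
import numpy as np

# A minimal sketch (toy MDP, illustrative names): compute the discounted SAOM
# of a fixed policy by solving the flow-balance equations, then verify the
# balance identity and the total-mass identity 1/(1 - gamma).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]
rho = np.ones(n_states) / n_states                                # initial dist.

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum("sa,sax->sx", pi, P)

# Flow balance  d(s) = rho(s) + gamma * sum_{s'} P_pi(s' -> s) d(s')
# rearranges to (I - gamma * P_pi^T) d = rho.
d_state = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)
lam = d_state[:, None] * pi                                       # lambda(s, a)

assert np.allclose(lam.sum(axis=1), rho + gamma * P_pi.T @ d_state)
assert np.isclose(lam.sum(), 1.0 / (1.0 - gamma))
print(lam)
```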
2. Theoretical Role in RL and MDP Analysis
The SAOM encapsulates all information relevant to long-term returns and is the sufficient statistic for any objective function that depends only on aggregated visit frequencies (e.g., expected return, constraints, pure exploration, or divergence to expert distributions) (Barakat et al., 2023, Barakat et al., 2024). In the convex-analytic treatment of absorbing MDPs, the set of occupation measures provides a complete description of the feasible space for optimization. In the general-utility RL setting, objectives can be arbitrary functions $F$ of $\lambda^{\pi}$, including non-linear and non-additive functionals, reducing policy optimization to optimization over occupancy measures (Barakat et al., 2024, Barakat et al., 2023).
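As a small illustration of utilities over occupancies, the sketch below contrasts the classical linear objective with a non-linear entropy utility, assuming an occupancy array `lam` in the format of the previous snippet; the reward table is a toy placeholder.

```python
import numpy as np

# Two utilities evaluated on the same occupancy array `lam`
# (shape [n_states, n_actions]); the reward table is a toy placeholder.
def expected_return(lam, r):
    """Classical RL objective: a *linear* functional <r, lambda>."""
    return float(np.sum(lam * r))

def occupancy_entropy(lam):
    """A general (non-linear, non-additive) utility: entropy of the
    normalized SAOM, a standard pure-exploration objective."""
    p = lam / lam.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

lam = np.array([[4.0, 1.0], [2.0, 3.0]])   # toy occupancy values
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # toy reward table
print(expected_return(lam, r), occupancy_entropy(lam))
```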
The occupation measure set satisfies compactness properties under uniform absorbency and regularity conditions. Pathological "phantom measures"—solutions to the linear equations not arising from any policy—appear exactly when nonzero invariant measures exist for the substochastic kernel outside the absorbing set. Absence of such invariant measures ensures all solutions are true occupation measures. Uniform absorbency (the hitting-time tail control) is necessary and sufficient for compactness of the set of occupation measures (Dufour et al., 2023).
3. SAOM in Policy Optimization and Advanced Objectives
In modern RL with general utilities, policy optimization is formulated as

$$\max_{\pi}\; F(\lambda^{\pi}),$$

where the utility $F$ may encode imitation (e.g., divergence to an expert occupancy), risk preferences, exploration bonuses, or safety constraints (Barakat et al., 2024, Barakat et al., 2023).
Gradient-based methods employ the chain-rule relationship

$$\nabla_{\theta} F(\lambda^{\pi_{\theta}}) \;=\; \big\langle \nabla_{\lambda} F(\lambda^{\pi_{\theta}}),\; \nabla_{\theta} \lambda^{\pi_{\theta}} \big\rangle,$$

with $\nabla_{\theta}\lambda^{\pi_{\theta}}$ estimated via likelihood ratios and $\nabla_{\lambda} F$ computed for the chosen utility; in effect, this is a standard policy gradient in which the reward is replaced by the pseudo-reward $\nabla_{\lambda} F(\lambda^{\pi_{\theta}})$. In large state-action spaces, $\lambda^{\pi_{\theta}}$ is estimated with function approximation (e.g., linear in features or parameterized densities). Maximum likelihood estimation (MLE) of the marginal state distribution, using samples from $\pi_{\theta}$, ensures high-fidelity occupancy approximation with sample complexity scaling in the feature dimension, not the state-action space size (Barakat et al., 2024). A sketch of the resulting gradient estimator follows.
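The sketch below illustrates this estimator under stated assumptions: a toy tabular MDP, a softmax policy, the occupancy-entropy utility, and a plain REINFORCE estimator driven by the pseudo-reward $\nabla_{\lambda} F$. It is not the exact N-VR-PG or PG-OMA procedure, only the underlying chain-rule gradient.

```python
import numpy as np

# General-utility gradient sketch: an ordinary policy gradient whose reward is
# the pseudo-reward r(s,a) = dF/dlambda(s,a), evaluated at an empirical
# occupancy estimate. Toy MDP; F = entropy of the normalized SAOM.
rng = np.random.default_rng(1)
nS, nA, gamma, H, n_traj = 3, 2, 0.9, 40, 64
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s']
theta = np.zeros((nS, nA))                         # softmax policy logits

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sample_trajectories(theta):
    pi, trajs = policy(theta), []
    for _ in range(n_traj):
        s, tau = int(rng.integers(nS)), []
        for _ in range(H):
            a = rng.choice(nA, p=pi[s])
            tau.append((s, a))
            s = rng.choice(nS, p=P[s, a])
        trajs.append(tau)
    return trajs

def occupancy_estimate(trajs):
    lam = np.zeros((nS, nA))
    for tau in trajs:
        for t, (s, a) in enumerate(tau):
            lam[s, a] += gamma ** t
    return lam / len(trajs)

trajs = sample_trajectories(theta)
lam = occupancy_estimate(trajs)
p = lam / lam.sum()
ent = -(p * np.log(p + 1e-12)).sum()
pseudo_r = (-np.log(p + 1e-12) - ent) / lam.sum()  # dF/dlam for entropy utility

# REINFORCE with the pseudo-reward: discounted score-function estimator.
pi, grad = policy(theta), np.zeros_like(theta)
for tau in trajs:
    for t, (s, a) in enumerate(tau):
        G = sum(gamma ** (k - t) * pseudo_r[tau[k][0], tau[k][1]]
                for k in range(t, len(tau)))
        score = -pi[s].copy(); score[a] += 1.0     # d log pi(a|s) / d theta[s,:]
        grad[s] += gamma ** t * G * score
theta += 0.5 * grad / n_traj                       # one ascent step on F
```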
Variance reduction, normalization, and multi-step or feature-based estimators are employed to reduce the stochastic error in occupancy and gradient estimates, as in single-loop N-VR-PG and PG-OMA algorithms. These approaches achieve strong sample complexity guarantees for first-order optimality and, under concavity and sufficient policy representation, for global optimality (Barakat et al., 2024, Barakat et al., 2023).
4. Geometry, Exploration, and Optimal Transport Analyses
The SAOM induces a geometric structure on the policy space, enabling the use of metrics such as Wasserstein distance for trajectory analysis (Nkhumise et al., 2024). In this framework, the sequence of policies generated by an RL algorithm traces a trajectory in the space of occupation measures. Quantitative exploration diagnostics include:
- Effort of Sequential Learning (ESL):

  $$\mathrm{ESL} \;=\; \frac{\sum_{k=0}^{K-1} W\big(\lambda_{\pi_k}, \lambda_{\pi_{k+1}}\big)}{W\big(\lambda_{\pi_0}, \lambda_{\pi_K}\big)},$$

  measuring the path length in SAOM-space relative to the geodesic (Nkhumise et al., 2024).
- Optimal Movement Ratio (OMR):

  $$\mathrm{OMR} \;=\; \frac{\sum_{k=0}^{K-1} \big[\,W(\lambda_{\pi_k}, \lambda_{\pi^{*}}) - W(\lambda_{\pi_{k+1}}, \lambda_{\pi^{*}})\,\big]_{+}}{\sum_{k=0}^{K-1} W\big(\lambda_{\pi_k}, \lambda_{\pi_{k+1}}\big)},$$

  quantifying the fraction of trajectory movement that directly reduces regret (Nkhumise et al., 2024).
These metrics offer insight into exploration efficiency, learning "wandering," and policy improvement; their empirical behavior correlates with algorithmic choices and task complexity. A minimal computation of both metrics is sketched below.
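One plausible instantiation of these diagnostics, under the ratio forms given above, uses SciPy's 1-D Wasserstein distance over a toy state index (the cited work uses more general optimal-transport machinery); the occupancy snapshots and the "optimal" occupancy are placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# ESL and OMR over a sequence of occupancy snapshots, treating each measure as
# a distribution on a 1-D state index so the 1-D Wasserstein distance applies.
support = np.arange(5)
snapshots = [np.array(w) for w in [[.6, .1, .1, .1, .1],
                                   [.3, .3, .2, .1, .1],
                                   [.1, .2, .3, .2, .2],
                                   [.05, .1, .2, .3, .35]]]
optimal = np.array([0.0, 0.05, 0.15, 0.3, 0.5])  # occupancy of an optimal policy

W = lambda u, v: wasserstein_distance(support, support, u, v)

path_len = sum(W(a, b) for a, b in zip(snapshots, snapshots[1:]))
esl = path_len / W(snapshots[0], snapshots[-1])  # total path vs. geodesic

# OMR: fraction of per-step movement that brings the occupancy closer to optimal.
useful = sum(max(0.0, W(a, optimal) - W(b, optimal))
             for a, b in zip(snapshots, snapshots[1:]))
omr = useful / path_len
print(f"ESL={esl:.3f}  OMR={omr:.3f}")
```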
5. Advanced Uses: Preference Learning, Intrinsic Motivation, and Non-Markovian Policies
In preference-based and multi-turn RL settings, e.g., direct preference optimization (DPO) and its multi-turn variant (DMPO), SAOMs replace per-state policy constraints with global occupancy-based KL constraints, yielding closed-form optima in which the partition function is state-independent and cancels across trajectory comparisons. This eliminates length bias in Bradley-Terry models for trajectory preferences. The occupancy-form KL also provides robustness to the compounding errors of behavioral cloning and enforces expert-like coverage globally (Shi et al., 2024).
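A hedged sketch of the resulting trajectory-level Bradley-Terry loss: because the log-partition term is state-independent under the occupancy-based KL, it cancels when the preferred and dispreferred trajectories are compared, leaving only summed policy/reference log-ratios. The function name, inputs, and scale `beta` are illustrative, not the paper's exact implementation.

```python
import numpy as np

# Trajectory-level Bradley-Terry preference loss in the DPO/DMPO spirit.
# Inputs are hypothetical per-step log-probs log pi(a_t|s_t), log pi_ref(a_t|s_t)
# for the preferred ("w") and dispreferred ("l") trajectories.
def trajectory_pref_loss(logp_w, logref_w, logp_l, logref_l, beta=0.1):
    # Summed log-ratios play the role of implicit rewards; the shared
    # log-partition term has already cancelled in the difference.
    margin = (np.sum(np.asarray(logp_w) - np.asarray(logref_w))
              - np.sum(np.asarray(logp_l) - np.asarray(logref_l)))
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid

# Toy usage: a 3-step preferred trajectory vs. a 2-step dispreferred one;
# no length normalization is needed since the partition term cancelled.
print(trajectory_pref_loss([-1.0, -0.5, -0.7], [-1.2, -0.9, -0.8],
                           [-0.4, -2.0], [-0.6, -1.0]))
```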
In intrinsic motivation and occupancy maximization frameworks, the agent seeks to maximize path or SAOM entropy, producing rich, exploratory, and goal-directed behaviors without external rewards. The Bellman-style equations tie the entropy-rewarded value to the expected visitation distribution, and computational approaches (e.g., entropic value iteration) yield convergence to occupancy-maximizing policies (Ramírez-Ruiz et al., 2022).
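As a simplified stand-in for entropic value iteration, the sketch below runs reward-free soft (log-sum-exp) value iteration, whose fixed point maximizes the discounted entropy of the action process; the operator in the cited work also accounts for state-transition entropy, so treat this only as a structural illustration on a toy MDP.

```python
import numpy as np

# Reward-free "soft" value iteration: the log-sum-exp backup plays the role of
# an entropy-rewarded Bellman operator, and the resulting policy spreads
# probability mass to keep future paths diverse. Toy kernel below.
rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']

V = np.zeros(nS)
for _ in range(500):
    Q = gamma * (P @ V)                        # Q[s, a] = gamma * E[V(s')]
    V = np.log(np.exp(Q).sum(axis=1))          # soft backup (entropy reward)

pi = np.exp(Q - V[:, None])                    # pi(a|s) proportional to exp(Q)
print(pi)                                      # near-uniform, path-diversifying
```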
A key theoretical result is that for any history-dependent (non-Markovian) policy, there exists a memoryless Markovian policy with identical occupancy. Thus, all SAOM-based theorems applicable to stationary policies extend directly to general policies, streamlining analysis of replay, off-policy learning, and blended data-generation processes (Laroche et al., 2022).
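A numeric check of this equivalence on a toy MDP: given the (estimated) occupancy of a deliberately history-dependent policy, the Markov policy obtained by conditioning it, $\pi'(a \mid s) = \lambda(s,a) / \sum_a \lambda(s,a)$, reproduces the same occupancy up to Monte Carlo error. All components below are illustrative.

```python
import numpy as np

# Markovianization check: the conditional of a policy's occupancy measure
# defines a Markov policy with the same occupancy.
rng = np.random.default_rng(4)
nS, nA, gamma, H, n_traj = 3, 2, 0.9, 40, 5000
P = rng.dirichlet(np.ones(nS), size=(nS, nA))

def occupancy(policy_fn):
    lam = np.zeros((nS, nA))
    for _ in range(n_traj):
        s, hist = 0, []
        for t in range(H):
            a = policy_fn(s, hist)
            lam[s, a] += gamma ** t
            hist.append((s, a))
            s = rng.choice(nS, p=P[s, a])
    return lam / n_traj

def hist_policy(s, hist):
    # History-dependent: favors action 0 until it has been taken twice.
    p0 = 0.9 if sum(a == 0 for _, a in hist) < 2 else 0.2
    return rng.choice(nA, p=[p0, 1 - p0])

lam_h = occupancy(hist_policy)
pi_m = lam_h / lam_h.sum(axis=1, keepdims=True)           # Markovian equivalent
lam_m = occupancy(lambda s, hist: rng.choice(nA, p=pi_m[s]))
print(np.abs(lam_h - lam_m).max())                        # small (MC noise only)
```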
6. Practical Implications and Pathologies
Occupation measure-based representations form the basis for scalable RL algorithms, especially when the Bellman structure breaks in general-utility or non-linear scenarios. Compactness results guarantee the feasibility and convergence of LP-based, gradient-based, and mixture-model-based methods; phantom measures are precluded by the absence of nontrivial invariant components (Dufour et al., 2023).
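As one concrete LP-based instance, the sketch below solves the classical occupancy LP for a linear utility with `scipy.optimize.linprog` and reads the policy off as $\pi(a \mid s) \propto \lambda(s,a)$; the MDP data are placeholders, and general non-linear utilities would replace the LP with convex programming over the same feasible set.

```python
import numpy as np
from scipy.optimize import linprog

# Classical occupancy LP: maximize <r, lambda> over the flow-balance polytope.
rng = np.random.default_rng(5)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.normal(size=(nS, nA))                   # toy reward table
rho = np.ones(nS) / nS                          # initial distribution

# Equalities: sum_a lam(s,a) - gamma * sum_{s',a'} P(s|s',a') lam(s',a') = rho(s)
A_eq = np.zeros((nS, nS * nA))
for s in range(nS):
    for s2 in range(nS):
        for a in range(nA):
            A_eq[s, s2 * nA + a] = float(s2 == s) - gamma * P[s2, a, s]

res = linprog(c=-r.ravel(), A_eq=A_eq, b_eq=rho, bounds=(0, None))
lam = res.x.reshape(nS, nA)
pi_opt = lam / lam.sum(axis=1, keepdims=True)   # optimal policy (up to ties)
```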
Estimation of SAOMs is central in both tabular and function-approximation regimes. Empirical studies confirm that sample complexity and approximation accuracy depend on the dimension of the function approximator and the quality of the underlying occupancy estimation procedure, not directly on the size of the underlying state-action space (Barakat et al., 2024, Barakat et al., 2023).
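A minimal sketch of the feature-based MLE step, assuming a linear-softmax density over a hypothetical feature map `Phi`: the log-likelihood gradient is the familiar moment-matching difference, and the statistical cost scales with the feature dimension `dim` rather than with the number of state-action pairs.

```python
import numpy as np

# Feature-based MLE occupancy estimation under a linear-softmax density model.
rng = np.random.default_rng(3)
n_pairs, dim = 50, 4                            # 50 flattened (s, a) pairs
Phi = rng.normal(size=(n_pairs, dim))           # feature map phi(s, a)
samples = rng.integers(0, n_pairs, size=2000)   # toy stand-in for visited pairs

w = np.zeros(dim)
for _ in range(500):                            # gradient ascent on log-likelihood
    logits = Phi @ w
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad = Phi[samples].mean(axis=0) - p @ Phi  # E_data[phi] - E_model[phi]
    w += 0.1 * grad

lam_hat = np.exp(Phi @ w); lam_hat /= lam_hat.sum()  # normalized SAOM estimate
```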
Analysis of pathologies demonstrates that without tight control (e.g., via uniform absorbency or discounting), families of occupation measures can escape compactness or include spurious non-policy distributions. The relevant compactness and absence-of-phantoms criteria are now precisely characterized (Dufour et al., 2023).
7. Summary Table: Key Definitions and Properties
| Concept | Formal Definition / Equation | Reference |
|---|---|---|
| Discounted Occupancy (SAOM) | $\lambda^{\pi}(s,a) = \sum_{t \ge 0} \gamma^{t}\,\mathbb{P}^{\pi}(s_t = s, a_t = a)$ | (Barakat et al., 2024) |
| Absorbing MDP Occupation Measure | $\mu^{\pi}(\Gamma) = \mathbb{E}^{\pi}_{\nu}\big[\sum_{t=0}^{T-1} \mathbf{1}\{(s_t,a_t) \in \Gamma\}\big]$ | (Dufour et al., 2023) |
| Occupancy-Measure Balance Equation | $\mu(B \times A) = \nu(B) + \int Q(B \mid s,a)\,\mu(ds,da)$ | (Dufour et al., 2023) |
| Markovian Equivalence for Any Policy | for any history-dependent $\pi$, there exists a Markov $\pi'$ with $\lambda^{\pi'} = \lambda^{\pi}$ | (Laroche et al., 2022) |
| ESL (Effort of Sequential Learning) | $\mathrm{ESL} = \sum_{k} W(\lambda_{\pi_k}, \lambda_{\pi_{k+1}}) \,/\, W(\lambda_{\pi_0}, \lambda_{\pi_K})$ | (Nkhumise et al., 2024) |
| Occupancy-based Preference-Optimized KL | constraint $\mathrm{KL}\big(\lambda^{\pi} \,\|\, \lambda^{\pi_{\mathrm{ref}}}\big)$, yielding a global, state-independent normalization constant | (Shi et al., 2024) |
| Function Approximation-based SAOM Estimation | $\lambda^{\pi} \approx \lambda_{w}$, fitted via MLE on samples from $\pi$ | (Barakat et al., 2024) |
References
- "Absorbing Markov Decision Processes" (Dufour et al., 2023)
- "How Does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories" (Nkhumise et al., 2024)
- "Non-Markovian policies occupancy measures" (Laroche et al., 2022)
- "Complex behavior from intrinsic motivation to occupy action-state path space" (Ramírez-Ruiz et al., 2022)
- "Towards Scalable General Utility Reinforcement Learning: Occupancy Approximation, Sample Complexity and Global Optimality" (Barakat et al., 2024)
- "Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space" (Barakat et al., 2023)
- "Direct Multi-Turn Preference Optimization for Language Agents" (Shi et al., 2024)