Fractional Policy Gradients (FPG)
- Fractional Policy Gradients (FPG) are a reinforcement learning framework that integrates Caputo fractional derivatives to capture long-term temporal dependencies.
- The framework enhances learning stability through power-law memory kernels that significantly reduce variance and improve sample efficiency.
- FPG’s recursive, constant-time update enables practical deployment in non-Markovian environments with delayed rewards and complex temporal structures.
Fractional Policy Gradients (FPG) are a reinforcement learning (RL) framework that introduces fractional calculus techniques—specifically Caputo fractional derivatives—into policy gradient algorithms to model long-term temporal dependencies in credit assignment. By replacing classical Markovian temporal-difference mechanisms with operators that induce power-law memory, FPGs fundamentally alter the temporal structure of value estimation and learning updates, leading to substantial improvements in variance control, sample efficiency, and computational tractability in non-Markovian or delayed-reward environments.
1. Theoretical Foundations and Motivation
Standard policy gradient algorithms, including REINFORCE, Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG), operate under the Markovian assumption. Their update rules decay the influence of past actions and states exponentially, rapidly attenuating the effect of early decisions in long-horizon tasks. This architectural limitation leads to high variance in gradient estimates, poor sample efficiency, and unstable learning in domains with significant delayed rewards or extended temporal dependencies.
Fractional calculus generalizes derivatives and integrals to non-integer (fractional) orders, introducing memory kernels with power-law decay. Unlike exponential discounting, power-law kernels maintain a persistent, slowly fading influence of past states and actions, which is theoretically well-matched to RL domains where long-term credit assignment is essential.
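To make the contrast concrete, the short calculation below compares an exponential discount kernel $\gamma^k$ against a power-law kernel $(k+1)^{-\alpha}$ at increasing lags; the particular values of $\gamma$ and $\alpha$ are illustrative choices, not parameters taken from the paper.

```python
# Illustrative only: gamma and alpha are arbitrary example values.
gamma, alpha = 0.95, 0.5
for k in (1, 10, 100, 1000):
    exp_w = gamma ** k            # exponential (Markovian) memory weight
    pow_w = (k + 1) ** (-alpha)   # power-law (fractional) memory weight
    print(f"lag {k:4d}: exponential {exp_w:.2e}   power-law {pow_w:.2e}")
# At lag 1000 the exponential weight is ~5e-23 while the power-law weight
# is still ~3e-2, i.e. early decisions retain non-negligible influence.
```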
FPG replaces the conventional temporal-difference operator with the Caputo fractional derivative, leading to a new class of estimators and value function updates that mathematically encode long-memory processes directly into policy optimization.
2. Fractional Gradient Formulation and Algorithmic Structure
Caputo Fractional Derivative in RL
For a function $f(t)$, the Caputo fractional derivative of order $\alpha \in (0,1)$ is given by

$$
{}^{C}\!D_t^{\alpha} f(t) \;=\; \frac{1}{\Gamma(1-\alpha)} \int_0^t \frac{f'(\tau)}{(t-\tau)^{\alpha}}\, d\tau,
$$

where $\Gamma(\cdot)$ is the Gamma function. This operation computes a weighted integral over the history of $f$, using a power-law kernel $(t-\tau)^{-\alpha}$, as opposed to the exponential kernel of conventional RL.
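As a numerical sanity check on the definition above, the sketch below approximates the Caputo derivative with the standard L1 discretization (an assumption here; the paper's own discretization is the Grünwald-Letnikov scheme discussed next) and compares it against the known closed form ${}^{C}\!D_t^{\alpha}\, t^2 = \frac{2}{\Gamma(3-\alpha)}\, t^{2-\alpha}$.

```python
import math

def caputo_l1(f, t, alpha, n=2000):
    """Approximate the Caputo derivative of order alpha in (0, 1) at time t
    using the standard L1 discretization on a uniform grid of n steps."""
    h = t / n
    coeff = h ** (-alpha) / math.gamma(2.0 - alpha)
    total = 0.0
    for j in range(n):
        b_j = (j + 1) ** (1.0 - alpha) - j ** (1.0 - alpha)
        total += b_j * (f(t - j * h) - f(t - (j + 1) * h))
    return coeff * total

alpha, t = 0.5, 2.0
approx = caputo_l1(lambda x: x ** 2, t, alpha)
exact = 2.0 / math.gamma(3.0 - alpha) * t ** (2.0 - alpha)
print(approx, exact)  # the two values should agree to a few decimal places
```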
In FPG, this approach results in fractional Bellman equations and policy gradient estimators of the schematic form

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t^{(\alpha)}\right],
\qquad
G_t^{(\alpha)} \;=\; \sum_{k=0}^{T-t} w_k^{(\alpha)}\, r_{t+k},
$$

where $w_k^{(\alpha)}$ are fractional weights that define a power-law influence on returns.
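The snippet below is a deliberately naive sketch of how a power-law weighted return could replace the usual $\gamma^k$-discounted return in a REINFORCE-style estimator. The kernel $(k+1)^{-\alpha}$ and the $O(T^2)$ loop are illustrative assumptions only; the recursive scheme in the next subsection is what removes this cost.

```python
import numpy as np

def power_law_returns(rewards, alpha):
    """Power-law weighted returns G_t = sum_k (k+1)^(-alpha) * r_{t+k}.
    Naive O(T^2) version, shown only to illustrate the estimator's structure."""
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    G = np.empty(T)
    for t in range(T):
        k = np.arange(T - t)
        G[t] = np.sum((k + 1.0) ** (-alpha) * rewards[t:])
    return G

# REINFORCE-style gradient contribution for one trajectory (score-function
# values `grad_log_pi` assumed to be supplied by the policy network):
#   g = sum_t grad_log_pi[t] * G[t]
```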
Discrete-Time Implementation and Recursion
While the Caputo derivative is defined in continuous time, FPG employs a discrete approximation using the Grünwald-Letnikov scheme, which allows for an exact recursive computation:

$$
D^{\alpha} f(t_n) \;\approx\; \frac{1}{h^{\alpha}} \sum_{k=0}^{n} w_k^{(\alpha)}\, f(t_{n-k}),
$$

with weights updated as

$$
w_0^{(\alpha)} = 1, \qquad w_k^{(\alpha)} = \left(1 - \frac{\alpha + 1}{k}\right) w_{k-1}^{(\alpha)}.
$$
This produces a history-dependent update that preserves all power-law temporal dependencies in the value estimate.
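A minimal sketch of the Grünwald-Letnikov weight recursion stated above; the truncation length `K` is an illustrative choice.

```python
import numpy as np

def gl_weights(alpha, K):
    """Grunwald-Letnikov weights w_k = (-1)^k * binom(alpha, k), generated by
    the recursion w_0 = 1, w_k = (1 - (alpha + 1) / k) * w_{k-1}."""
    w = np.empty(K)
    w[0] = 1.0
    for k in range(1, K):
        w[k] = (1.0 - (alpha + 1.0) / k) * w[k - 1]
    return w

print(gl_weights(0.5, 6))
# [ 1.  -0.5  -0.125  -0.0625  -0.0390625  -0.02734375 ]
# The tail decays like k^(-(1 + alpha)): a power law, not gamma^k.
```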
Crucially, FPG develops a recursive, constant-time computation for the fractional temporal-difference error:

$$
\delta_t^{(\alpha)} \;=\; \delta_t \;+\; c_t\, \delta_{t-1}^{(\alpha)} \;+\; \epsilon_t,
$$

where $\delta_t$ is the standard TD-error, $\delta_t^{(\alpha)}$ is the fractional TD-error, $c_t$ is an adaptively computed weight, and $\epsilon_t$ is a bounded error term decaying as $O(t^{-\alpha})$. This recursion ensures that both the time and memory footprint per step are $O(1)$, removing the potential computational bottleneck of fractional memory.
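A minimal sketch of the constant-time recursion, assuming the linear form displayed above; the weight schedule for `c_t` and the fixed clipping threshold are placeholders rather than the paper's exact choices.

```python
def fractional_td_step(delta_t, prev_frac_delta, c_t, clip=10.0):
    """One O(1) update of the fractional TD-error:
        frac_delta_t = delta_t + c_t * frac_delta_{t-1},
    followed by clipping for numerical stability. Only the previous
    fractional TD-error is carried between steps."""
    frac_delta = delta_t + c_t * prev_frac_delta
    return max(-clip, min(clip, frac_delta))

# Per environment step:
#   frac = fractional_td_step(td_error, frac, c_t)
```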
3. Statistical Guarantees and Convergence Properties
FPG’s utilization of fractional operators yields key theoretical improvements over classical policy gradients:
- Variance Reduction: The asymptotic variance of the FPG estimator diminishes as $O(t^{-\alpha})$, where $t$ is the number of iterations. The precise bound is
$$
\operatorname{Var}\!\left[\hat{\nabla}_\theta^{(\alpha)} J\right] \;\le\; \frac{C}{t^{\alpha}}\, \operatorname{Var}\!\left[\hat{\nabla}_\theta J\right],
$$
where $C$ is a constant depending on the learning problem and $\hat{\nabla}_\theta J$ denotes the standard policy gradient estimator.
- Convergence: FPG retains the almost-sure convergence properties of standard stochastic policy gradient algorithms: under the usual step-size conditions, $\lim_{t \to \infty} \nabla_\theta J(\theta_t) = 0$ almost surely.
The persistence of historical influence does not impede convergence, as both bias and variance introduced by long-memory terms decay sufficiently quickly.
4. Empirical Performance and Sample Efficiency
FPG was empirically evaluated on classic-control and continuous-control RL benchmarks (CartPole, MountainCar, Pendulum, and Hopper), with comparisons to leading policy gradient methods (REINFORCE, A2C, PPO, TRPO, DDPG).
Key results:
- Sample Efficiency: FPG achieved 35–68% reductions in the number of episodes needed to reach predefined performance thresholds compared to PPO, TRPO, and DDPG. For example, on MountainCar, FPG required 521 episodes versus PPO's 1085, a 52% saving.
- Gradient Variance: FPG exhibited 24–52% variance reduction in policy gradient estimates relative to PPO and A2C, matching the theoretical prediction.
- Computational Cost: The per-timestep computation of the FPG recursion was demonstrated to be orders of magnitude faster than full-history or finite impulse response (FIR) filtering, without accuracy loss.
- Ablation Studies: Removing the recursive update or adaptive clipping led to significant performance deterioration, indicating that both components are necessary for the observed stability and gains.
All improvements were statistically significant under hypothesis testing.
5. Connections to Broader RL Methodologies
FPG generalizes beyond the standard exponential memory paradigm, making it highly relevant for RL settings with partial observability, delayed rewards, or long-range dependence. It bridges techniques from fractional dynamics in physics and engineering with machine learning, offering a mathematically grounded path toward more adaptive temporal modeling.
The framework is compatible with deep learning architectures and can be integrated with modern RL infrastructure, potentially complementing methods such as transformer world models or factored policy gradients when complex temporal or structural dependencies exist.
6. Practical and Algorithmic Considerations
Implementation of FPG requires:
- Computing the Caputo fractional derivative using its recursive discrete formulation.
- Maintaining only the current TD-error and the previous fractional TD-error per trajectory, ensuring $O(1)$ state.
- Careful numerical stabilization (e.g., via logarithmic tracking of the recursion coefficients) and adaptive error clipping for robust operation over long episodes.
No significant additional computational resources are required compared to standard policy gradient algorithms, and FPG remains practical in large-scale, high-dimensional RL settings.
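Putting the requirements above together, here is a minimal bookkeeping sketch under stated assumptions: the class name `FractionalTDTracker`, the Grünwald-Letnikov-style coefficient tracked in log space, and the fixed clipping threshold are illustrative choices, not the paper's implementation.

```python
import math

class FractionalTDTracker:
    """O(1)-state tracker for a fractional TD-error along one trajectory."""

    def __init__(self, alpha, clip=10.0):
        self.alpha = alpha
        self.clip = clip
        self.prev_frac_delta = 0.0   # previous fractional TD-error
        self.log_coeff = 0.0         # log of the running recursion coefficient
        self.t = 0

    def update(self, delta_t):
        """Fold the current TD-error delta_t into the fractional TD-error."""
        self.t += 1
        if self.t > 1:
            # Track the cumulative GL-style coefficient in log space so long
            # episodes do not underflow the running product.
            self.log_coeff += math.log(1.0 - (self.alpha + 1.0) / self.t)
        c_t = math.exp(self.log_coeff)
        frac = delta_t + c_t * self.prev_frac_delta
        # Clipping keeps the recursion bounded over long horizons.
        frac = max(-self.clip, min(self.clip, frac))
        self.prev_frac_delta = frac
        return frac
```

Only `prev_frac_delta`, `log_coeff`, and the step counter are carried between steps, which is the $O(1)$ per-trajectory state referred to above.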
Table: Summary of Fractional Policy Gradient Contributions
Aspect | FPG Innovation |
---|---|
Memory Mechanism | Power-law (fractional) memory using Caputo derivatives |
Temporal Credit Assignment | Long-range, non-exponential, mathematically grounded |
Core Computation | Recursive update for fractional TD-error |
Theoretical Guarantees | Asymptotic variance reduction $O(t^{-\alpha})$, almost-sure convergence |
Efficiency | 35–68% sample savings, 24–52% variance reduction, constant-time update |
Practical Stability | Adaptive stabilization and numerically robust recursion |
7. Implications, Applications, and Future Directions
FPG constitutes a principled and computationally efficient approach for tackling RL problems where the Markovian assumption fails or where intricate long-term dependencies must be captured. Domains benefiting from FPG include:
- Robotics: extended temporal credit assignment and delayed-reward manipulation.
- Healthcare: clinical protocols with delayed multi-stage feedback.
- Resource/process management: tasks with consequences unfolding over long timescales.
- General RL scenarios with partial observability or latent process memory.
Prospective research directions include learning or adapting the fractional order $\alpha$ to the environment, scaling to high-dimensional and partially observable settings, and integrating FPG with deep memory or transformer-based architectures. Combining FPG with other variance-reduction techniques, or applying it in model-based RL regimes, may also yield further gains.
References: "Fractional Policy Gradients: Reinforcement Learning with Long-Term Memory" (arXiv:2507.00073), which provides the mathematical foundations, statistical analysis, computational characteristics, and empirical evaluation summarized here.