
Towards Formalizing Reinforcement Learning Theory (2511.03618v1)

Published 5 Nov 2025 in cs.LG and stat.ML

Abstract: In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.

Summary

  • The paper introduces a formal Lean 4 framework that rigorously proves almost sure convergence for Q-learning and linear TD algorithms.
  • It employs modern techniques including the Robbins-Siegmund theorem and Lyapunov functions to overcome limitations of traditional ODE-based proofs.
  • The framework rigorously constructs probability spaces for Markovian trajectories and is extensible to other reinforcement learning methods.

Formalizing the Almost Sure Convergence of Q-Learning and Linear TD in Reinforcement Learning

Introduction

This paper presents a formalization of the almost sure convergence of two foundational reinforcement learning (RL) algorithms—Q-learning and linear temporal difference (TD) learning—under Markovian sampling, using the Lean 4 theorem prover and the Mathlib library. The work addresses a longstanding gap in RL theory: while convergence proofs for these algorithms are well-established in the literature, they often rely on intricate, error-prone arguments, particularly those based on the ODE method, and typically omit full measure-theoretic rigor. This formalization provides a unified, extensible framework for verifying convergence properties, leveraging modern proof techniques centered on the Robbins-Siegmund theorem and Lyapunov functions, and sets a new standard for the robustness of RL theory.

Motivation and Context

The convergence of Q-learning and linear TD has been a central topic in RL theory, with numerous works addressing both finite and infinite state-action spaces. However, traditional proofs are delicate for two primary reasons:

  1. ODE-Based Proof Fragility: The canonical approach uses stochastic approximation via ODE methods, which are detail-heavy and susceptible to subtle errors, as evidenced by errata in peer-reviewed works and even textbooks.
  2. Measure-Theoretic Gaps: RL theory is typically developed in the MDP framework, requiring rigorous construction of probability spaces for infinite trajectories (via the Ionescu-Tulcea theorem), and careful handling of measurability and integrability. These aspects are often glossed over, especially for infinite state spaces.

Prior formalizations in proof assistants (Coq, Isabelle/HOL) have focused on dynamic programming or stochastic approximation for outdated algorithm variants, without addressing the modern forms of Q-learning and linear TD with Markovian noise. This work fills that gap, providing the first formal, machine-checked convergence proofs for these algorithms in their standard forms.

Formalization Framework

Mathematical Setting

The formalization considers an infinite-horizon MDP with finite state and action spaces, reward and transition functions, and a discount factor $\gamma \in [0,1)$. The two algorithms are:

  • Linear TD: For policy evaluation, with updates

$$w_{t+1} = w_t + \alpha_t \left( R_{t+1} + \gamma\, x(S_{t+1})^\top w_t - x(S_t)^\top w_t \right) x(S_t)$$

where $x(\cdot)$ is a feature map and $w_t$ the parameter vector.

  • Q-Learning: For control, with updates

$$q_{t+1}(s, a) = q_t(s, a) + \alpha_t \left( R_{t+1} + \gamma \max_{a'} q_t(S_{t+1}, a') - q_t(S_t, A_t) \right) \mathbb{I}_{(s, a) = (S_t, A_t)}$$

The goal is to formally prove that, under standard assumptions (irreducibility, aperiodicity, appropriate step sizes), the iterates converge almost surely to the correct fixed points ($w_*$ for TD, $q_*$ for Q-learning).
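
As an informal point of reference, the two updates can be written in a few lines of Python. This is an illustrative reimplementation, not the paper's Lean code; the feature map `x`, the step size `alpha`, and the transition variables are hypothetical placeholders.

```python
# Illustrative sketch only; the paper's artifact is a Lean 4 proof, not Python.
import numpy as np

def linear_td_step(w, alpha, gamma, x, s, r_next, s_next):
    """One linear TD(0) update; x is a feature map (hypothetical placeholder)."""
    # TD error: R_{t+1} + gamma * x(S_{t+1})^T w - x(S_t)^T w
    delta = r_next + gamma * x(s_next) @ w - x(s) @ w
    return w + alpha * delta * x(s)

def q_learning_step(q, alpha, gamma, s, a, r_next, s_next):
    """One tabular Q-learning update; only the (s, a) entry changes."""
    target = r_next + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])
    return q
```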

Probability Space Construction

A key technical contribution is the explicit construction of the probability space for sample paths using the Ionescu-Tulcea theorem, as formalized in Mathlib. This enables rigorous measure-theoretic treatment of Markovian trajectories, conditional expectations, and filtrations, which are essential for the convergence analysis.
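
The measure-theoretic construction itself cannot be compressed into a few lines of code, but the object it produces is the law of a familiar sequential sampling process. The sketch below is an informal Python illustration, not part of the formalization; `mu0` and `P` are placeholder inputs, and only finite path prefixes are sampled, whereas the Ionescu-Tulcea theorem supplies the measure on infinite paths.

```python
# Illustrative sketch only; the formal construction lives in Lean/Mathlib.
import numpy as np

def sample_path_prefix(mu0, P, horizon, seed=0):
    """Sample S_0, ..., S_{horizon-1} from initial distribution mu0 and
    row-stochastic kernel P (both hypothetical inputs). The Ionescu-Tulcea
    theorem is what upgrades this finite-horizon sampling scheme to a single
    probability measure on infinite paths, which is needed to even state
    almost sure convergence."""
    rng = np.random.default_rng(seed)
    n = len(mu0)
    s = rng.choice(n, p=mu0)
    path = [s]
    for _ in range(horizon - 1):
        s = rng.choice(n, p=P[s])
        path.append(s)
    return path
```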

Unified Proof Strategy

The formalization eschews the ODE-based approach in favor of a modern method based on the Robbins-Siegmund theorem, Lyapunov functions, and the skeleton iterates technique. The main steps are:

  1. Rewriting Updates: Both algorithms are cast in the form

$$w_{t+1} = w_t + \alpha_t \left( F(w_t, Y_{t+1}) - w_t \right)$$

where $Y_{t+1}$ encodes the relevant Markovian noise.

  2. Lyapunov Analysis: Existence of a suitable Lyapunov function $\phi$ is established, satisfying smoothness, norm equivalence, and contraction properties.
  3. Noise Decomposition: The update is decomposed into a main term and two noise terms: a leading martingale difference sequence and a smaller bias term.
  4. Robbins-Siegmund Application: By verifying that the Lyapunov function along the iterates forms an almost supermartingale, the Robbins-Siegmund theorem is invoked to conclude almost sure convergence (a toy numerical illustration follows this list).
  5. Skeleton Iterates for Markovian Noise: The skeleton iterates technique is used to handle the dependence structure in Markovian samples, reducing the problem to one amenable to the martingale difference analysis.
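
As a purely illustrative companion to these steps (not taken from the paper), the toy Python simulation below runs an update of the form $w_{t+1} = w_t + \alpha_t (F(w_t, Y_{t+1}) - w_t)$ in which the conditional mean of $F$ is a contraction toward a target $w_*$ and the residual noise is a martingale difference. The squared distance to $w_*$ plays the role of the Lyapunov function and decays as the Robbins-Siegmund argument predicts; all constants here are arbitrary toy choices.

```python
# Toy illustration of the proof pattern, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
nu = 0.8                          # step-size exponent: alpha_t = (t + 2) ** (-nu)
w_star = np.array([1.0, -2.0])    # toy target, standing in for the fixed point
w = np.zeros(2)

def F(w, xi):
    # Conditional mean is a contraction toward w_star; xi is zero-mean
    # (martingale difference) noise, so E[F(w, xi) | past] = w_star + 0.5 * (w - w_star).
    return w_star + 0.5 * (w - w_star) + xi

for t in range(200_000):
    alpha = (t + 2) ** (-nu)
    xi = rng.normal(scale=0.5, size=2)
    w = w + alpha * (F(w, xi) - w)

# The Lyapunov value ||w - w_star||^2 ends up close to zero.
print(np.sum((w - w_star) ** 2))
```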

Formal Theorem Statements

The main formalized results are:

  • Linear TD (Markovian samples): Under irreducibility, aperiodicity, full column rank of features, and step size $\alpha_t = (t+2)^{-\nu}$ with $\nu \in (2/3, 1)$, the iterates converge almost surely to the TD fixed point.
  • Linear TD (i.i.d. samples): Under the Robbins-Monro step size condition, convergence holds for i.i.d. samples.
  • Q-Learning (Markovian samples): Under analogous conditions, the Q-learning iterates converge almost surely to the optimal action-value function.

The restriction $\nu \in (2/3, 1)$ arises from the proof technique; extending to $\nu \in (1/2, 2/3]$ would require formalizing more advanced ODE-based arguments.
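
For intuition only, the hypothetical helpers below spell out what these step-size ranges mean for the schedule $\alpha_t = (t+2)^{-\nu}$; neither function exists in the formalization.

```python
# Hypothetical helpers for intuition; neither function exists in the formalization.

def robbins_monro(nu: float) -> bool:
    """For alpha_t = (t + 2) ** (-nu):
    sum_t alpha_t   diverges  iff nu <= 1,
    sum_t alpha_t^2 converges iff nu > 1/2,
    so the Robbins-Monro condition holds exactly when 1/2 < nu <= 1."""
    return 0.5 < nu <= 1.0

def covered_by_markovian_theorems(nu: float) -> bool:
    """Range of nu covered by the formalized Markovian-sample results."""
    return 2.0 / 3.0 < nu < 1.0

for nu in (0.4, 0.6, 0.7, 0.9, 1.0):
    print(nu, robbins_monro(nu), covered_by_markovian_theorems(nu))
```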

Implementation in Lean 4

The formalization is implemented in approximately 10,000 lines of Lean 4 code, leveraging Mathlib for measure theory, probability, and linear algebra. Key implementation aspects include:

  • Stochastic Matrices and Markov Chains: Classes for stochastic vectors, row-stochastic matrices, irreducibility, aperiodicity, and Doeblin minorization.
  • Probability Measures on Trajectories: Construction of the sample path measure via the Ionescu-Tulcea theorem.
  • Iterate Definitions: Encodings of the TD and Q-learning updates as recursive functions over sample paths.
  • Conditional Expectation: Formalization of conditional expectations in the context of Markov chains, requiring substantial code to bridge the gap between abstract measure-theoretic definitions and concrete matrix computations.
  • Supermartingale Arguments: Formalization of the Robbins-Siegmund theorem and its application to the Lyapunov function along the iterates.

The codebase is publicly available and can serve as a high-quality dataset for benchmarking LLMs on formal reasoning and code synthesis in mathematics and machine learning.
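
As a complement to the Lean development, the chain-level assumptions listed above can also be probed numerically. The sketch below is hypothetical and not part of the repository: it checks primitivity (for finite chains, equivalent to irreducibility plus aperiodicity) and extracts a Doeblin minorization constant from a matrix power; the example matrix is arbitrary.

```python
# Hypothetical numerical probes; not part of the Lean repository.
import numpy as np

def is_primitive(P):
    """A finite row-stochastic matrix is irreducible and aperiodic iff some
    power of it is entrywise positive; Wielandt's bound says checking powers
    up to (n - 1)^2 + 1 suffices."""
    n = P.shape[0]
    Q = np.eye(n)
    for _ in range((n - 1) ** 2 + 1):
        Q = Q @ P
        if np.all(Q > 0):
            return True
    return False

def doeblin_constant(P, k):
    """Largest eps with P^k(i, .) >= eps * nu(.) for every row i, taking
    nu(j) proportional to min_i P^k(i, j); eps > 0 is a Doeblin minorization
    and yields geometric mixing."""
    Pk = np.linalg.matrix_power(P, k)
    return float(Pk.min(axis=0).sum())

P = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.3, 0.7]])
print(is_primitive(P), doeblin_constant(P, k=3))
```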

Numerical and Theoretical Implications

The formalization confirms, with machine-checked rigor, the almost sure convergence of Q-learning and linear TD under standard conditions. The framework is immediately extensible to:

  • Convergence Rates: By telescoping the supermartingale inequality, $\mathcal{L}_2$ convergence rates can be obtained for i.i.d. samples; for Markovian samples, established techniques can be formalized.
  • Other Modes of Convergence: The approach can be extended to high-probability concentration, $\mathcal{L}_p$ convergence, and exponential tail bounds, contingent on formalizing auxiliary results such as Hoeffding's lemma.
  • Other Algorithms: The framework is adaptable to off-policy TD methods, gradient TD methods (without eligibility traces), and, with further development, to more complex algorithms involving time-inhomogeneous Markov chains (e.g., SARSA, policy gradient methods).

A notable technical challenge is the formalization of conditional expectations in the context of Markov chains, which required significant effort and code. This highlights the complexity of bridging intuitive probabilistic reasoning and formal, machine-checked proofs.

Implications for AI and Formal Methods

This work demonstrates the feasibility and value of fully formalizing nontrivial results in RL theory, setting a precedent for future efforts in the formal verification of machine learning algorithms. The formalization provides a robust foundation for further theoretical developments and can serve as a benchmark for evaluating the capabilities of LLMs and automated theorem provers in mathematical reasoning.

The project also provides empirical evidence regarding the current limitations of LLMs in formal mathematics: while LLMs (e.g., Gemini, ChatGPT) are valuable as tutors, search engines, and for small lemma synthesis, they are not yet capable of independently completing such a formalization. This underscores the need for continued research at the intersection of AI and formal methods.

Conclusion

The formalization of the almost sure convergence of Q-learning and linear TD in Lean 4 represents a significant advance in the rigor and reliability of RL theory. The developed framework is extensible to a broad class of RL algorithms and convergence properties, and the codebase provides a valuable resource for both the formal methods and machine learning communities. Future work will address more general step size regimes, more complex algorithms, and further integration with automated reasoning tools.


Explain it Like I'm 14

Overview

This paper takes two classic learning methods from reinforcement learning (RL)—Q‑learning and linear TD (temporal difference) learning—and writes their convergence proofs in a way that a computer can check line by line. The authors use Lean 4, a “theorem prover” that forces you to be completely precise. Their main goal is to show, with full mathematical rigor, that these algorithms really do learn the right answers over time when the data they see comes from a Markov chain (a type of random process).

What questions does the paper ask?

To make the results 100% reliable, the paper focuses on these practical questions:

  • Can we build a fully rigorous probability model for the data that Q‑learning and TD use (including infinite sequences of states)?
  • Under realistic conditions, can we prove that Q‑learning and linear TD converge almost surely (which means “with probability 1”)?
  • Can we do this using a modern, unified proof technique that’s friendly to formalization in Lean?
  • Can these formalized tools be reused to also prove rates of convergence and other types of guarantees later?

How did they do it? (Methods)

The authors combine careful mathematical modeling with a modern proof strategy that avoids older, more fragile methods.

The tools and setup

  • Lean 4 and Mathlib: These are like a super-strict math teacher and a big math library. Lean 4 checks every step; Mathlib provides many ready‑made parts.
  • Markov chains: Imagine walking through rooms where your next room depends only on your current room and a fixed rule. This “memoryless” process drives the randomness in RL.
  • Ionescu‑Tulcea theorem: A math result that lets you build a precise probability space for infinite sequences, like a never‑ending story of states and actions. This is needed to talk rigorously about “almost sure convergence.”
  • Stationary distribution: The long‑run “settled” pattern of how often you visit each state. The authors prove it exists and is unique under standard conditions (irreducible and aperiodic chains), using contraction ideas and Banach’s fixed‑point theorem.
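
If you prefer code to formulas, here is a tiny Python demo of that settling-down idea (informal, not from the paper): repeatedly applying the transition matrix to any starting distribution pulls it toward the stationary distribution, and the remaining gap shrinks geometrically. The small two-state chain used here is just a made-up example.

```python
# Informal demo, not from the paper.
import numpy as np

P = np.array([[0.9, 0.1],    # a tiny irreducible, aperiodic chain
              [0.2, 0.8]])

d = np.array([1.0, 0.0])     # start in state 0 with certainty
for _ in range(50):
    d = d @ P                # one step of the chain, in distribution
print(d)                     # close to the stationary distribution [2/3, 1/3]
print(np.abs(d @ P - d).max())   # ~0: one more step barely moves it
```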

The main proof idea (explained simply)

Think of the algorithm’s “error” as an energy meter that we want to push down to zero. A special function called a Lyapunov function plays the role of this energy meter.

  • Lyapunov function: A smartly chosen “energy” score (like a squared distance to the target) designed to go down when you update your estimate.
  • Robbins–Siegmund theorem: A result about sequences that keep decreasing on average, but have small wiggles (noise). It says that if you shrink by a little each step and the wiggles are controlled, your value will go to zero.
  • Martingale difference noise: Noise with no predictable bias; in plain terms, it doesn’t systematically push you up or down.
  • Skeleton (anchor) iterates: Instead of analyzing every single step directly (which is messy with Markovian noise), the authors look at carefully chosen “checkpoints” in time. Between checkpoints, they add up updates and organize the noise into two parts:
    • One part is big but unbiased (martingale difference).
    • The other part is smaller and more predictable (higher‑order).
    • Using anchors makes the noise behave like the kind the Robbins–Siegmund theorem can handle.

Step sizes and conditions (why they matter)

  • Step size αₜ: How big an update you make at time t. Here it's set to αₜ = 1/(t+2)^ν with ν in (2/3, 1). This choice ensures the steps get smaller, but not too fast or too slow, so learning remains stable and the math works out.
  • Irreducible and aperiodic Markov chains: Standard conditions meaning you can eventually reach any state (irreducibility), and you don’t get trapped in a repeating cycle (aperiodicity).

What did they find? (Results)

The paper formally proves, inside Lean, that:

  • Linear TD converges almost surely to its correct fixed point when data comes from a finite, irreducible, aperiodic Markov chain, using step sizes αₜ = 1/(t+2)^ν with ν in (2/3, 1).
  • Q‑learning converges almost surely to the optimal action‑value function q* under the same Markov chain conditions and the same step sizes.
  • There is also an independent-and-identically distributed (i.i.d.) version for linear TD: when the samples are independent and follow the stationary distribution, TD converges almost surely under the standard Robbins–Monro step‑size condition (the usual “sum of steps is infinite but sum of squares is finite” rule).

Why this is important:

  • These are cornerstone results in RL. Making them fully formal—so that a computer confirms every step—helps remove hidden gaps and errors that can creep into long, delicate proofs.
  • The approach is unified: the same Robbins–Siegmund framework and Lyapunov ideas apply to TD and Q‑learning, and can be extended to more results (like convergence rates and concentration bounds).

Implications and potential impact

This formalization is a solid foundation for future RL theory you can trust:

  • More guarantees: The same framework can be extended to prove convergence rates (how fast you learn), high‑probability bounds (how likely you are to be close to the target), and results for more advanced TD methods.
  • Better benchmarks for AI: The Lean code (about 10,000 lines) creates high‑quality “subgoals” that can test LLMs on precise mathematical reasoning, beyond casual math questions.
  • Honest view of AI help: The authors used tools like Gemini and ChatGPT to speed up learning and find lemmas, but current LLMs can’t do this full formalization alone. This shows how human‑AI collaboration can be powerful, while highlighting real limitations.

If you want to explore or reuse their formal proofs, the code is available at: https://github.com/ShangtongZhang/rl-theory-in-lean

Key ideas explained in everyday terms

  • Reinforcement learning: Learning by trial and error, like a robot that explores a maze and figures out better routes over time.
  • Q‑learning: Learns how good each action is in each state (q-values) and aims for the best possible future rewards.
  • TD learning: Learns the value of states by comparing predictions at successive steps (“temporal differences”).
  • Almost sure convergence: With probability 1, the estimates a learning algorithm makes get closer and closer to the true answer and eventually settle there.
  • Contraction: A rule that always pulls you closer to the target by a fixed fraction; it guarantees a unique fixed point and is great for proving convergence.
  • Measure‑theoretic probability: The super‑precise math behind probability. It’s the rigorous language the computer uses to avoid hand‑wavy arguments.

Knowledge Gaps

Below is a single, concrete list of knowledge gaps, limitations, and open questions that remain unresolved and point to actionable next steps for future work:

  • Step-size restriction under Markovian samples: the formal proofs require α_t = (t+2)^(-ν) with ν ∈ (2/3, 1); extend the formalization to ν = 1 (claimed straightforward), ν ∈ (1/2, 2/3] (requires strengthening skeleton-iterate bounds or ODE methods), and ν ∈ (0, 1/2] (likely needs recent ODE techniques such as Lauand et al., 2024).
  • Omitted i.i.d. Q-learning formalization: the i.i.d. version allowing general Robbins-Monro step sizes is mentioned but not formally proved; provide a Lean proof for i.i.d. Q-learning analogous to the TD case.
  • Finite MDP limitation: all results are proved for finite state-action spaces; identify minimal assumptions (e.g., standard Borel/Polish state spaces, boundedness, continuity/measurability of kernels) and formally extend the Ionescu–Tulcea and conditional expectation machinery to infinite/continuous MDPs.
  • Time-inhomogeneous Markov chains: algorithms like (linear) SARSA, Q-learning with evolving behavior policy (e.g., GLIE, projected variants), and policy gradient methods are out of scope; develop a formal framework that handles chains coupled with iterates and time-varying kernels.
  • Off-policy TD methods: gradient TD (GTD/GTD2/TDC) and emphatic TD are not formalized; tackle full traces (requiring general state space Markov chain analysis) or provide formal proofs for truncated traces first.
  • Lyapunov decay for Q-learning: the proof sketch relies on a pseudo-contraction and large-p norm equivalence without explicit constants; formally pin down the admissible p, contraction constants, and norm-equivalence factors, and verify Assumption (iv) in Lean for the fixed-behavior-policy setup.
  • Exploration assumptions for Q-learning: irreducibility and aperiodicity of the (S_t, A_t) chain are assumed but not tied to concrete exploration schemes (e.g., ε-greedy, GLIE); provide constructive conditions on π ensuring these properties and formally prove them.
  • Generalized step-size schedules: only deterministic, monotone polynomial schedules are treated; extend to randomized/adaptive step sizes, counter-based step sizes, or piecewise schedules, and give anchor constructions that preserve the needed inequalities.
  • Nonasymptotic results: the Robbins–Siegmund formalization covers a special asymptotic case; add nonasymptotic “almost supermartingale” variants and derive explicit finite-time L2 and a.s. rates under both i.i.d. and Markovian sampling.
  • Concentration and Lp convergence: concentration with exponential tails and Lp convergence are stated as straightforward but not provided; formalize core inequalities (Hoeffding’s lemma, Bernstein/Freedman for martingales, mixing-based concentration) and instantiate them for TD/Q-learning.
  • Conditional expectation tooling: the project includes a bespoke 1,000-line conditional expectation lemma for Markov chains; factor this into reusable Lean library components that handle path-space conditional expectations for kernels (Markov and history-dependent) in a general way.
  • Negative definiteness and invertibility of A for linear TD: the proof relies on A being negative definite and invertible (X full column rank, ergodic chain); supply complete Lean proofs of these spectral properties under minimal boundedness/regularity assumptions and clarify when they fail.
  • Linear TD beyond MRP: the current formalization uses on-policy MRP rewards R_{t+1} = r_π(S_t); extend to settings where rewards depend on (S_t, A_t) and to off-policy linear TD (including known divergence cases) with clear assumptions.
  • Noise growth bounds: Assumption on noise terms requires bounds like ∥e_{1,n+1}∥ ≤ C α_n (1 + ∥x_n∥²); explicitly state and formalize the needed boundedness (rewards, features, derivatives/Lipschitz constants of F/f) and verify these bounds in Lean.
  • Anchor construction: the skeleton-iterates proof hinges on choosing anchors t_m satisfying α_{t_m} ≤ C β_m²; provide a general, formally verified construction for common step-size families and document the constants.
  • Geometric mixing and minorization beyond finite: Doeblin minorization and contraction in the simplex are proved via irreducibility/aperiodicity for finite chains; generalize these tools (and their Lean formalizations) to broader state spaces with verifiable minorization conditions.
  • Function approximation in Q-learning: the formalized Q-learning result is tabular; extend to linear (and other structured) function approximation (cf. Liu et al., 2025) and formally verify the associated stability/contraction conditions.
  • Reducible or periodic chains: the framework assumes irreducible, aperiodic chains; characterize and formalize convergence behavior in reducible/periodic settings (e.g., convergence within recurrent classes, required modifications to proofs).
  • Broader Robbins–Siegmund variants: only a special case with deterministic, square-summable T_n is formalized; add more general versions (random T_n, additional error terms) frequently used in stochastic approximation.
  • ODE-based approach in Lean: the paper avoids ODE methods due to Mathlib limitations; outline and implement a roadmap to formalize modern ODE-based stochastic approximation (e.g., Borkar, Liu 2025) in Lean to cover step-size regimes and algorithms beyond the current Lyapunov–Robbins–Siegmund framework.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the paper’s Lean 4 formalizations, proof techniques, and codebase to improve reliability, education, tooling, and evaluation around reinforcement learning.

  • Lean RL Convergence Library adoption
    • Sector: software/AI, robotics, finance, healthcare
    • What: Use the open-source repo (rl-theory-in-lean) as a formal verification asset to check almost sure convergence of Q-learning and linear TD on finite MDPs under the paper’s assumptions.
    • Tools/products/workflows: “Lean RL Theorem Library” embedded in internal verification pipelines; templates to encode environment kernels and feature matrices; CI tasks that run Lean proofs before deployment.
    • Assumptions/dependencies: Finite state/action spaces; irreducible and aperiodic Markov chains; fixed behavior policy for Q-learning (no policy changes during training); linear TD feature matrix is full column rank; step sizes satisfy ν ∈ (2/3, 1) or Robbins–Monro; access to Lean 4/Mathlib.
  • Pre-deployment RL Convergence Auditor
    • Sector: robotics, industrial automation, operations research
    • What: A workflow to vet environment and hyperparameters before field deployment by a) modeling the transition kernel, b) checking irreducibility/aperiodicity, c) asserting Doeblin minorization after powers, d) verifying stationary distribution existence/geometric mixing, and e) confirming step-size schedules fit the formal guarantees.
    • Tools/products/workflows: CLI/SDK that ingests an MDP spec and emits a pass/fail report; scripts to estimate mixing via simulation; anchor schedule generator for skeleton iterates.
    • Assumptions/dependencies: Environment is stationary and reasonably modelable; the Markov kernel can be approximated or derived; step-size scheduling complies with the ν range or Robbins–Monro.
  • Safe hyperparameter tuner for RL
    • Sector: ML engineering/DevOps
    • What: Auto-select learning rate schedules that satisfy Robbins–Monro (i.i.d.) or ν ∈ (2/3, 1) (Markovian) and produce anchor sequences for skeleton iterates.
    • Tools/products/workflows: “Proof-aware” tuner integrated with PyTorch/JAX RL libraries; alerts when hyperparameters exit provable regimes.
    • Assumptions/dependencies: The RL algorithm matches the formalized variants (linear TD with MRP rewards; Q-learning with fixed behavior policy).
  • Formal safety-case documentation generator
    • Sector: regulated industries (medical devices, finance)
    • What: Generate machine-checkable Lean proof artifacts and human-readable summaries to include in safety audits, procurement, or compliance reviews.
    • Tools/products/workflows: Automated report builder that links the environment model to formal assertions (stationary distribution, mixing, convergence).
    • Assumptions/dependencies: Ability to abstract operational systems as finite MDPs; auditors/regulators accepting formal artifacts.
  • University course modules and assignments in formal RL
    • Sector: academia/education
    • What: Teaching materials and assignments where students encode MDPs and verify convergence formally in Lean; bridging measure-theoretic probability with RL.
    • Tools/products/workflows: Lab exercises based on the repo; skeleton iterates demonstrations; conditional expectation exercises over Markov chains.
    • Assumptions/dependencies: Basic Lean proficiency; finite environments; campus compute setups for Lean.
  • LLM benchmarking and training data for automated theorem proving
    • Sector: AI development
    • What: Expand the FormalML dataset with RL-theory subgoals; evaluate and train code agents and theorem-proving LLMs on measure-theoretic probability and RL proofs.
    • Tools/products/workflows: Subgoal generation pipelines; evaluation harnesses for agents; inclusion in leaderboard suites.
    • Assumptions/dependencies: Tactics for subgoal generation; sufficient compute; license compliance with the repo.
  • Research reproducibility and paper vetting
    • Sector: academia
    • What: Use the framework to check the convergence claims of new RL papers under their stated assumptions; begin formalizing convergence rates and concentration using Robbins–Siegmund-based techniques.
    • Tools/products/workflows: “Claim Checker” that attempts to instantiate the paper’s assumptions and produce Lean proofs or pinpoint gaps.
    • Assumptions/dependencies: Authors provide enough structure to encode kernels/features/step sizes; the target result aligns with Markovian/i.i.d. setups covered.
  • Quality assurance in simulators and synthetic environments
    • Sector: software engineering, simulation
    • What: Validate conditional expectations, sample-path probability spaces (via Ionescu–Tulcea), and stationarity assumptions in custom simulators.
    • Tools/products/workflows: Lean-backed validations embedded in simulation build pipelines; test harnesses for conditional expectations over chain-generated spaces.
    • Assumptions/dependencies: Simulator exposes kernels/filtrations; finite models; Lean integration into dev workflow.

Long-Term Applications

The following applications require further research, extended formalizations (e.g., ODE methods, general-state-space chains), or broader ecosystem development and policy alignment.

  • Formal RL safety certification standard
    • Sector: policy/regulation, auditing
    • What: Industry-wide standards requiring machine-checkable convergence proofs for RL components in safety-critical systems.
    • Tools/products/workflows: “Formal RL Safety” certification; audit templates referencing Lean artifacts.
    • Assumptions/dependencies: Regulator buy-in; community consensus; coverage of broader RL algorithms beyond those formalized here.
  • General-state-space and trace-based formalization
    • Sector: academia, advanced RL
    • What: Extend formal results to infinite/continuous spaces, emphatic TD with full traces, and general kernels.
    • Tools/products/workflows: Mathlib expansions for general state space Markov chains and conditional expectation; tooling for trace analysis.
    • Assumptions/dependencies: Significant measure-theoretic and Markov-chain formalizations; tractable assumptions for non-finite models.
  • ODE-based method formalization for broader step-size regimes
    • Sector: robotics/control, RL research
    • What: Formalize Borkar/Kushner/Liu ODE approaches to cover ν ∈ (1/2, 2/3], time-inhomogeneous chains, and more algorithms (e.g., SARSA).
    • Tools/products/workflows: Mathlib ODE/control libraries; proof templates for time-varying dynamics.
    • Assumptions/dependencies: Development of ODE/control theory in Mathlib; stable numerical bridges to proof assistants.
  • Proof-aware RL libraries
    • Sector: software/ML engineering
    • What: PyTorch/JAX plugins that auto-check assumptions, generate Lean proof artifacts during training, and gate deployment on verified guarantees.
    • Tools/products/workflows: Unified MDP schema; Lean–Python bridge; “block on proof failure” training mode.
    • Assumptions/dependencies: Robust cross-language tooling; standardized environment specs; moderate performance overhead.
  • Autonomous formalization agents
    • Sector: AI tools
    • What: Train LLM agents on the expanded FormalML dataset to autonomously complete subgoals and scale formalization of ML theory.
    • Tools/products/workflows: Agent training pipelines; human-in-the-loop blueprint iteration; compute clusters for proof search.
    • Assumptions/dependencies: Stronger LLMs; reliable agent orchestration; dataset growth across diverse ML theory.
  • Compliance tooling for procurement and vendor evaluation
    • Sector: finance, healthcare, public procurement
    • What: Require vendors to supply Lean-verified convergence artifacts and environment assumptions in RFPs and audits.
    • Tools/products/workflows: Procurement checklists; automated validators for submitted proofs and specs.
    • Assumptions/dependencies: Mature standards; accessible proof verification for non-experts.
  • Robust RL controllers in safety-critical systems
    • Sector: energy (grid stabilization), healthcare (clinical decision support), transportation (autonomy)
    • What: Design RL controllers whose convergence is backed by formal guarantees and verified mixing assumptions.
    • Tools/products/workflows: Modeling pipelines to derive/estimate kernels; formal checks embedded in controller certification.
    • Assumptions/dependencies: Accurate MDP abstraction; evidence for irreducibility/aperiodicity; acceptance of model mismatch bounds.
  • Conditional expectation and concentration libraries
    • Sector: software/verification, statistics
    • What: Formal toolkits for conditional expectations on chain-generated spaces, Hoeffding’s lemma, and nonasymptotic bounds.
    • Tools/products/workflows: Mathlib modules for concentration; APIs for martingale/supermartingale reasoning.
    • Assumptions/dependencies: Formalization of foundational inequalities; performance-conscious proof engineering.
  • Nonasymptotic guarantees and training-time predictors
    • Sector: ML engineering
    • What: Formal L2 and almost-sure convergence rates, exponential-tail concentration for i.i.d./Markovian noise; monitors that forecast training time to ε-accuracy under verified assumptions.
    • Tools/products/workflows: Analytics dashboards that translate proof bounds into operational SLAs; experiment design assistants.
    • Assumptions/dependencies: Formal nonasymptotic Robbins–Siegmund; Hoeffding-type results; stable estimation of mixing parameters.
  • Formalization of algorithms with changing behavior policies
    • Sector: RL products
    • What: Extend proofs to SARSA, projected Q-learning with policy changes, and policy gradient methods where the chain is time-inhomogeneous and coupled to iterates.
    • Tools/products/workflows: ODE/stochastic approximation frameworks; new skeleton iterates variants for nonstationary processes.
    • Assumptions/dependencies: Formal tools for time-inhomogeneous chains; additional regularity conditions; scalable proof automation.
  • MDP property inference and testing
    • Sector: data science, platform teams
    • What: Empirical tests to diagnose irreducibility, aperiodicity, and Doeblin minorization from rollout data and certify assumptions for proof-based guarantees.
    • Tools/products/workflows: Statistical test suites; bounds on mixing from samples; conservative certification heuristics.
    • Assumptions/dependencies: Sufficient data; reliable estimation under partial observability; error quantification.
  • Scalable education and outreach
    • Sector: education
    • What: MOOCs and interactive notebooks that teach formal RL, measure-theoretic probability, and Lean proofs, with bridges to Python RL stacks.
    • Tools/products/workflows: Web-based proof notebooks; graded exercises; educator toolkits.
    • Assumptions/dependencies: Better UX around Lean; onboarding materials; community support.

Glossary

  • Aperiodic: A property of a Markov chain indicating it does not get trapped in cycles with period greater than 1. "Let the finite Markov chain $\qty{S_t}$ be irreducible and aperiodic."
  • Almost sure convergence: Convergence that holds with probability 1 under the underlying probability measure. "In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library."
  • Banach's fixed point theorem: The contraction mapping principle ensuring a unique fixed point and convergence under a contraction on a complete metric space. "This allows us to invoke Banach's fixed point theorem to conclude the existence and uniqueness of the stationary distribution"
  • Bellman optimality operator: The operator that maps an action-value function to the one-step optimal Bellman update; its fixed point is the optimal action-value function. "the unique fixed point of the Bellman optimality operator $T_* \in R^{\mathcal{S} \times \mathcal{A}} \to R^{\mathcal{S} \times \mathcal{A}}$"
  • Conditional expectation: The expected value of a random variable given a sub-σ-algebra, representing information up to a certain time or event structure. "One also has to use a measure theoretic definition of conditional expectations with sub-$\sigma$-algebras in this probability space."
  • Contraction: An operator that shrinks distances by a factor strictly less than one under a given norm. "When a stochastic matrix is Doeblin minorizable, the corresponding operator is a contraction in the simplex."
  • Doeblin minorization: A uniform lower bound condition on transition probabilities ensuring strong mixing; there exists ε>0 and a reference measure ν such that P(i,·) ≥ ε ν(·). "An important consequence of irreducibility and aperiodicity is that they imply Doeblin minorization after sufficient powers."
  • Dvoretzky's theorem: A classical stochastic-approximation convergence result guaranteeing almost sure convergence for certain algorithms under conditions. "Dvoretzky's theorem can be used to prove the almost sure convergence of some (arguably outdated) version of $Q$-learning"
  • Filtration: An increasing sequence of σ-algebras modeling the accumulation of information over time. "There exists a filtration $\qty{\mathcal{F}_n}$ such that $x_n$ is measurable by $\mathcal{F}_n$ and $E[e_{1, n+1} | \mathcal{F}_n] = 0$ a.s."
  • Geometric mixing: A property that distances to stationarity contract at a geometric (exponential) rate. "This allows us to invoke Banach's fixed point theorem to conclude the existence and uniqueness of the stationary distribution as well as the geometric mixing property"
  • Gronwall's inequalities: Integral or discrete inequalities used to bound solutions of differential/difference inequalities and derive convergence rates. "A few Gronwall's inequalities further give $\lim_{t\to\infty} w_t = w_*$."
  • Hoeffding's lemma: A bound on the moment-generating function of bounded random variables, foundational for sub-Gaussian concentration. "Both should be straightforward if we can formalize Hoeffding's lemma."
  • Ionescu-Tulcea theorem: A theorem constructing a probability measure on infinite product spaces from an initial distribution and a sequence of transition kernels. "This is done by invoking the Ionescu-Tulcea theorem in Mathlib"
  • Irreducible: A property of a Markov chain where every state can be reached from every other state (possibly in multiple steps). "Let the finite Markov chain $\qty{S_t}$ be irreducible and aperiodic."
  • Lyapunov function: A nonnegative function acting as an energy measure to certify stability or convergence of iterates by showing it decreases along trajectories. "combining Lyapunov functions \citep{chen2024lyapunov} and Robbins-Siegmund theorem \citep{robbins1971convergence}"
  • Markov Decision Process (MDP): A framework for sequential decision making with states, actions, transitions, and rewards. "RL theory is typically formulated in the framework of Markov Decision Process (MDP, \citet{bellman1957markovian,puterman2014markov})."
  • Markov Reward Process (MRP): A Markov chain with rewards (i.e., an MDP with a fixed policy), often used for policy evaluation. "we follow \citet{tsitsiklis1997analysis} and consider a Markov Reward Process (MRP) setup"
  • Martingale difference sequence: A noise sequence with zero conditional mean given the past, i.e., E[e_{n+1} | F_n] = 0. "We now further assume that $\qty{e_{1, n}}$ is a Martingale difference sequence."
  • Probability kernel: A measurable mapping from each state to a probability measure on the next-state space, representing state-dependent randomness. "using probability kernels from Mathlib."
  • Pseudo-contraction: A mapping that is contractive under a particular (possibly weighted) norm or seminorm, ensuring a generalized contraction property. "prove that $T_*'$ is a pseudo-contraction, i.e., there exists a $\gamma' \in [0, 1)$"
  • Robbins-Monro condition: Step-size requirements for stochastic approximation: α_t > 0, ∑α_t = ∞, ∑α_t² < ∞. "where more step sizes are allowed as long as the step sizes satisfy the Robbins-Monro condition"
  • Robbins-Siegmund theorem: A convergence result for (almost) supermartingales that yields almost sure convergence under mild summability conditions. "This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem."
  • Sigma-algebra (σ-algebra): A collection of sets closed under complementation and countable unions, defining measurable events. "Let $\Omega$ be a set equipped with a $\sigma$-algebra"
  • Skeleton iterates: A technique analyzing iterates at selected anchor times to convert dependent (Markovian) noise into martingale-difference-like terms. "by using a skeleton iterates techniques \citep{qian2024almost} to convert Markovian noise to Martingale difference noise"
  • Stationary distribution: A probability distribution π such that πP = π, invariant under the Markov transition. "the stationary state distribution $d_\pi$"
  • Stochastic approximation: A class of iterative algorithms using noisy observations to find roots or fixed points of functions. "an ODE based stochastic approximation result"
  • Supermartingale: A stochastic process whose conditional expectation does not increase over time given the past. "This means that the sequence of functions $\qty{\omega \mapsto \phi(x_n(\omega) - x_*)}$ is almost a supermartingale."
  • Weighted Bellman optimality operator: A modification of the Bellman optimality operator with weights (e.g., by a behavior policy) to facilitate contraction properties. "define a weighted Bellman optimality operator as $(T_*' q)(s, a) \doteq d_{\pi_q}(s){\pi_q}(a|s)\qty[(T_* q)(s, a) - q(s, a)] + q(s, a)$."

Open Problems

We found no open problems mentioned in this paper.
