Intention-Conditioned Flow Occupancy Models

Updated 30 June 2025
  • InFOM is a probabilistic RL framework that uses flow matching and latent intention inference to model discounted state occupancy distributions.
  • The approach integrates temporal-difference learning with generative modeling to enable robust, multi-step predictions from off-policy data.
  • Empirical results show a 1.8x improvement in returns and a 36% increase in success rates over prior methods, demonstrating scalability across diverse RL tasks.

Intention-Conditioned Flow Occupancy Models (InFOM) represent a probabilistic modeling paradigm within reinforcement learning (RL) in which the future state occupancy distribution of an agent is learned as a function of a latent intention variable. InFOM leverages principles from generative modeling—specifically, flow matching—to construct expressive, reusable "foundation models" that encode the temporal and intentional structure of large, diverse, task-agnostic datasets. The method is motivated by the desire to provide an RL equivalent of large-scale pre-training in vision and language domains, enabling sample-efficient, robust adaptation to downstream tasks by conditioning on inferred user intent.

1. Fundamental Principles and Model Formulation

Intention-Conditioned Flow Occupancy Models are designed to model the discounted state occupancy measure of an agent in a Markov Decision Process, where the occupancy distribution is explicitly conditioned on a latent intention variable capturing user or agent goals. Formally, for a given policy $\pi$, the (discounted) occupancy measure is

$$p^\pi_\gamma(s_f \mid s, a) = (1 - \gamma) \sum_{h=0}^{\infty} \gamma^h\, p^\pi_h(s_f \mid s, a),$$

where $p^\pi_h(s_f \mid s, a)$ is the probability density of reaching $s_f$ after $h$ steps from the state-action pair $(s, a)$.
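
To make this definition concrete, the sketch below draws a sample from $p^\pi_\gamma(\cdot \mid s, a)$ by sampling a geometric horizon and rolling a policy forward; the resettable `env.set_state`, gym-style `env.step`, and `policy` interfaces are hypothetical stand-ins used only for illustration.

```python
# Minimal sketch (assumed interfaces, not the paper's code): sampling
# s_f ~ p^pi_gamma(. | s, a) by drawing a horizon h with P(h) = (1 - gamma) * gamma^h
# and rolling the policy forward from (s, a).
import numpy as np

def sample_future_state(env, policy, s, a, gamma=0.99, rng=np.random):
    """Return a state reached h steps after (s, a), with P(h) = (1 - gamma) * gamma^h."""
    h = rng.geometric(1.0 - gamma) - 1   # numpy's geometric is supported on {1, 2, ...}; shift to h >= 0
    env.set_state(s)                     # hypothetical: reset the simulator to state s
    state, action, done = s, a, False
    for _ in range(h):
        if done:                         # terminal states cut the rollout short
            break
        state, _, done, _ = env.step(action)   # gym-style step: (obs, reward, done, info)
        action = policy(state)
    return state
```

InFOM itself fits this distribution from off-policy data rather than by rolling out the environment; the sampler above only conveys which distribution is being modeled.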

InFOM models this distribution using a generative flow-matching approach: the occupancy is represented as the distribution obtained by pushing simple noise through a neural ordinary differential equation (ODE), whose flow field $v(t, x)$ transports noise samples into plausible future states. The approach further introduces a latent intention variable $z$ that is inferred from behavioral context (e.g., from a subsequent transition $(s', a')$), and both inference and generative modeling are performed with explicit conditioning on this intention.
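
As a sketch of how such a flow produces future states (assuming a trained conditional vector field `v` implemented as a `torch.nn.Module` and plain forward-Euler integration, both illustrative choices rather than details taken from the paper):

```python
# Minimal sketch: generate future-state samples by integrating the learned
# conditional flow field from Gaussian noise at t = 0 to a future state at t = 1.
import torch

@torch.no_grad()
def sample_future_states(v, s, a, z, n_steps=32):
    """Euler-integrate dx/dt = v(t, x, s, a, z) from x(0) ~ N(0, I) to a sample at t = 1."""
    x = torch.randn_like(s)                       # start from Gaussian noise shaped like the state
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((s.shape[0], 1), k * dt)   # current flow time in [0, 1)
        x = x + dt * v(t, x, s, a, z)             # forward Euler step along the conditional flow field
    return x
```

Higher-order ODE solvers can be substituted for the Euler step; the number of integration steps trades off sample quality against inference cost.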

2. Flow Matching and Temporal Difference Learning

The central training mechanism is a flow matching loss inspired by the recent literature on generative flow models. Unlike vanilla flow matching, InFOM incorporates temporal difference learning through the use of a "SARSA flow," enabling multi-step temporal bootstrapping:

$$\mathcal{L}_{\text{SARSA flow}}(v_d, p_e) = (1-\gamma)\, \mathcal{L}_{\text{current}} + \gamma\, \mathcal{L}_{\text{future}},$$

where

$$\mathcal{L}_{\text{current}} = \mathbb{E}_{(s, a, s', a'),\, z,\, t,\, \epsilon} \left[ \big\| v_d(t, s^t, s, a, z) - (s - \epsilon) \big\|^2 \right],$$

$$\mathcal{L}_{\text{future}} = \mathbb{E}_{(s, a, s', a'),\, z,\, t,\, \epsilon} \left[ \big\| v_d(t, \bar{s}_f^t, s, a, z) - \bar{v}_d(t, \bar{s}_f^t, s, a, z) \big\|^2 \right].$$

Here, $v_d$ is the flow field parameterizing the generative process, $p_e(z \mid s', a')$ is the encoder's variational posterior over intentions, and $\gamma$ is the discount factor. The "current" term matches the vector field to observed transitions, while the "future" term bootstraps the field against next-step predictions in a dynamic programming fashion.

This enables InFOM to exploit the structure of RL for efficient learning from off-policy, reward-free data, while still capturing long-horizon dependencies.
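
A minimal sketch of the SARSA flow objective is shown below, assuming linear interpolants between Gaussian noise and data, a frozen target copy of the flow field, and batched tensors; these interface and interpolation choices are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of the SARSA-style flow-matching loss described above.
import torch
import torch.nn.functional as F

def sarsa_flow_loss(v_d, v_target, s, a, z, s_future_bar, gamma=0.99):
    """One-batch estimate of (1 - gamma) * L_current + gamma * L_future."""
    B = s.shape[0]
    t = torch.rand(B, 1)                               # flow time ~ Uniform(0, 1)
    eps = torch.randn_like(s)                          # noise endpoint of the interpolant

    # Current term: regress the field toward the straight-line velocity (s - eps)
    # along the interpolant s^t = (1 - t) * eps + t * s.
    s_t = (1 - t) * eps + t * s
    loss_current = F.mse_loss(v_d(t, s_t, s, a, z), s - eps)

    # Future term: bootstrap against a frozen target field evaluated at
    # future-state samples s_future_bar produced by the target flow model.
    sf_t = (1 - t) * eps + t * s_future_bar
    with torch.no_grad():
        target = v_target(t, sf_t, s, a, z)
    loss_future = F.mse_loss(v_d(t, sf_t, s, a, z), target)

    return (1 - gamma) * loss_current + gamma * loss_future
```

In practice the target field $\bar{v}_d$ would typically be a slowly updated copy (e.g., an exponential moving average) of the online field, mirroring target networks in TD learning.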

3. Latent Intention Inference and Conditioning

A pivotal feature of InFOM is its approach to latent intention modeling. The model assumes that observed agent behavior is driven by unobserved task-specific intent, which is treated as a stochastic latent variable $z$. InFOM infers $z$ from pairs of consecutive states and actions $(s', a')$ using a variational encoder. The model is trained using a variational lower bound (ELBO):

$$\mathcal{L}(p_e, q_d) = \mathbb{E}_{p^\beta(s, a, s_f, s', a')}\Big[ \mathbb{E}_{p_e(z \mid s', a')} \big[ \log q_d(s_f \mid s, a, z) \big] - \lambda\, \mathrm{KL}\big(p_e(z \mid s', a') \,\|\, p(z) \big) \Big],$$

where $q_d$ specifies the conditional state occupancy flow, while the $\mathrm{KL}$ term regularizes intention inference towards a prior.
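
The sketch below illustrates this objective with a Gaussian variational encoder and a reparameterized intention sample; `encoder` and `decoder_log_prob` are illustrative placeholders, and in practice the log-density term is approximated through the flow-matching objective of Section 2 rather than evaluated exactly.

```python
# Minimal sketch of the intention-inference ELBO with a Gaussian encoder.
import torch

def intention_elbo(encoder, decoder_log_prob, s, a, s_future, s_next, a_next, kl_weight=0.1):
    """Reparameterized one-sample estimate of the ELBO described above."""
    mu, log_std = encoder(s_next, a_next)                 # variational posterior p_e(z | s', a')
    std = log_std.exp()
    z = mu + std * torch.randn_like(std)                  # reparameterized intention sample

    recon = decoder_log_prob(s_future, s, a, z)           # stands in for log q_d(s_f | s, a, z), per example

    # Closed-form KL( N(mu, std^2) || N(0, I) ), summed over latent dimensions.
    kl = 0.5 * (mu.pow(2) + std.pow(2) - 2.0 * log_std - 1.0).sum(dim=-1)

    return (recon - kl_weight * kl).mean()                # maximize this lower bound
```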

Conditioning future state prediction on $z$ allows the model to (1) generate future state distributions specific to different intended behaviors, (2) improve expressivity and coverage on diverse datasets, and (3) support generalized policy improvement by maximizing over inferred intentions during downstream adaptation.

4. Policy Extraction via Generalized Policy Improvement

During fine-tuning on downstream tasks, InFOM performs generalized policy improvement (GPI). For a given state-action-intention tuple, a Q-value estimate is constructed via Monte Carlo:

$$\hat{Q}(s, a, z) = \frac{1}{(1 - \gamma) N} \sum_{i=1}^N r\big(s_f^{(i)}\big), \qquad s_f^{(i)} \sim q_d(s_f \mid s, a, z).$$

To avoid the instability associated with maximizing over a finite set of sampled intentions $z$, a Q-function distillation approach is adopted using an upper-expectile loss:

$$\mathcal{L}^{\text{expectile}}(Q) = \mathbb{E}_{(s, a),\, z \sim p(z)} \left[ L_2^\mu\big(Q(s, a) - \hat{Q}(s, a, z)\big) \right],$$

which achieves an effect similar to soft generalized policy improvement across the continuous latent intention space. The resulting Q-function is then used in policy improvement or for extracting an adaptive actor via standard policy optimization (with regularization).
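
A sketch of both steps follows, using a generative sampler as a stand-in for $q_d$ and an IQL-style expectile weight; the module names and the convention that $\mu > 0.5$ up-weights underestimation are assumptions chosen to match the upper-expectile behavior described above.

```python
# Minimal sketch (illustrative interfaces): Monte Carlo Q estimate from sampled
# future states, followed by upper-expectile distillation into a single critic.
import torch

@torch.no_grad()
def monte_carlo_q(flow_sampler, reward_fn, s, a, z, gamma=0.99, n_samples=16):
    """Q_hat(s, a, z) = 1 / ((1 - gamma) * N) * sum_i r(s_f^(i)), with s_f^(i) ~ q_d(. | s, a, z)."""
    total = 0.0
    for _ in range(n_samples):
        s_f = flow_sampler(s, a, z)                    # one sample from the learned occupancy flow
        total = total + reward_fn(s_f)                 # downstream reward at the sampled future state
    return total / ((1.0 - gamma) * n_samples)

def expectile_distillation_loss(q_net, q_hat, s, a, mu=0.9):
    """Push Q(s, a) toward an upper expectile of Q_hat(s, a, z) over sampled intentions z."""
    diff = q_hat - q_net(s, a)                         # positive when the critic underestimates the MC target
    weight = torch.abs(mu - (diff < 0).float())        # mu > 0.5 weights underestimation more heavily
    return (weight * diff.pow(2)).mean()
```

Here `q_hat` would be computed for intentions drawn from the prior $p(z)$, matching the expectation in the distillation loss.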

5. Empirical Performance and Comparison to Baselines

InFOM has been evaluated on 36 state-based and 4 image-based benchmark RL tasks, including challenging robotic manipulation and navigation scenarios. Experiments follow a two-stage protocol: reward-free pre-training on large, diverse datasets (with many users/tasks/intents) and subsequent fine-tuning on specific downstream tasks.

Results highlight:

  • InFOM achieves a 1.8x median improvement in returns and increases success rates by 36% over previous state-of-the-art methods, including actor-critic, behavioral cloning, successor features, and representation learning baselines.
  • The method is especially effective on domains with diverse or weak reward structure, and where unmodeled intention multimodality degrades direct policy or Q-function pre-training.
  • Ablation studies reveal that (1) the variational latent intention approach is more effective than skill discovery alternatives (e.g., HILP), and (2) implicit Q-value distillation via expectile provides more robust adaptation compared to explicit maximization.

6. Practical Implications and Applications

The intention-conditioned structure of InFOM directly addresses fundamental challenges in RL foundation modeling:

  • Temporal Abstraction: By predicting entire occupancy distributions, InFOM is able to reason over arbitrarily long horizons, capturing consequences of actions that may only be revealed after many steps.
  • Intentional Generalization: Modeling intentions via a latent variable increases expressivity, enabling models trained on heterogeneous agent data to adapt flexibly to new or recomposed downstream tasks.
  • Sample Efficiency: The foundation model approach allows rapid fine-tuning via policy improvement and distillation, reducing the need for task-specific data collection.
  • Applicability: InFOM is suitable for robotics (manipulation and navigation), multi-task RL, and generalist agents, and is extensible to both low-dimensional (state) and image-based RL domains.

7. Comparative Perspective and Limitations

Relative to prior representation learning (DINO, CRL), model-based RL (MBPO), behavioral cloning, and skill discovery approaches, InFOM is uniquely designed to unify modeling of future state distributions and latent intentions. Successor-feature methods and alternatives such as HILP require more structure or explicit actor-critic RL, while InFOM directly fits generative state distributions.

A limitation is that the latent intention encoder currently infers intentions from pairs rather than sequences of transitions, which may limit its ability to model highly complex or long-term intentions. There can also be sensitivity to hyperparameters such as the latent code dimension or ELBO regularization.

Table: Comparison Overview

| Feature | InFOM Approach | Prior Methods |
| --- | --- | --- |
| Models occupancy (future states) | Yes (flow matching) | No or trajectory-only |
| Intention conditioning | Continuous latent (ELBO) | Discrete/none |
| Temporal abstraction | Discounted occupancy | 1-step/actor-centric |
| Q-value extraction | Q-distillation (expectile) | Explicit maximization |
| Downstream return, success | Highest (1.8x, +36%) | Lower |

Conclusion

Intention-Conditioned Flow Occupancy Models establish a scalable, generalizable paradigm for RL foundation models. By integrating flow matching, variational latent intention inference, and generalized policy improvement, InFOM achieves significant improvements in performance and adaptability on challenging, multi-intent domains, setting a new direction for the development of expressive, reusable RL solutions pre-trained on large, heterogeneous behavioral data.