- The paper presents a novel framework that learns latent intentions from mixed-behavior, reward-free datasets using flow matching instead of Monte-Carlo roll-outs.
- It leverages generative occupancy modeling and expectile regression to model long-horizon futures and embed implicit generalized policy improvement.
- Empirical results show a median 1.8× improvement in returns and a 36% gain in success rate across diverse locomotion and manipulation tasks.
Intention-Conditioned Flow Occupancy Models (InFOM) introduces a practical recipe for building a reusable “foundation model” for reinforcement-learning tasks that:
- Learns latent intentions hidden in large, heterogeneous, reward-free datasets.
- Learns a generative model of long-horizon futures (discounted state-occupancy measures) using flow-matching instead of Monte-Carlo roll-outs.
- Supports fast downstream adaptation with a single reward-labeled dataset through an implicit form of Generalised Policy Improvement (GPI).
Below is an implementation-oriented walkthrough.
1. Problem setting
Offline data:
D = {(s, a, s′, a′)} collected by an unknown mixture of behavioural policies (different “users”).
Assumption (Consistency): consecutive transitions come from the same latent intention z.
Fine-tuning data:
D_reward = {(s, a, r)} collected in the downstream task, small relative to D.
Goal: From D learn reusable components so that, given D_reward, we can extract a high-performing policy with minimal additional training.
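For concreteness, here is a minimal sketch of the two batch formats as they might be represented in JAX; the field names and container types are illustrative, not taken from the paper:

```python
import jax.numpy as jnp
from typing import NamedTuple

class PretrainBatch(NamedTuple):
    """A batch from the reward-free dataset D: consecutive transitions (s, a, s', a')."""
    s: jnp.ndarray        # [B, obs_dim]
    a: jnp.ndarray        # [B, act_dim]
    s_next: jnp.ndarray   # [B, obs_dim]
    a_next: jnp.ndarray   # [B, act_dim]

class RewardBatch(NamedTuple):
    """A batch from the small task-specific dataset D_reward: (s, a, r)."""
    s: jnp.ndarray        # [B, obs_dim]
    a: jnp.ndarray        # [B, act_dim]
    r: jnp.ndarray        # [B]
```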
2. Model components
| Component | Symbol / network | Trained during | Purpose |
| --- | --- | --- | --- |
| Intention encoder | p_φ(z \| s′, a′) (MLP) | pre-train & finetune | infers the latent intention from the next transition |
| Flow occupancy model | vector field v_θ(t, s^f_t, s, a, z) | pre-train & finetune | generates samples s^f ∼ q_θ(s^f \| s, a, z) that approximate the discounted occupancy p^β_γ |
| Reward predictor | r_η(s) (MLP or CNN) | finetune | predicts the task reward for generated states |
| Critic | Q_ψ(s, a) | finetune | distils many intention-conditioned Q-values into one |
| Actor / policy | π_ω(a \| s) | finetune | maximises the distilled Q while staying near the behaviour data |
All networks use 4-layer MLPs with 512 hidden units and GELU, except image encoders (small IMPALA).
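A minimal Flax sketch of that backbone, assuming "4-layer" means four hidden layers before a linear output head (an interpretation, not stated explicitly above):

```python
import flax.linen as nn
import jax.numpy as jnp

class MLP(nn.Module):
    """4 hidden layers of 512 units with GELU, followed by a linear output head."""
    out_dim: int
    hidden: int = 512
    depth: int = 4

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        for _ in range(self.depth):
            x = nn.gelu(nn.Dense(self.hidden)(x))
        return nn.Dense(self.out_dim)(x)
```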
3. Pre-training stage
3.1 Variational intention inference
Minimise the following flow-ELBO (weighted by λ):
```
L(φ, θ) = FlowLoss(v_θ, p_φ) + λ · KL( p_φ(z | s′, a′) || N(0, I) )
```
Intuition:
(1) encoder compresses (s′, a′) into latent z (information bottleneck).
(2) flow model reconstructs future distribution conditioned on z, forcing encoder to capture task-specific intent.
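A sketch of how the two terms could be combined in code. It assumes a diagonal-Gaussian encoder, a hypothetical `encoder_params` helper exposing the posterior moments (μ, log σ), and reuses the `td_flow_loss` defined in Sec. 3.2 below as the reconstruction term:

```python
import jax.numpy as jnp
from jax import random

def pretrain_loss(params_enc, params_v, params_v_target, batch, key, gamma, lam):
    """Flow-ELBO sketch: TD flow-matching reconstruction + lam * KL(p_phi(z|s',a') || N(0, I))."""
    s, a, s_next, a_next = batch

    # Reconstruction term: the TD flow-matching loss from Sec. 3.2,
    # which samples z ~ p_phi(z | s', a') internally.
    flow_loss = td_flow_loss(params_v, params_v_target, params_enc, batch, key, gamma)

    # KL term, assuming the encoder exposes a diagonal-Gaussian posterior (mu, log_std).
    mu, log_std = encoder_params(params_enc, s_next, a_next)        # each [B, d]
    kl = 0.5 * jnp.sum(mu**2 + jnp.exp(2 * log_std) - 1.0 - 2 * log_std, axis=-1)

    return flow_loss + lam * jnp.mean(kl)
```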
3.2 SARSA-style Temporal-Difference Flow Matching
Use TD-Flow Matching [farebrother2025temporal] to avoid Monte-Carlo trajectories:
```
L_TD = (1 − γ) · CurrentFlow + γ · FutureFlow
```
- The current-flow term fits the vector field so that integrating from noise at t = 0 to t = 1 reproduces states from the current transition.
- The future-flow term bootstraps from (s′, a′) sampled from D and the encoder-generated z.
- The target vector field is an EMA copy of the online parameters, which stabilises learning (like a target Q-network).
Implementation snippet (JAX-like pseudo-code):
```python
import jax.numpy as jnp
from jax import random

def td_flow_loss(params_v, params_v_target, params_enc, batch, key, gamma):
    s, a, s_next, a_next = batch                      # each of shape [B, ...]
    B = s.shape[0]
    key_t, key_eps = random.split(key)

    # Infer the latent intention from the next transition (consistency assumption).
    z = encoder(params_enc, s_next, a_next)           # [B, d]

    t = random.uniform(key_t, shape=(B, 1))           # flow time in [0, 1]
    eps = random.normal(key_eps, shape=s.shape)       # Gaussian noise

    # Current flow term: transport noise (t = 0) to the current state (t = 1).
    s_t = t * s + (1 - t) * eps
    v_pred = flow(params_v, t, s_t, s, a, z)
    loss_curr = jnp.mean((v_pred - (s - eps)) ** 2)

    # Future flow term: bootstrap a future state from the EMA target flow,
    # conditioned on (s', a') and the inferred intention z.
    s_future = euler_integrate(params_v_target, eps, s_next, a_next, z)
    s_future_t = t * s_future + (1 - t) * eps
    v_target = flow(params_v_target, t, s_future_t, s, a, z)
    v_pred2 = flow(params_v, t, s_future_t, s, a, z)
    loss_future = jnp.mean((v_pred2 - v_target) ** 2)

    return (1 - gamma) * loss_curr + gamma * loss_future
```
4. Fine-tuning stage
4.1 Generative value estimation
Sample N futures from flow, evaluate reward, average → MC estimate of intention-specific Q:
```python
# Draw N noise seeds, push each through the flow, and average predicted rewards.
eps = random.normal(key, shape=(N, *s.shape))
s_futures = jnp.stack([euler_integrate(params_v, eps[i], s, a, z) for i in range(N)])
Q_hat = jnp.mean(r_eta(s_futures)) / (1 - gamma)   # MC estimate of Q(s, a, z)
```
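The `euler_integrate` helper used in both snippets is not spelled out above; a minimal fixed-step Euler solver consistent with the 10-step recommendation in Sec. 6 could look like this (`flow` is the same vector-field network v_θ as before):

```python
import jax.numpy as jnp

def euler_integrate(params_v, eps, s, a, z, num_steps: int = 10):
    """Fixed-step Euler integration of the learned vector field from noise (t=0) to t=1."""
    x = eps
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = jnp.full((*x.shape[:-1], 1), k * dt)      # current flow time, broadcast per sample
        x = x + dt * flow(params_v, t, x, s, a, z)    # one explicit Euler step
    return x                                          # approximate sample s^f ~ q_theta(.|s,a,z)
```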
4.2 Implicit GPI via expectile regression
We have infinitely many intentions z, so we cannot take an explicit maximum over z.
Instead, distil the stochastic estimates Q̂(s, a, z) into a single critic via upper-expectile regression (μ ≈ 0.9):
```
L_critic = E_{s,a,z}[ L_2^μ( Q̂(s, a, z) − Q_ψ(s, a) ) ],   where L_2^μ(u) = |μ − 1{u < 0}| · u²
```
The expectile plays the role of a soft maximum across intentions; as μ → 1 it approaches the hard max.
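A small sketch of the distillation step, where `critic` is an assumed helper returning Q_ψ(s, a) and `q_hat` holds the Monte-Carlo estimates from Sec. 4.1:

```python
import jax.numpy as jnp

def expectile_loss(diff, mu: float = 0.9):
    """Asymmetric squared loss L2^mu(u) = |mu - 1{u < 0}| * u^2."""
    weight = jnp.where(diff < 0, 1.0 - mu, mu)
    return weight * diff**2

def critic_loss(params_q, s, a, q_hat, mu: float = 0.9):
    """Distil intention-conditioned estimates Q_hat(s, a, z) into a single critic Q_psi(s, a)."""
    q = critic(params_q, s, a)                        # [B]
    return jnp.mean(expectile_loss(q_hat - q, mu))    # upper expectile over intentions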
4.3 Policy update
Stochastic actor that maximises the distilled critic, with a behaviour-cloning penalty weighted by α keeping it close to the data (conservative):
```
L_actor = − E_{s∼D_reward, a^π∼π(·|s)}[ Q_ψ(s, a^π) ] − α · E_{(s,a)∼D_reward}[ log π(a|s) ]
```
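As a sketch (`policy_sample`, `policy_log_prob`, and `critic` are assumed helpers; the exact regularisation used in the paper may differ):

```python
import jax.numpy as jnp
from jax import random

def actor_loss(params_pi, params_q, s, a_data, key, alpha: float):
    """Sketch of the actor update: maximise the distilled critic, stay close to dataset actions."""
    a_pi = policy_sample(params_pi, s, key)                    # a^pi ~ pi(.|s), reparameterised
    q_term = jnp.mean(critic(params_q, s, a_pi))               # policy-improvement term
    bc_term = jnp.mean(policy_log_prob(params_pi, s, a_data))  # behaviour-cloning term
    return -(q_term + alpha * bc_term)
```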
Practical tips:
- Update actor 4× slower than critic (asynchronous training).
- For image tasks, share CNN encoder between reward predictor, critic and policy.
5. Empirical results
Benchmarks:
- 16 ExORL locomotion tasks (state)
- 20 OGBench manipulation tasks (state)
- 4 OGBench manipulation tasks (RGB images)
Key numbers:
- Median 1.8× higher return and +36 % success rate vs best baseline.
- Jaco arm: 20× improvement over BC-style baselines, credited to modelling long-horizon futures.
- Vision tasks: survives raw-pixel input where model-free baselines fail.
Ablations:
- Replace variational z with hand-crafted skill encoders (FB, HILP) → worse or similar, but heavier to train.
- Replace implicit GPI by finite-sample GPI → larger variance & 44 % lower score.
- Remove z (one-step PI) → big drop, shows intentions matter.
Training cost: 1M pre-train + 0.5M fine-tune gradient steps (~4 h on single A6000 for state tasks).
6. Practical deployment guidelines
- Data: Works with any mixed-quality trajectories; more heterogeneity ⇒ richer intentions.
- Latent size: 128–512; tune on one task per domain (larger for locomotion, smaller for manipulation).
- Flow solver: Euler with 10 steps is enough; no need for expensive higher-order solvers.
- Regularisation:
- KL weight λ ∈ [0.01, 0.2].
- Behaviour-cloning weight α is critical: it prevents Q-value extrapolation in sparse-reward domains.
- Scaling to images: Add small IMPALA backbone, random-crop augmentation; keep flow in latent feature space to save memory.
- Serving: At runtime you only need the policy network. Flow and encoder are required only if you want on-the-fly adaptation to new rewards.
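The guidelines above can be gathered into a small configuration object; the class name and the defaults marked as assumptions are illustrative, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class InFOMConfig:
    """Hyperparameters collected from the deployment guidelines above."""
    latent_dim: int = 256            # tune within 128-512 per domain
    kl_weight: float = 0.05          # lambda, typically in [0.01, 0.2]
    bc_alpha: float = 1.0            # behaviour-cloning weight (task-dependent assumption)
    euler_steps: int = 10            # fixed-step Euler solver is sufficient
    pretrain_steps: int = 1_000_000
    finetune_steps: int = 500_000
```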
7. Take-aways for practitioners
- Generative occupancy modelling with flow matching provides stable access to long-horizon futures without autoregressive roll-outs and their compounding error.
- Latent-variable conditioning lets the same model explain multi-task datasets while avoiding the cost of learning one policy per skill.
- Expectile distillation is a simple, derivative-free way to embed GPI inside standard actor-critic codebases.
Overall, InFOM offers a modular and computationally tractable blueprint for anyone wanting to turn large, unlabelled robot logs into a single reusable policy backbone.