- The paper presents a novel framework that learns latent intentions from mixed-behavior, reward-free datasets using flow matching instead of Monte-Carlo roll-outs.
- It leverages generative occupancy modeling and expectile regression to model long-horizon futures and embed implicit generalized policy improvement.
- Empirical results show a median 1.8× improvement in returns and a 36% gain in success rate across diverse locomotion and manipulation tasks.
Intention-Conditioned Flow Occupancy Models (InFOM) introduces a practical recipe for building a reusable “foundation model” for reinforcement-learning tasks that:
- Learns latent intentions hidden in large, heterogeneous, reward-free datasets.
- Learns a generative model of long-horizon futures (discounted state-occupancy measures) using flow-matching instead of Monte-Carlo roll-outs.
- Supports fast downstream adaptation with a single reward-labeled dataset through an implicit form of Generalised Policy Improvement (GPI).
Below is an implementation-oriented walkthrough.
1. Problem setting
Offline data:
D = {(s, a, s′, a′)} collected by an unknown mixture of behavioural policies (different “users”).
Assumption (Consistency): consecutive transitions come from the same latent intention z.
Fine-tuning data:
D_reward = {(s, a, r)} collected in the downstream task, small relative to D.
Goal: From D learn reusable components so that, given D_reward, we can extract a high-performing policy with minimal additional training.
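For concreteness, here is a minimal sketch of the two batch formats as they might be represented in JAX; the field names and container types are illustrative, not taken from the paper:

```python
import jax.numpy as jnp
from typing import NamedTuple

class PretrainBatch(NamedTuple):
    """A batch from the reward-free dataset D: consecutive transitions (s, a, s', a')."""
    s: jnp.ndarray        # [B, obs_dim]
    a: jnp.ndarray        # [B, act_dim]
    s_next: jnp.ndarray   # [B, obs_dim]
    a_next: jnp.ndarray   # [B, act_dim]

class RewardBatch(NamedTuple):
    """A batch from the small task-specific dataset D_reward: (s, a, r)."""
    s: jnp.ndarray        # [B, obs_dim]
    a: jnp.ndarray        # [B, act_dim]
    r: jnp.ndarray        # [B]
```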
2. Model components
| Component | Symbol / network | Trained during | Purpose |
| --- | --- | --- | --- |
| Intention encoder | p_φ(z \| s′, a′) (MLP) | pre-train & finetune | infers the latent intention from the next transition |
| Flow occupancy model | vector field v_θ(t, s^f_t, s, a, z) | pre-train & finetune | generates samples s^f ∼ q_θ(s^f \| s, a, z) that approximate the discounted occupancy p^β_γ |
| Reward predictor | r_η(s) (MLP or CNN) | finetune | predicts the task reward for generated states |
| Critic | Q_ψ(s, a) | finetune | distils many intention-conditioned Q-values into one |
| Actor / policy | π_ω(a \| s) | finetune | maximises the distilled Q while staying near the behaviour data |
All networks use 4-layer MLPs with 512 hidden units and GELU, except image encoders (small IMPALA).
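A minimal Flax sketch of that backbone, assuming "4-layer" means four hidden layers before a linear output head (an interpretation, not stated explicitly above):

```python
import flax.linen as nn
import jax.numpy as jnp

class MLP(nn.Module):
    """4 hidden layers of 512 units with GELU, followed by a linear output head."""
    out_dim: int
    hidden: int = 512
    depth: int = 4

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        for _ in range(self.depth):
            x = nn.gelu(nn.Dense(self.hidden)(x))
        return nn.Dense(self.out_dim)(x)
```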
3. Pre-training stage
3.1 Variational intention inference
Minimise the following flow-ELBO (weighted by λ):
```
L(φ, θ) = FlowLoss(v_θ, p_φ) + λ · KL( p_φ(z | s′, a′) || N(0, I) )
```
Intuition:
(1) encoder compresses (s′, a′) into latent z (information bottleneck).
(2) flow model reconstructs future distribution conditioned on z, forcing encoder to capture task-specific intent.
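A sketch of how the two terms could be combined in code. It assumes a diagonal-Gaussian encoder, a hypothetical `encoder_params` helper exposing the posterior moments (μ, log σ), and reuses the `td_flow_loss` defined in Sec. 3.2 below as the reconstruction term:

```python
import jax.numpy as jnp
from jax import random

def pretrain_loss(params_enc, params_v, params_v_target, batch, key, gamma, lam):
    """Flow-ELBO sketch: TD flow-matching reconstruction + lam * KL(p_phi(z|s',a') || N(0, I))."""
    s, a, s_next, a_next = batch

    # Reconstruction term: the TD flow-matching loss from Sec. 3.2,
    # which samples z ~ p_phi(z | s', a') internally.
    flow_loss = td_flow_loss(params_v, params_v_target, params_enc, batch, key, gamma)

    # KL term, assuming the encoder exposes a diagonal-Gaussian posterior (mu, log_std).
    mu, log_std = encoder_params(params_enc, s_next, a_next)        # each [B, d]
    kl = 0.5 * jnp.sum(mu**2 + jnp.exp(2 * log_std) - 1.0 - 2 * log_std, axis=-1)

    return flow_loss + lam * jnp.mean(kl)
```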
3.2 SARSA-style Temporal-Difference Flow Matching
Use TD-Flow Matching [farebrother2025temporal] to avoid Monte-Carlo trajectories:
```
L_TD = (1 − γ) · CurrentFlow + γ · FutureFlow
```
- The current-flow term fits the vector field so that integrating from noise at t = 0 to t = 1 reproduces states from the current transition.
- The future-flow term bootstraps from (s′, a′) sampled from D and the encoder-generated z.
- The target vector field is an EMA copy of the online parameters, which stabilises learning (like a target Q-network).
Implementation snippet (JAX-like pseudo-code):
```python
import jax.numpy as jnp
from jax import random

def td_flow_loss(params_v, params_v_target, params_enc, batch, key, gamma):
    s, a, s_next, a_next = batch                      # each of shape [B, ...]
    B = s.shape[0]
    key_t, key_eps = random.split(key)

    # Infer the latent intention from the next transition (consistency assumption).
    z = encoder(params_enc, s_next, a_next)           # [B, d]

    t = random.uniform(key_t, shape=(B, 1))           # flow time in [0, 1]
    eps = random.normal(key_eps, shape=s.shape)       # Gaussian noise

    # Current flow term: transport noise (t = 0) to the current state (t = 1).
    s_t = t * s + (1 - t) * eps
    v_pred = flow(params_v, t, s_t, s, a, z)
    loss_curr = jnp.mean((v_pred - (s - eps)) ** 2)

    # Future flow term: bootstrap a future state from the EMA target flow,
    # conditioned on (s', a') and the inferred intention z.
    s_future = euler_integrate(params_v_target, eps, s_next, a_next, z)
    s_future_t = t * s_future + (1 - t) * eps
    v_target = flow(params_v_target, t, s_future_t, s, a, z)
    v_pred2 = flow(params_v, t, s_future_t, s, a, z)
    loss_future = jnp.mean((v_pred2 - v_target) ** 2)

    return (1 - gamma) * loss_curr + gamma * loss_future
```
4. Fine-tuning stage
4.1 Generative value estimation
Sample N futures from flow, evaluate reward, average → MC estimate of intention-specific Q:
```python
# Draw N noise seeds, push each through the flow, and average predicted rewards.
eps = random.normal(key, shape=(N, *s.shape))
s_futures = jnp.stack([euler_integrate(params_v, eps[i], s, a, z) for i in range(N)])
Q_hat = jnp.mean(r_eta(s_futures)) / (1 - gamma)   # MC estimate of Q(s, a, z)
```
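The `euler_integrate` helper used in both snippets is not spelled out above; a minimal fixed-step Euler solver consistent with the 10-step recommendation in Sec. 6 could look like this (`flow` is the same vector-field network v_θ as before):

```python
import jax.numpy as jnp

def euler_integrate(params_v, eps, s, a, z, num_steps: int = 10):
    """Fixed-step Euler integration of the learned vector field from noise (t=0) to t=1."""
    x = eps
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = jnp.full((*x.shape[:-1], 1), k * dt)      # current flow time, broadcast per sample
        x = x + dt * flow(params_v, t, x, s, a, z)    # one explicit Euler step
    return x                                          # approximate sample s^f ~ q_theta(.|s,a,z)
```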
4.2 Implicit GPI via expectile regression
We have infinitely many intentions z, so we cannot take an explicit maximum over z.
Instead, distil the stochastic estimates Q̂(s, a, z) into a single critic via upper-expectile regression (μ ≈ 0.9):
```
L_critic = E_{s,a,z}[ L_2^μ( Q̂(s, a, z) − Q_ψ(s, a) ) ],   where L_2^μ(u) = |μ − 1{u < 0}| · u²
```
The expectile plays the role of a soft maximum across intentions; as μ → 1 it approaches the hard max.
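A small sketch of the distillation step, where `critic` is an assumed helper returning Q_ψ(s, a) and `q_hat` holds the Monte-Carlo estimates from Sec. 4.1:

```python
import jax.numpy as jnp

def expectile_loss(diff, mu: float = 0.9):
    """Asymmetric squared loss L2^mu(u) = |mu - 1{u < 0}| * u^2."""
    weight = jnp.where(diff < 0, 1.0 - mu, mu)
    return weight * diff**2

def critic_loss(params_q, s, a, q_hat, mu: float = 0.9):
    """Distil intention-conditioned estimates Q_hat(s, a, z) into a single critic Q_psi(s, a)."""
    q = critic(params_q, s, a)                        # [B]
    return jnp.mean(expectile_loss(q_hat - q, mu))    # upper expectile over intentions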
4.3 Policy update
Stochastic actor that maximises the distilled critic, with a behaviour-cloning penalty weighted by α keeping it close to the data (conservative):
```
L_actor = − E_{s∼D_reward, a^π∼π(·|s)}[ Q_ψ(s, a^π) ] − α · E_{(s,a)∼D_reward}[ log π(a|s) ]
```
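As a sketch (`policy_sample`, `policy_log_prob`, and `critic` are assumed helpers; the exact regularisation used in the paper may differ):

```python
import jax.numpy as jnp
from jax import random

def actor_loss(params_pi, params_q, s, a_data, key, alpha: float):
    """Sketch of the actor update: maximise the distilled critic, stay close to dataset actions."""
    a_pi = policy_sample(params_pi, s, key)                    # a^pi ~ pi(.|s), reparameterised
    q_term = jnp.mean(critic(params_q, s, a_pi))               # policy-improvement term
    bc_term = jnp.mean(policy_log_prob(params_pi, s, a_data))  # behaviour-cloning term
    return -(q_term + alpha * bc_term)
```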
Practical tips:
- Update actor 4× slower than critic (asynchronous training).
- For image tasks, share CNN encoder between reward predictor, critic and policy.
5. Empirical results
Benchmarks:
- 16 ExORL locomotion tasks (state)
- 20 OGBench manipulation tasks (state)
- 4 OGBench manipulation tasks (RGB images)
Key numbers:
- Median 1.8× higher return and +36 % success rate vs best baseline.
- Jaco arm: 20× improvement over BC-style baselines, credited to modelling long-horizon futures.
- Vision tasks: survives raw-pixel input where model-free baselines fail.
Ablations:
- Replace variational z with hand-crafted skill encoders (FB, HILP) → worse or similar, but heavier to train.
- Replace implicit GPI by finite-sample GPI → larger variance & 44 % lower score.
- Remove z (one-step PI) → big drop, shows intentions matter.
Training cost: 1M pre-train + 0.5M fine-tune gradient steps (~4 h on single A6000 for state tasks).
6. Practical deployment guidelines
- Data: Works with any mixed-quality trajectories; more heterogeneity ⇒ richer intentions.
- Latent size: 128–512; tune on one task per domain (larger for locomotion, smaller for manipulation).
- Flow solver: Euler with 10 steps is enough; no need for expensive higher-order solvers.
- Regularisation:
- KL weight λ ∈ [0.01, 0.2].
- Behaviour-cloning weight α is critical: it prevents Q-value extrapolation in sparse-reward domains.
- Scaling to images: Add small IMPALA backbone, random-crop augmentation; keep flow in latent feature space to save memory.
- Serving: At runtime you only need the policy network. Flow and encoder are required only if you want on-the-fly adaptation to new rewards.
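The guidelines above can be gathered into a small configuration object; the class name and the defaults marked as assumptions are illustrative, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class InFOMConfig:
    """Hyperparameters collected from the deployment guidelines above."""
    latent_dim: int = 256            # tune within 128-512 per domain
    kl_weight: float = 0.05          # lambda, typically in [0.01, 0.2]
    bc_alpha: float = 1.0            # behaviour-cloning weight (task-dependent assumption)
    euler_steps: int = 10            # fixed-step Euler solver is sufficient
    pretrain_steps: int = 1_000_000
    finetune_steps: int = 500_000
```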
7. Take-aways for practitioners
- Generative occupancy modelling with flow matching provides stable access to long-horizon futures without autoregressive roll-outs and their compounding error.
- Latent-variable conditioning lets the same model explain multi-task datasets while avoiding the cost of learning one policy per skill.
- Expectile distillation is a simple, derivative-free way to embed GPI inside standard actor-critic codebases.
Overall, InFOM offers a modular and computationally tractable blueprint for anyone wanting to turn large, unlabelled robot logs into a single reusable policy backbone.