
Unified Training for SFM & AFM

Updated 25 November 2025
  • The paper presents a unified training procedure that integrates Synchronous Flow Matching (SFM) and Asynchronous Flow Matching (AFM) for efficient robotic action generation.
  • It employs a joint loss formulation that probabilistically mixes synchronous and asynchronous learning modes through mixed-mode minibatches and a diffusion-based masking strategy.
  • The approach improves data usage and KV-cache efficiency, and enables self-correction at inference via a dedicated confidence rater module.

A unified training procedure for SFM (Synchronous Flow Matching) and AFM (Asynchronous Flow Matching) enables a single model to support both uniform and asynchronous token-level action generation within vision-language-action (VLA) frameworks. This paradigm facilitates joint learning of temporally rigid and context-adaptive action policies, resulting in improved model efficiency, data usage, and self-correction capability during long-horizon robotic tasks (Jiang et al., 18 Nov 2025). The following presents a technical overview of unified SFM/AFM training, including the underlying objectives, joint loss formulation, practical implementation, and architectural modules.

1. Foundations: SFM and AFM

SFM and AFM are distinct approaches for flow matching in trajectory or sequence modeling, particularly for action generation in robotic agents:

  • Synchronous Flow Matching (SFM): Every action token in a trajectory is denoised simultaneously during training (i.e., the entire action sequence is generated at each step).
  • Asynchronous Flow Matching (AFM): A random subset of action tokens is masked and regenerated at each iteration, while unmasked tokens remain fixed, thus introducing temporal selectivity and allowing localized refinement.

The core difference lies in the mask vector m ∈ {0,1}^L. For SFM, m = 1^L; for AFM, entries of m are sampled independently via m_l ∼ Bernoulli(y), with y ∼ Uniform(0,1) (Jiang et al., 18 Nov 2025).
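As a concrete illustration, the two masking regimes can be sketched as follows (`sample_mask` and the `mode` flag are hypothetical names for this sketch, not identifiers from the paper):

```python
import numpy as np

def sample_mask(L, mode, rng=None):
    """Sample the token mask m for one action chunk of length L."""
    rng = rng if rng is not None else np.random.default_rng()
    if mode == "sfm":
        return np.ones(L)                      # m = 1^L: denoise every token
    y = rng.uniform(0.0, 1.0)                  # mask density y ~ Uniform(0,1)
    return rng.binomial(1, y, size=L).astype(float)  # m_l ~ Bernoulli(y)
```

Because the Bernoulli density y is itself random, AFM batches cover everything from near-empty to near-full masks.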

2. Unified Loss Formulation

Unified SFM/AFM training leverages a single stochastic objective that subsumes both modes by probabilistically mixing synchronous and asynchronous supervision in each minibatch. The total training loss is given by:

L_total(θ) = E_{y,m,τ} ‖ [ V_θ(o, ℓ, a − τ(a − n) ⊙ m) − (n − a) ] ⊙ m ‖²₂

  • o: Multi-view observation (images and robot state)
  • ℓ: Instruction embedding
  • a: Ground-truth action sequence
  • n: Gaussian noise sample
  • τ: Diffusion time step (τ ∈ [0,1])
  • V_θ: Velocity prediction network
  • m: Mask vector (m = 1^L for SFM; m_l sampled as above for AFM)
  • ⊙: Element-wise multiplication

When m = 1^L, this reduces to standard SFM; for a partial mask m, it implements AFM. The expectation averages over the random noise, mask, and diffusion step.
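A minimal single-sample sketch of this objective, assuming a 1-D action chunk and treating the velocity network as an arbitrary callable (all names here are illustrative, not the paper's code):

```python
import numpy as np

def unified_fm_loss(V_theta, o, l_emb, a, rng):
    """One-sample Monte Carlo estimate of L_total (masked flow-matching MSE)."""
    y = rng.uniform()                                   # mask density y
    m = rng.binomial(1, y, size=a.shape).astype(float)  # mask m (equals 1^L with prob. y^L)
    n = rng.standard_normal(a.shape)                    # Gaussian noise n
    tau = rng.uniform()                                 # diffusion step tau
    a_tau = a - tau * (a - n) * m                       # masked interpolation toward noise
    resid = (V_theta(o, l_emb, a_tau) - (n - a)) * m    # error counted on masked slots only
    return float(np.sum(resid ** 2))
```

Note that at τ = 0 the masked slots hold the clean actions a, and at τ = 1 they hold pure noise n, so the velocity target n − a is the constant drift along the interpolation path.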

3. Unified Training Algorithm

Training proceeds by drawing mixed-mode minibatches that interleave SFM and various AFM regimes. The per-iteration procedure is:

  1. Sample a batch of size B from dataset D, consisting of (o, ℓ, a) tuples.
  2. For each sample in the batch:
    • Sample mask density y ∼ Uniform(0,1).
    • For each token l, sample the mask entry m_l ∼ Bernoulli(y).
    • Sample diffusion step τ; sample noise n ∼ N(0, I).
    • Compute the velocity target n − a.
    • Generate noisy actions a_τ = a − τ(a − n) ⊙ m.
    • Predict V_θ(o, ℓ, a_τ).
  3. Compute L_total and update θ.

Because the sampled mask m sometimes equals 1^L, SFM gradient signals are always included (Jiang et al., 18 Nov 2025).
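How often the full mask recurs follows from the sampling scheme itself: under y ∼ Uniform(0,1), the full mask m = 1^L has probability ∫₀¹ y^L dy = 1/(L+1). This figure is our derivation from the stated distributions, not a number from the paper; a quick Monte Carlo check:

```python
import numpy as np

# With y ~ Uniform(0,1) and m_l ~ Bernoulli(y), the event m = 1^L has
# probability 1/(L+1), so pure-SFM updates recur naturally during training.
rng = np.random.default_rng(0)
L, trials = 4, 200_000
y = rng.uniform(size=(trials, 1))
m = rng.random((trials, L)) < y          # m_l ~ Bernoulli(y), vectorized
sfm_rate = m.all(axis=1).mean()
print(round(sfm_rate, 3))                # close to 1/(L+1) = 0.2
```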

4. Data Processing and Batching

Fixed-length action chunks of length L are extracted from demonstration trajectories, along with matching observation sequences and instruction embeddings:

  • Actions are normalized (zero mean, unit variance).
  • Chunks shorter than L are padded; the corresponding mask entries are set to 0 to avoid incurring spurious loss.
  • Batches are assembled at the chunk level and shuffled each epoch.

This batching approach supports efficient interleaving of SFM and AFM within large-scale distributed training runs.
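A sketch of the chunking and padding step, assuming 1-D normalized action streams and zero-padding (the helper name is hypothetical):

```python
import numpy as np

def make_chunks(actions, L):
    """Split a normalized action stream into length-L chunks with pad masks.

    Padded slots get mask 0 so they incur no loss (sketch, not the paper's code).
    """
    chunks, masks = [], []
    for start in range(0, len(actions), L):
        piece = np.asarray(actions[start:start + L], dtype=float)
        valid = len(piece)
        chunks.append(np.pad(piece, (0, L - valid)))  # zero-pad the short tail
        mask = np.zeros(L)
        mask[:valid] = 1.0                             # only real tokens count
        masks.append(mask)
    return np.stack(chunks), np.stack(masks)
```

The returned pad masks compose multiplicatively with the SFM/AFM mask m, so both mechanisms share the same loss-masking path.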

5. Confidence Rater Module

After model training, a separate confidence rater is trained to enable mask-driven self-correction at inference:

  • Consists of 4 transformer layers (32 attention heads, FFN width 6144) with a linear + sigmoid output head.
  • Input: token embeddings of the observation, the instruction, and the first-round SFM-predicted actions.
  • For each token l, the per-token MSE between the first-pass prediction and the ground-truth action is computed and rescaled into a confidence target in [0,1].
  • The rater is trained to regress these per-token confidence targets.

At test time, tokens with predicted confidence below a fixed threshold are masked for asynchronous self-correction passes.
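The masking rule at test time can be sketched as follows (the threshold value is a free parameter in this sketch; the paper's setting is not reproduced here):

```python
import numpy as np

def correction_mask(confidences, threshold):
    """Select tokens for asynchronous self-correction.

    Returns m_l = 1 for tokens whose predicted confidence falls below
    `threshold`; these slots are regenerated by the AFM pass.
    """
    return (np.asarray(confidences) < threshold).astype(float)
```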

6. Optimization, Hyperparameters, and Inference

Optimization:

  • Pre-training: Batch size 2048, AdamW optimizer (weight decay 0).
  • Learning rates: separate rates for the text backbone and flow-matching head versus the vision encoder.
  • Cosine LR decay schedule during pre-training; 20–50 epochs for fine-tuning.
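Assuming a PyTorch-style setup, the per-component learning rates and cosine schedule might be wired up as below; the modules are minimal stand-ins and the two learning-rate values are placeholders, not the paper's numbers:

```python
import torch
from torch import nn, optim

# Stand-in modules for illustration only.
text_and_fm_head = nn.Linear(32, 32)
vision_encoder = nn.Linear(32, 32)

optimizer = optim.AdamW(
    [
        {"params": text_and_fm_head.parameters(), "lr": 1e-4},  # text + FM head (placeholder lr)
        {"params": vision_encoder.parameters(), "lr": 1e-5},    # vision encoder (placeholder lr)
    ],
    weight_decay=0.0,
)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```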

Diffusion and Masking:

  • 10-step uniform diffusion schedule.
  • The diffusion-step sampling distribution biases early training toward high-noise states.
  • No explicit curriculum: stochastic sampling of the mask density y ensures uniform mixing of SFM and AFM.

Inference:

  • SFM generates initial actions for the whole chunk.
  • Confidence rater identifies low-confidence tokens.
  • AFM refines only the masked slots, reusing the KV-cache from the SFM pass for computational efficiency.
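The three-step inference pipeline above can be sketched as follows (the three callables and the threshold are hypothetical stand-ins for the model's components; only the control flow mirrors the text):

```python
import numpy as np

def act_with_self_correction(sfm_generate, rate_confidence, afm_refine,
                             obs, instr, threshold=0.5):
    """Two-pass inference sketch: SFM draft -> confidence rating -> AFM repair."""
    actions = sfm_generate(obs, instr)              # 1. full-chunk synchronous draft
    conf = rate_confidence(obs, instr, actions)     # 2. per-token confidence in [0, 1]
    mask = conf < threshold                         # 3. flag low-confidence tokens
    if mask.any():
        # Only the masked slots are regenerated; KV-cache entries for
        # unmasked tokens can be reused from the SFM pass.
        actions = afm_refine(obs, instr, actions, mask)
    return actions
```

When every token clears the threshold, the AFM pass is skipped entirely, so the correction step adds cost only where the draft is uncertain.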

7. Practical Implications and Considerations

Unified SFM+AFM training, as instantiated in AsyncVLA (Jiang et al., 18 Nov 2025), allows a single VLA model to flexibly operate in both fully synchronous and mask-driven asynchronous regimes. Notably, this approach:

  • Achieves self-correction in long-horizon tasks by localized action refinement.
  • Eliminates the need for separate heads or alternating training schedules.
  • Enhances data efficiency and model generalization without imposing extra curriculum learning mechanisms.
  • Improves KV-cache utilization during inference, with only masked action tokens requiring recomputation.

A plausible implication is that unified SFM/AFM procedures generalize beyond robotic embodied action, providing a template for training sequence models that require both global and local token regeneration capabilities within a single framework. This methodological innovation directly enables state-of-the-art results on generalist robotics tasks through improved stability and adaptability of action generation (Jiang et al., 18 Nov 2025).
