Unified Training for SFM & AFM

Updated 25 November 2025
  • The paper presents a unified training procedure that integrates Synchronous Flow Matching (SFM) and Asynchronous Flow Matching (AFM) for efficient robotic action generation.
  • It employs a joint loss formulation that probabilistically mixes synchronous and asynchronous learning modes through mixed-mode minibatches and a diffusion-based masking strategy.
  • The approach enhances model performance by improving data usage and KV-cache efficiency and by enabling self-correction via a dedicated confidence rater module during inference.

A unified training procedure for SFM (Synchronous Flow Matching) and AFM (Asynchronous Flow Matching) enables a single model to support both synchronous chunk-level and asynchronous token-level action generation within vision-language-action (VLA) frameworks. This paradigm supports joint learning of temporally rigid and context-adaptive action policies, improving model efficiency, data usage, and self-correction capability during long-horizon robotic tasks (Jiang et al., 18 Nov 2025). The following presents a technical overview of unified SFM/AFM training, including the underlying objectives, joint loss formulation, practical implementation, and architectural modules.

1. Foundations: SFM and AFM

SFM and AFM are distinct approaches for flow matching in trajectory or sequence modeling, particularly for action generation in robotic agents:

  • Synchronous Flow Matching (SFM): Every action token in a trajectory is denoised simultaneously during training (i.e., the entire action sequence is generated at each step).
  • Asynchronous Flow Matching (AFM): A random subset of action tokens is masked and regenerated at each iteration, while unmasked tokens remain fixed, thus introducing temporal selectivity and allowing localized refinement.

The core difference lies in the mask vector $m \in \{0,1\}^L$. For SFM, $m = 1^L$; for AFM, entries of $m$ are sampled independently via $m_l \sim \text{Bernoulli}(y)$, with $y \sim \text{Uniform}(0,1)$ (Jiang et al., 18 Nov 2025).
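Below is a minimal PyTorch sketch of the two masking regimes; the function name and tensor shapes are illustrative rather than taken from the paper.

```python
import torch

def sample_mask(batch_size: int, seq_len: int, sfm: bool = False) -> torch.Tensor:
    """Sample a binary mask m in {0,1}^L for each sequence in the batch."""
    if sfm:
        # SFM: every token is denoised, so the mask is all ones.
        return torch.ones(batch_size, seq_len)
    # AFM: draw a per-sample masking rate y ~ Uniform(0, 1),
    # then m_l ~ Bernoulli(y) independently for each token.
    y = torch.rand(batch_size, 1)
    return torch.bernoulli(y.expand(batch_size, seq_len))
```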

2. Unified Loss Formulation

Unified SFM/AFM training leverages a single stochastic objective that subsumes both modes by probabilistically mixing synchronous and asynchronous supervision in each minibatch. The total training loss is given by:

$$L_{\mathrm{total}}(\theta) = \mathbb{E}_{y, m, \tau}\left\| \left[ V_\theta\big(o, \ell, a - \tau(a - n) \odot m\big) - (n - a) \right] \odot m \right\|_2^2$$

  • $o_t$: Multi-view observation (images and robot state)
  • $\ell$: Instruction embedding
  • $a$: Ground-truth action sequence
  • $n$: Gaussian noise sample
  • $\tau$: Diffusion time step, $\tau \sim \text{Beta}(1.5, 1)$
  • $V_\theta$: Velocity prediction network
  • $m$: Mask vector ($m = 1^L$ for SFM; $m_l$ sampled as above for AFM)
  • $\odot$: Element-wise multiplication

When $m = 1^L$, this reduces to standard SFM; for partial $m$, it implements AFM. The expectation averages over random noise, mask, and diffusion steps.
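A compact sketch of this objective follows, assuming a PyTorch velocity network callable as $V_\theta(o, \ell, \text{noisy actions})$ and a per-token mask broadcast over the action dimension (both assumptions, not the paper's code).

```python
import torch

def unified_flow_matching_loss(v_theta, o, ell, a, n, tau, m):
    """Masked velocity-matching loss covering SFM (m all ones) and AFM.

    Assumed shapes: a, n of (B, L, D); m of (B, L, 1); tau of (B, 1, 1).
    """
    noisy_a = a - tau * (a - n) * m          # interpolate toward noise on masked tokens only
    u = n - a                                # velocity target
    v = v_theta(o, ell, noisy_a)             # predicted velocity, same shape as a
    sq_err = ((v - u) * m) ** 2              # restrict the error to masked tokens
    return sq_err.sum(dim=(1, 2)).mean()     # squared L2 norm per sample, averaged over batch
```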

3. Unified Training Algorithm

Training proceeds by drawing mixed-mode minibatches that interleave SFM and various AFM regimes. The per-iteration procedure is:

  1. Sample a batch of size $B$ from dataset $\mathcal{D}$ consisting of $(o_t, a, \ell)$ tuples.
  2. For each sample $i$:
    • Sample $y_i \sim \text{Uniform}(0,1)$.
    • For each token $l$, sample mask $m^{(i)}_l \sim \text{Bernoulli}(y_i)$.
    • Sample $\tau_i \sim \text{Beta}(1.5, 1)$ and noise $n^{(i)} \sim \mathcal{N}(0, I)$.
    • Compute the velocity target $u^{(i)} = n^{(i)} - a^{(i)}$.
    • Generate noisy actions $\hat{a}^{(i)}(\tau_i) = a^{(i)} - \tau_i\,(a^{(i)} - n^{(i)}) \odot m^{(i)}$.
    • Predict $v^{(i)} = V_\theta(o^{(i)}_t, \ell^{(i)}, \hat{a}^{(i)}(\tau_i))$.
  3. Compute $\frac{1}{B} \sum_{i=1}^B \| (v^{(i)} - u^{(i)}) \odot m^{(i)} \|_2^2$ and update $\theta$.

Because $m$ sometimes equals $1^L$, SFM gradient signals are always included (Jiang et al., 18 Nov 2025).
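Put together, one mixed-mode iteration might look as follows; this reuses the hypothetical unified_flow_matching_loss from the previous sketch, and the batch keys and tensor shapes are assumptions.

```python
import torch

def training_step(v_theta, optimizer, batch):
    """One mixed-mode minibatch iteration: each sample draws its own masking rate."""
    o, ell, a = batch["obs"], batch["instruction"], batch["actions"]    # a: (B, L, D)
    B, L, _ = a.shape

    y = torch.rand(B, 1, 1)                                     # per-sample masking rate
    m = torch.bernoulli(y.expand(B, L, 1))                      # y near 1 approaches the SFM regime
    tau = torch.distributions.Beta(1.5, 1.0).sample((B, 1, 1))  # diffusion time, biased toward high noise
    n = torch.randn_like(a)                                     # Gaussian noise

    loss = unified_flow_matching_loss(v_theta, o, ell, a, n, tau, m)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```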

4. Data Processing and Batching

Fixed-length action chunks of length $L$ are extracted from demonstration trajectories, along with matching observation sequences and instruction embeddings:

  • Actions are normalized (zero mean, unit variance).
  • Chunks shorter than $L$ are padded; the corresponding mask entries are set to 0 to avoid incurring spurious loss.
  • Batches are assembled at the chunk level and shuffled each epoch.

This batching approach supports efficient interleaving of SFM and AFM within large-scale distributed training runs.
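A minimal illustration of the padding and mask-zeroing step; the helper and its shapes are hypothetical, not the paper's implementation.

```python
import torch

def make_chunk(actions: torch.Tensor, L: int):
    """Pad a (possibly short) normalized action segment to a fixed-length chunk.

    Padded positions receive a mask of 0 so they contribute no loss.
    """
    T, D = actions.shape
    chunk = actions.new_zeros(L, D)
    valid = actions.new_zeros(L)
    n = min(T, L)
    chunk[:n] = actions[:n]
    valid[:n] = 1.0
    return chunk, valid
```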

5. Confidence Rater Module

After model training, a separate confidence rater $f_\phi$ is developed to enable mask-driven self-correction at inference:

  • Consists of 4 transformer layers (hidden size $d$, 32 heads, FFN width 6144) with a linear + sigmoid output head.
  • Input: token embeddings of $o_t$, $\ell$, and the first-round SFM-predicted actions.
  • For each token $l$, the per-token MSE $e_l$ is computed between the first-pass prediction $\hat{a}_{\mathrm{SFM},l}$ and the ground truth $a_l$, then rescaled as:

$$q_l = 1 - \alpha - \beta\,\frac{e_l - e_{\min}}{e_{\max} - e_{\min}} + \epsilon$$

with $\alpha = 0.01$, $\beta = 0.98$, $\epsilon = 10^{-6}$, where $e_{\min}$ and $e_{\max}$ denote the minimum and maximum per-token errors within the chunk.

  • Training objective: $L_{\mathrm{rater}}(\phi) = \mathbb{E}_{\text{chunks}} \sum_{l=1}^L (p_l - q_l)^2$, where $p_l$ is the rater's predicted confidence for token $l$.

At test time, tokens with predicted confidence $p_l < T$ (with $T = 0.5$) are masked for asynchronous self-correction passes.
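The target construction for the rater can be sketched as below; clamping the denominator is an added safeguard against a zero error range and is an implementation assumption.

```python
import torch

def rater_targets(pred_actions, gt_actions, alpha=0.01, beta=0.98, eps=1e-6):
    """Convert per-token errors of the first SFM pass into confidence targets q_l.

    pred_actions, gt_actions: (B, L, D); returns q of shape (B, L).
    """
    e = ((pred_actions - gt_actions) ** 2).mean(dim=-1)    # per-token MSE e_l
    e_min = e.min(dim=1, keepdim=True).values
    e_max = e.max(dim=1, keepdim=True).values
    denom = (e_max - e_min).clamp_min(eps)                  # guard against division by zero
    return 1.0 - alpha - beta * (e - e_min) / denom + eps   # low error -> q_l near 1 - alpha
```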

6. Optimization, Hyperparameters, and Inference

Optimization:

  • Pre-training: batch size 2048, AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0).
  • Learning rates: $1 \times 10^{-4}$ for the text backbone and FM head; $2 \times 10^{-5}$ for the vision encoder.
  • Cosine LR decay schedule; pre-training runs for $\sim 200$ epochs, with 20–50 epochs for fine-tuning (a configuration sketch follows below).
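A configuration sketch using the reported hyperparameters; the module attribute names (vision_encoder, text_backbone, fm_head) and total_steps are placeholders, not the paper's code.

```python
import torch

def build_optimizer(model, total_steps):
    """AdamW with per-module learning rates plus a cosine decay schedule."""
    param_groups = [
        {"params": model.vision_encoder.parameters(), "lr": 2e-5},
        {"params": model.text_backbone.parameters(),  "lr": 1e-4},
        {"params": model.fm_head.parameters(),        "lr": 1e-4},
    ]
    optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.999), weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```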

Diffusion and Masking:

  • 10-step uniform diffusion schedule.
  • $\tau \sim \text{Beta}(1.5, 1)$ biases early training toward high-noise states.
  • No explicit curriculum: the stochastic masking rate $y$ ensures uniform mixing of SFM and AFM.

Inference:

  • SFM generates initial actions for the whole chunk.
  • Confidence rater identifies low-confidence tokens.
  • AFM refines only the masked slots, reusing the KV-cache from the SFM pass for computational efficiency (see the sketch below).
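A high-level sketch of this self-correcting inference loop; the method names sfm_generate and afm_refine are hypothetical stand-ins for the model's synchronous and asynchronous generation paths.

```python
import torch

def self_correcting_inference(model, rater, obs, instruction, T=0.5):
    """SFM proposal, then confidence rating, then AFM refinement of weak tokens."""
    # Stage 1: synchronous pass generates the whole action chunk (10 denoising steps).
    actions, kv_cache = model.sfm_generate(obs, instruction, num_steps=10)

    # Stage 2: score each action token; tokens below threshold T are marked for regeneration.
    confidence = rater(obs, instruction, actions)           # shape (L,)
    mask = (confidence < T).float()

    # Stage 3: asynchronous refinement of only the masked tokens, reusing the KV-cache.
    if mask.any():
        actions = model.afm_refine(obs, instruction, actions, mask, kv_cache=kv_cache)
    return actions
```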

7. Practical Implications and Considerations

Unified SFM+AFM training, as instantiated in AsyncVLA (Jiang et al., 18 Nov 2025), allows a single VLA model to flexibly operate in both fully synchronous and mask-driven asynchronous regimes. Notably, this approach:

  • Achieves self-correction in long-horizon tasks by localized action refinement.
  • Eliminates the need for separate heads or alternating training schedules.
  • Enhances data efficiency and model generalization without imposing extra curriculum learning mechanisms.
  • Improves KV-cache utilization during inference, with only masked action tokens requiring recomputation.

A plausible implication is that unified SFM/AFM procedures generalize beyond robotic embodied action, providing a template for training sequence models that require both global and local token regeneration capabilities within a single framework. This methodological innovation directly enables state-of-the-art results on generalist robotics tasks through improved stability and adaptability of action generation (Jiang et al., 18 Nov 2025).
