Unified Training for SFM & AFM
- The paper presents a unified training procedure that integrates Synchronous Flow Matching (SFM) and Asynchronous Flow Matching (AFM) for efficient robotic action generation.
- It employs a joint loss formulation that probabilistically mixes synchronous and asynchronous learning modes through mixed-mode minibatches and a diffusion-based masking strategy.
- The approach improves data usage and KV-cache efficiency, and enables self-correction at inference via a dedicated confidence rater module.
A unified training procedure for SFM (Synchronous Flow Matching) and AFM (Asynchronous Flow Matching) enables a single model to support both uniform and asynchronous token-level action generation within vision-language-action (VLA) frameworks. This paradigm facilitates joint learning of temporally rigid and context-adaptive action policies, resulting in improved model efficiency, data usage, and self-correction capability during long-horizon robotic tasks (Jiang et al., 18 Nov 2025). The following presents a technical overview of unified SFM/AFM training, including the underlying objectives, joint loss formulation, practical implementation, and architectural modules.
1. Foundations: SFM and AFM
SFM and AFM are distinct approaches for flow matching in trajectory or sequence modeling, particularly for action generation in robotic agents:
- Synchronous Flow Matching (SFM): Every action token in a trajectory is denoised simultaneously during training (i.e., the entire action sequence is generated at each step).
- Asynchronous Flow Matching (AFM): A random subset of action tokens is masked and regenerated at each iteration, while unmasked tokens remain fixed, thus introducing temporal selectivity and allowing localized refinement.
The core difference lies in the mask vector $m \in \{0,1\}^K$ over the $K$ action tokens. For SFM, $m = \mathbf{1}$; for AFM, entries of $m$ are sampled independently via $m_k \sim \mathrm{Bernoulli}(\rho)$, with $\rho \sim U(0,1)$ (Jiang et al., 18 Nov 2025).
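To make the mask construction concrete, here is a minimal PyTorch sketch of this sampling scheme; the function name and signature are illustrative, not taken from the paper.

```python
import torch

def sample_mask(batch_size: int, num_tokens: int, synchronous: bool) -> torch.Tensor:
    """Return a binary mask m of shape (batch_size, num_tokens)."""
    if synchronous:
        # SFM: every token is regenerated, m = 1.
        return torch.ones(batch_size, num_tokens)
    # AFM: per-sample mask rate rho ~ U(0, 1), then m_k ~ Bernoulli(rho).
    rho = torch.rand(batch_size, 1)
    return torch.bernoulli(rho.expand(batch_size, num_tokens))
```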
2. Unified Loss Formulation
Unified SFM/AFM training leverages a single stochastic objective that subsumes both modes by probabilistically mixing synchronous and asynchronous supervision in each minibatch. The total training loss is given by:

$$\mathcal{L}(\theta) = \mathbb{E}_{(o,\ell,a) \sim \mathcal{D},\ \epsilon \sim \mathcal{N}(0,I),\ t,\ m} \left[ \left\| m \odot \big( v_\theta(a^t, t, o, \ell) - (a - \epsilon) \big) \right\|_2^2 \right],$$

where $a^t = m \odot \big((1-t)\,\epsilon + t\,a\big) + (1-m) \odot a$ interpolates between noise and data on masked tokens only, and:
- $o$: Multi-view observation (images and robot state)
- $\ell$: Instruction embedding
- $a$: Ground-truth action sequence
- $\epsilon$: Gaussian noise sample
- $t$: Diffusion time step ($t \in [0,1]$)
- $v_\theta$: Velocity prediction network
- $m$: Mask vector ($m = \mathbf{1}$ for SFM; sampled as above for AFM)
- $\odot$: Element-wise multiplication

When $m = \mathbf{1}$, this reduces to standard SFM; for partial masks ($m \neq \mathbf{1}$), it implements AFM. The expectation averages over random noise, mask, and diffusion steps.
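A minimal PyTorch sketch of this objective follows, assuming the $a^t = (1-t)\epsilon + t\,a$ interpolation above and a uniform $t$ for simplicity (Section 6 notes a high-noise-biased timestep distribution); tensor shapes and the `v_theta` callable are illustrative.

```python
import torch

def unified_fm_loss(v_theta, o, ell, a, m):
    """Masked flow-matching loss. a: (B, K, D) actions; m: (B, K) mask."""
    eps = torch.randn_like(a)            # Gaussian noise sample
    t = torch.rand(a.size(0), 1, 1)      # diffusion time, uniform for simplicity
    mk = m.unsqueeze(-1)                 # broadcast mask over action dimensions
    a_t = mk * ((1 - t) * eps + t * a) + (1 - mk) * a  # noise masked tokens only
    u = a - eps                          # velocity target
    u_hat = v_theta(a_t, t, o, ell)      # predicted velocity field
    return (mk * (u_hat - u)).pow(2).mean()  # loss restricted to masked tokens
```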
3. Unified Training Algorithm
Training proceeds by drawing mixed-mode minibatches that interleave SFM and various AFM regimes. The per-iteration procedure is:
- Sample a batch of size $B$ from dataset $\mathcal{D}$ consisting of $(o, \ell, a)$ tuples.
- For each sample $i$:
  - Sample mask rate $\rho^{(i)} \sim U(0,1)$.
  - For each token $k$, sample mask $m_k^{(i)} \sim \mathrm{Bernoulli}(\rho^{(i)})$.
  - Sample $t^{(i)}$; noise $\epsilon^{(i)} \sim \mathcal{N}(0, I)$.
  - Compute velocity target $u^{(i)} = a^{(i)} - \epsilon^{(i)}$.
  - Generate noisy actions $a^{t,(i)} = m^{(i)} \odot \big((1-t^{(i)})\,\epsilon^{(i)} + t^{(i)}\,a^{(i)}\big) + (1-m^{(i)}) \odot a^{(i)}$.
  - Predict $\hat{u}^{(i)} = v_\theta(a^{t,(i)}, t^{(i)}, o^{(i)}, \ell^{(i)})$.
  - Compute $\mathcal{L}^{(i)} = \|m^{(i)} \odot (\hat{u}^{(i)} - u^{(i)})\|_2^2$ and update $\theta$.
Because the sampled mask $m$ sometimes equals $\mathbf{1}$, SFM gradient signals are always included (Jiang et al., 18 Nov 2025).
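Combining the two sketches above gives an illustrative per-iteration training step (again a sketch under the stated assumptions, not the paper's code):

```python
def train_step(v_theta, optimizer, batch):
    o, ell, a = batch                 # observations, instructions, actions
    B, K, _ = a.shape
    # rho ~ U(0,1) occasionally yields m = 1, so SFM gradients stay in the mix.
    m = sample_mask(B, K, synchronous=False)
    loss = unified_fm_loss(v_theta, o, ell, a, m)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```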
4. Data Processing and Batching
Fixed-length action chunks of length $K$ are extracted from demonstration trajectories, along with matching observation sequences and instruction embeddings:
- Actions are normalized (zero mean, unit variance).
- Chunks shorter than $K$ are padded; the corresponding mask entries are set to 0 to avoid incurring spurious loss.
- Batches are assembled at the chunk level and shuffled each epoch.
This batching approach supports efficient interleaving of SFM and AFM within large-scale distributed training runs.
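A sketch of chunk extraction with padding-aware loss masks might look as follows (the helper name and slicing convention are assumptions):

```python
import torch

def extract_chunk(trajectory: torch.Tensor, start: int, K: int):
    """Slice a (T, D) normalized action trajectory into a length-K chunk.

    Returns the chunk and a loss mask whose padded entries are 0,
    so padding incurs no spurious loss.
    """
    segment = trajectory[start:start + K]
    T, D = segment.shape
    chunk = torch.zeros(K, D)
    loss_mask = torch.zeros(K)
    chunk[:T] = segment
    loss_mask[:T] = 1.0
    return chunk, loss_mask
```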
5. Confidence Rater Module
After model training, a separate confidence rater is developed to enable mask-driven self-correction at inference:
- Consists of 4 transformer layers (32 attention heads, FFN width 6144) with a linear + sigmoid output head.
- Input: token embeddings of $o$, $\ell$, and the first-round SFM-predicted actions.
- For each token $k$, the per-token MSE $e_k$ between the first-pass prediction $\hat{a}_k$ and the ground truth $a_k$ is computed, then rescaled into a confidence target $c_k \in [0,1]$ via a fixed transform with three scalar hyperparameters.
- The rater is trained to regress $c_k$ for every token.
At test time, tokens with predicted confidence below a threshold are masked for asynchronous self-correction passes.
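A minimal module sketch consistent with this description is shown below; the hidden size (`d_model=1536`) and the threshold value `tau` are assumptions, since only the head count and FFN width are stated above.

```python
import torch
import torch.nn as nn

class ConfidenceRater(nn.Module):
    """Sketch: 4 transformer layers with a linear + sigmoid head.
    d_model=1536 is an assumed hidden size; 32 heads and FFN 6144 follow the text."""
    def __init__(self, d_model: int = 1536, n_heads: int = 32,
                 ffn: int = 6144, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, ffn, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L, d_model) embeddings of o, ell, and first-pass actions.
        h = self.encoder(tokens)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # per-token confidence in [0, 1]

def correction_mask(confidence: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Mark tokens below an assumed confidence threshold tau for regeneration."""
    return (confidence < tau).float()
```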
6. Optimization, Hyperparameters, and Inference
Optimization:
- Pre-training: Batch size 2048, AdamW optimizer (weight decay 0).
- Learning rates: distinct rates for the text and FM head versus the vision encoder.
- Cosine LR decay schedule during pre-training; 20–50 epochs for fine-tuning.
Diffusion and Masking:
- 10-step uniform diffusion schedule.
- The diffusion timestep distribution biases early training toward high-noise states (see the sampling sketch after this list).
- No explicit curriculum: stochastic mask-rate sampling ($\rho \sim U(0,1)$) ensures uniform mixing of SFM and AFM.
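As a sketch of what such a biased sampler could look like (the paper's exact distribution and parameters are not reproduced here; `Beta(1, 2)` is an assumed stand-in that skews $t$ toward the high-noise end $t \approx 0$ under the interpolation convention used above):

```python
import torch

def sample_timesteps(batch_size: int) -> torch.Tensor:
    """High-noise-biased timestep sampling; Beta(1, 2) is an assumed stand-in."""
    return torch.distributions.Beta(1.0, 2.0).sample((batch_size,))
```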
Inference:
- SFM generates initial actions for the whole chunk.
- Confidence rater identifies low-confidence tokens.
- AFM refines only the masked slots, reusing the KV-cache from the SFM pass for computational efficiency.
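The following sketch illustrates this two-pass scheme with a simple Euler integrator over the 10-step schedule; `embed_tokens` is an assumed helper, `tau` is an assumed threshold, and KV-cache reuse is elided for brevity.

```python
import torch

@torch.no_grad()
def generate_with_self_correction(v_theta, rater, embed_tokens,
                                  o, ell, K, D, steps=10, tau=0.5):
    """Pass 1: SFM over the whole chunk. Pass 2: AFM on low-confidence tokens."""
    a = torch.randn(1, K, D)                       # start from pure noise (t = 0)
    for i in range(steps):                         # Euler integration of v_theta
        t = torch.full((1, 1, 1), i / steps)
        a = a + v_theta(a, t, o, ell) / steps
    conf = rater(embed_tokens(o, ell, a))          # per-token confidence, (1, K)
    m = (conf < tau).float().unsqueeze(-1)         # 1 = regenerate this token
    a_ref = m * torch.randn_like(a) + (1 - m) * a  # re-noise masked slots only
    for i in range(steps):                         # refine masked tokens (AFM pass)
        t = torch.full((1, 1, 1), i / steps)
        a_ref = a_ref + m * v_theta(a_ref, t, o, ell) / steps
    return a_ref
```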
7. Practical Implications and Considerations
Unified SFM+AFM training, as instantiated in AsyncVLA (Jiang et al., 18 Nov 2025), allows a single VLA model to flexibly operate in both fully synchronous and mask-driven asynchronous regimes. Notably, this approach:
- Achieves self-correction in long-horizon tasks by localized action refinement.
- Eliminates the need for separate heads or alternating training schedules.
- Enhances data efficiency and model generalization without imposing extra curriculum learning mechanisms.
- Improves KV-cache utilization during inference, with only masked action tokens requiring recomputation.
A plausible implication is that unified SFM/AFM procedures generalize beyond robotic embodied action, providing a template for training sequence models that require both global and local token regeneration capabilities within a single framework. This methodological innovation directly enables state-of-the-art results on generalist robotics tasks through improved stability and adaptability of action generation (Jiang et al., 18 Nov 2025).