Unified Training for SFM & AFM
- The paper presents a unified training procedure that integrates Synchronous Flow Matching (SFM) and Asynchronous Flow Matching (AFM) for efficient robotic action generation.
- It employs a joint loss formulation that probabilistically mixes synchronous and asynchronous learning modes through mixed-mode minibatches and a diffusion-based masking strategy.
- The approach improves data usage and KV-cache efficiency, and enables self-correction at inference via a dedicated confidence rater module.
A unified training procedure for SFM (Synchronous Flow Matching) and AFM (Asynchronous Flow Matching) enables a single model to support both uniform and asynchronous token-level action generation within vision-language-action (VLA) frameworks. This paradigm facilitates joint learning of temporally rigid and context-adaptive action policies, resulting in improved model efficiency, data usage, and self-correction capability during long-horizon robotic tasks (Jiang et al., 18 Nov 2025). The following presents a technical overview of unified SFM/AFM training, including the underlying objectives, joint loss formulation, practical implementation, and architectural modules.
1. Foundations: SFM and AFM
SFM and AFM are distinct approaches for flow matching in trajectory or sequence modeling, particularly for action generation in robotic agents:
- Synchronous Flow Matching (SFM): Every action token in a trajectory is denoised simultaneously during training (i.e., the entire action sequence is generated at each step).
- Asynchronous Flow Matching (AFM): A random subset of action tokens is masked and regenerated at each iteration, while unmasked tokens remain fixed, thus introducing temporal selectivity and allowing localized refinement.
The core difference lies in the mask vector $m \in \{0,1\}^K$ over the $K$ action tokens. For SFM, $m = \mathbf{1}$; for AFM, entries of $m$ are sampled independently via $m_k \sim \mathrm{Bernoulli}(\rho)$, with $\rho \sim U(0,1)$ (Jiang et al., 18 Nov 2025).
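To make the mask construction concrete, here is a minimal PyTorch sketch of this sampling scheme; the function name and signature are illustrative, not taken from the paper.

```python
import torch

def sample_mask(batch_size: int, num_tokens: int, synchronous: bool) -> torch.Tensor:
    """Return a binary mask m of shape (batch_size, num_tokens)."""
    if synchronous:
        # SFM: every token is regenerated, m = 1.
        return torch.ones(batch_size, num_tokens)
    # AFM: per-sample mask rate rho ~ U(0, 1), then m_k ~ Bernoulli(rho).
    rho = torch.rand(batch_size, 1)
    return torch.bernoulli(rho.expand(batch_size, num_tokens))
```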
2. Unified Loss Formulation
Unified SFM/AFM training leverages a single stochastic objective that subsumes both modes by probabilistically mixing synchronous and asynchronous supervision in each minibatch. The total training loss is given by:

$$\mathcal{L}(\theta) = \mathbb{E}_{(o,\ell,a) \sim \mathcal{D},\ \epsilon \sim \mathcal{N}(0,I),\ t,\ m} \left[ \left\| m \odot \big( v_\theta(a^t, t, o, \ell) - (a - \epsilon) \big) \right\|_2^2 \right],$$

where $a^t = m \odot \big((1-t)\,\epsilon + t\,a\big) + (1-m) \odot a$ interpolates between noise and data on masked tokens only, and:
- $o$: Multi-view observation (images and robot state)
- $\ell$: Instruction embedding
- $a$: Ground-truth action sequence
- $\epsilon$: Gaussian noise sample
- $t$: Diffusion time step ($t \in [0,1]$)
- $v_\theta$: Velocity prediction network
- $m$: Mask vector ($m = \mathbf{1}$ for SFM; sampled as above for AFM)
- $\odot$: Element-wise multiplication

When $m = \mathbf{1}$, this reduces to standard SFM; for partial masks ($m \neq \mathbf{1}$), it implements AFM. The expectation averages over random noise, mask, and diffusion steps.
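A minimal PyTorch sketch of this objective follows, assuming the $a^t = (1-t)\epsilon + t\,a$ interpolation above and a uniform $t$ for simplicity (Section 6 notes a high-noise-biased timestep distribution); tensor shapes and the `v_theta` callable are illustrative.

```python
import torch

def unified_fm_loss(v_theta, o, ell, a, m):
    """Masked flow-matching loss. a: (B, K, D) actions; m: (B, K) mask."""
    eps = torch.randn_like(a)            # Gaussian noise sample
    t = torch.rand(a.size(0), 1, 1)      # diffusion time, uniform for simplicity
    mk = m.unsqueeze(-1)                 # broadcast mask over action dimensions
    a_t = mk * ((1 - t) * eps + t * a) + (1 - mk) * a  # noise masked tokens only
    u = a - eps                          # velocity target
    u_hat = v_theta(a_t, t, o, ell)      # predicted velocity field
    return (mk * (u_hat - u)).pow(2).mean()  # loss restricted to masked tokens
```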
3. Unified Training Algorithm
Training proceeds by drawing mixed-mode minibatches that interleave SFM and various AFM regimes. The per-iteration procedure is:
- Sample a batch of size $B$ from dataset $\mathcal{D}$ consisting of $(o, \ell, a)$ tuples.
- For each sample $i$:
  - Sample mask rate $\rho^{(i)} \sim U(0,1)$.
  - For each token $k$, sample mask $m_k^{(i)} \sim \mathrm{Bernoulli}(\rho^{(i)})$.
  - Sample $t^{(i)}$; noise $\epsilon^{(i)} \sim \mathcal{N}(0, I)$.
  - Compute velocity target $u^{(i)} = a^{(i)} - \epsilon^{(i)}$.
  - Generate noisy actions $a^{t,(i)} = m^{(i)} \odot \big((1-t^{(i)})\,\epsilon^{(i)} + t^{(i)}\,a^{(i)}\big) + (1-m^{(i)}) \odot a^{(i)}$.
  - Predict $\hat{u}^{(i)} = v_\theta(a^{t,(i)}, t^{(i)}, o^{(i)}, \ell^{(i)})$.
  - Compute $\mathcal{L}^{(i)} = \|m^{(i)} \odot (\hat{u}^{(i)} - u^{(i)})\|_2^2$ and update $\theta$.
Because the sampled mask $m$ sometimes equals $\mathbf{1}$, SFM gradient signals are always included (Jiang et al., 18 Nov 2025).
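Combining the two sketches above gives an illustrative per-iteration training step (again a sketch under the stated assumptions, not the paper's code):

```python
def train_step(v_theta, optimizer, batch):
    o, ell, a = batch                 # observations, instructions, actions
    B, K, _ = a.shape
    # rho ~ U(0,1) occasionally yields m = 1, so SFM gradients stay in the mix.
    m = sample_mask(B, K, synchronous=False)
    loss = unified_fm_loss(v_theta, o, ell, a, m)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```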
4. Data Processing and Batching
Fixed-length action chunks of length $K$ are extracted from demonstration trajectories, along with matching observation sequences and instruction embeddings:
- Actions are normalized (zero mean, unit variance).
- Chunks shorter than $K$ are padded; the corresponding mask entries are set to 0 to avoid incurring spurious loss.
- Batches are assembled at the chunk level and shuffled each epoch.
This batching approach supports efficient interleaving of SFM and AFM within large-scale distributed training runs.
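A sketch of chunk extraction with padding-aware loss masks might look as follows (the helper name and slicing convention are assumptions):

```python
import torch

def extract_chunk(trajectory: torch.Tensor, start: int, K: int):
    """Slice a (T, D) normalized action trajectory into a length-K chunk.

    Returns the chunk and a loss mask whose padded entries are 0,
    so padding incurs no spurious loss.
    """
    segment = trajectory[start:start + K]
    T, D = segment.shape
    chunk = torch.zeros(K, D)
    loss_mask = torch.zeros(K)
    chunk[:T] = segment
    loss_mask[:T] = 1.0
    return chunk, loss_mask
```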
5. Confidence Rater Module
After model training, a separate confidence rater is developed to enable mask-driven self-correction at inference:
- Consists of 4 transformer layers (32 attention heads, FFN width 6144) with a linear + sigmoid output head.
- Input: token embeddings of $o$, $\ell$, and the first-round SFM-predicted actions.
- For each token $k$, the per-token MSE $e_k$ between the first-pass prediction $\hat{a}_k$ and the ground truth $a_k$ is computed, then rescaled into a confidence target $c_k \in [0,1]$ via a fixed transform with three scalar hyperparameters.
- The rater is trained to regress $c_k$ for every token.
At test time, tokens with predicted confidence below a threshold are masked for asynchronous self-correction passes.
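A minimal module sketch consistent with this description is shown below; the hidden size (`d_model=1536`) and the threshold value `tau` are assumptions, since only the head count and FFN width are stated above.

```python
import torch
import torch.nn as nn

class ConfidenceRater(nn.Module):
    """Sketch: 4 transformer layers with a linear + sigmoid head.
    d_model=1536 is an assumed hidden size; 32 heads and FFN 6144 follow the text."""
    def __init__(self, d_model: int = 1536, n_heads: int = 32,
                 ffn: int = 6144, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, ffn, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L, d_model) embeddings of o, ell, and first-pass actions.
        h = self.encoder(tokens)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # per-token confidence in [0, 1]

def correction_mask(confidence: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Mark tokens below an assumed confidence threshold tau for regeneration."""
    return (confidence < tau).float()
```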
6. Optimization, Hyperparameters, and Inference
Optimization:
- Pre-training: Batch size 2048, AdamW optimizer (weight decay 0).
- Learning rates: distinct rates for the text and FM head versus the vision encoder.
- Cosine LR decay schedule during pre-training; 20–50 epochs for fine-tuning.
Diffusion and Masking:
- 10-step uniform diffusion schedule.
- The diffusion timestep distribution biases early training toward high-noise states (see the sampling sketch after this list).
- No explicit curriculum: stochastic mask-rate sampling ($\rho \sim U(0,1)$) ensures uniform mixing of SFM and AFM.
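As a sketch of what such a biased sampler could look like (the paper's exact distribution and parameters are not reproduced here; `Beta(1, 2)` is an assumed stand-in that skews $t$ toward the high-noise end $t \approx 0$ under the interpolation convention used above):

```python
import torch

def sample_timesteps(batch_size: int) -> torch.Tensor:
    """High-noise-biased timestep sampling; Beta(1, 2) is an assumed stand-in."""
    return torch.distributions.Beta(1.0, 2.0).sample((batch_size,))
```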
Inference:
- SFM generates initial actions for the whole chunk.
- Confidence rater identifies low-confidence tokens.
- AFM refines only the masked slots, reusing the KV-cache from the SFM pass for computational efficiency.
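The following sketch illustrates this two-pass scheme with a simple Euler integrator over the 10-step schedule; `embed_tokens` is an assumed helper, `tau` is an assumed threshold, and KV-cache reuse is elided for brevity.

```python
import torch

@torch.no_grad()
def generate_with_self_correction(v_theta, rater, embed_tokens,
                                  o, ell, K, D, steps=10, tau=0.5):
    """Pass 1: SFM over the whole chunk. Pass 2: AFM on low-confidence tokens."""
    a = torch.randn(1, K, D)                       # start from pure noise (t = 0)
    for i in range(steps):                         # Euler integration of v_theta
        t = torch.full((1, 1, 1), i / steps)
        a = a + v_theta(a, t, o, ell) / steps
    conf = rater(embed_tokens(o, ell, a))          # per-token confidence, (1, K)
    m = (conf < tau).float().unsqueeze(-1)         # 1 = regenerate this token
    a_ref = m * torch.randn_like(a) + (1 - m) * a  # re-noise masked slots only
    for i in range(steps):                         # refine masked tokens (AFM pass)
        t = torch.full((1, 1, 1), i / steps)
        a_ref = a_ref + m * v_theta(a_ref, t, o, ell) / steps
    return a_ref
```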
7. Practical Implications and Considerations
Unified SFM+AFM training, as instantiated in AsyncVLA (Jiang et al., 18 Nov 2025), allows a single VLA model to flexibly operate in both fully synchronous and mask-driven asynchronous regimes. Notably, this approach:
- Achieves self-correction in long-horizon tasks by localized action refinement.
- Eliminates the need for separate heads or alternating training schedules.
- Enhances data efficiency and model generalization without imposing extra curriculum learning mechanisms.
- Improves KV-cache utilization during inference, with only masked action tokens requiring recomputation.
A plausible implication is that unified SFM/AFM procedures generalize beyond robotic embodied action, providing a template for training sequence models that require both global and local token regeneration capabilities within a single framework. This methodological innovation directly enables state-of-the-art results on generalist robotics tasks through improved stability and adaptability of action generation (Jiang et al., 18 Nov 2025).