Flow-matching Action Tokenizer (FACT)

Updated 25 May 2026

FACT is a discrete action tokenization approach that maps continuous control trajectories to geometry-aware token sequences, enabling high-fidelity control in autonomous systems.
It employs flow-matching objectives and a two-stage quantization process to bridge the gap between discrete actions in VLA models and the precision needed for real-world tasks.
Empirical results demonstrate significant improvements in trajectory reconstruction and policy performance on benchmarks for autonomous driving and robotic manipulation.

The Flow-matching Action Tokenizer (FACT) is a discrete action tokenization approach that enables high-fidelity mapping between continuous control trajectories and compact, geometry-aware token sequences. FACT combines flow-matching objectives with structure-preserving quantization, bridging the gap between the discrete action spaces favored by vision-language-action (VLA) models and the precision demands of real-world autonomous agents. It serves as a backbone for end-to-end learning in both autonomous driving and robotic manipulation, supporting coarse-to-fine action inference, parallelized decoding, and integration with multimodal policy architectures (Xu et al., 5 Dec 2025, Liu et al., 30 Dec 2025).

1. Mathematical Foundations: Flow-matching Objective

FACT builds on the discrete flow-matching principle, wherein the generative path from a simple prior (e.g., uniform noise or masked tokens) to empirical trajectory distributions is defined via parameterized conditional probability flows. Given a tokenized action space $S = \mathcal{T}^D$, where $\mathcal{T}$ is a token vocabulary (numerical, angular, or semantic) and $D$ is the trajectory dimensionality, FACT defines a time-indexed probability path:

$p_t(x) = \sum_{x_1 \in S} q(x_1) \, p_t(x | x_1)$

with $p_0(x)$ as the prior and $p_1(x) = q(x)$ as the dataset trajectory distribution (Xu et al., 5 Dec 2025).

For numerical and angular tokens, the conditional path $p_t(x|x_1)$ is set via a geometry-aware Gibbs kernel:

$p_t(x|x_1) = \text{softmax}(-\beta_t d(x, x_1))$

where $d(x, x_1)$ is a weighted coordinate-wise dissimilarity. The underlying continuous-time Markov chain (CTMC) is specified by transition rates:

$u_t(x, z | x_1) = p_t(x | x_1) \cdot \dot{\beta}_t [d(z, x_1) - d(x, x_1)]_+$

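Transitions that reduce distance to the expert trajectory are favored. A neural approximation of the posterior $p^{\theta}_{1|t}(x_1 | x_t)$ is trained via cross-entropy minimization:

$\mathcal{L}_{\mathrm{CE}}(\theta) = -\mathbb{E}_{t,\, x_1 \sim q,\, x_t \sim p_t(\cdot | x_1)} \left[ \log p^{\theta}_{1|t}(x_1 | x_t) \right]$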

(Xu et al., 5 Dec 2025).
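The Gibbs kernel and the transition rates above can be sketched for a single scalar coordinate. This is a minimal illustration, not the authors' implementation: it assumes $d(x, x_1) = |x - x_1|$ on a small hypothetical codebook, reads $z$ as the current state and $x$ as the candidate target state, and treats `beta` and `beta_dot` as plain numbers rather than a learned or scheduled $\beta_t$. The function names are illustrative.

```python
import math

def gibbs_path(codebook, x1_idx, beta):
    """p_t(x | x1) = softmax(-beta * d(x, x1)), with d = |x - x1| (assumed metric)."""
    x1 = codebook[x1_idx]
    logits = [-beta * abs(c - x1) for c in codebook]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def transition_rates(codebook, x1_idx, z_idx, beta, beta_dot):
    """u_t(x, z | x1) = p_t(x | x1) * beta_dot * [d(z, x1) - d(x, x1)]_+ for every target x."""
    x1 = codebook[x1_idx]
    p = gibbs_path(codebook, x1_idx, beta)
    d_z = abs(codebook[z_idx] - x1)
    return [p_x * beta_dot * max(d_z - abs(c - x1), 0.0)
            for p_x, c in zip(p, codebook)]

# Hypothetical 5-entry codebook; expert token is index 2, current token is index 0.
codebook = [0.0, 0.5, 1.0, 1.5, 2.0]
rates = transition_rates(codebook, x1_idx=2, z_idx=0, beta=2.0, beta_dot=1.0)
```

In this toy setting the rate is zero for any target that is no closer to the expert token than the current state, and largest for the expert token itself, which matches the statement that distance-reducing transitions are favored.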

In the robotic domain, FACT formalizes the flow between Gaussian noise $\epsilon$ and the action trajectory $a$ via a linear interpolation $x_t = (1 - t)\,\epsilon + t\,a$ for $t \in [0, 1]$, and optimizes a rectified flow loss:

$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{t,\, \epsilon,\, a} \left\| v_\theta(x_t, t, c) - (a - \epsilon) \right\|^2$

where $v_\theta$ is the flow-matching decoder and $c$ is the quantized code (Liu et al., 30 Dec 2025).
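A rectified flow objective of this kind can be sketched in a few lines. The version below is a hedged illustration under explicit assumptions: noise sits at $t = 0$ and the trajectory at $t = 1$, the regression target is the constant velocity (trajectory minus noise), and `decoder(x_t, t, code)` is a hypothetical callable standing in for the learned flow decoder. None of these names come from the paper.

```python
import random

def interpolate(noise, action, t):
    """Linear path x_t = (1 - t) * noise + t * action (assumed convention: t = 0 is noise)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def rectified_flow_loss(decoder, action, code, rng):
    """Single-sample squared error between the predicted velocity and (action - noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    target = [a - n for a, n in zip(action, noise)]
    pred = decoder(x_t, t, code)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(action)

# Toy check: an "oracle" decoder that knows the action recovers the target velocity,
# because action - x_t = (1 - t) * (action - noise) on the straight-line path.
action = [0.3, -0.1, 0.7]

def oracle(x_t, t, code):
    return [(a - x) / (1.0 - t) for a, x in zip(action, x_t)]

loss = rectified_flow_loss(oracle, action, code=None, rng=random.Random(0))
```

In a real training loop the loss would be averaged over a batch and minimized with respect to the decoder parameters; the oracle here only shows that the target velocity is the one a perfect decoder would predict.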

2. Discretization: Metric-aligned Quantization and Token Embeddings

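FACT discretizes actions in two stages: continuous values are first quantized into tokens, and the token embeddings are then aligned with the underlying geometry. The quantizer depends on the domain: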

Uniform Scalar Codebook (Autonomous Driving): Scalar values (e.g., coordinates, headings) are quantized via nearest-neighbor lookup in a fine-grained codebook $\mathcal{C} = \{c_k\}$ (e.g., a uniform grid with 0.01 spacing). The token ID $k^*$ for a scalar $s$ is selected by $k^* = \arg\min_k |s - c_k|$.
Sign-based Bit Quantization (Robotics): A transformer-style VQ-encoder compresses a continuous action chunk $a$ into a latent $z$, which is quantized elementwise via $\hat{z} = \operatorname{sign}(z)$. Each $K$-bit row of $\hat{z}$ is mapped to an integer token in $\{0, \dots, 2^K - 1\}$ (Liu et al., 30 Dec 2025).
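Both quantizers described above reduce to a few lines of code. The sketch below is illustrative only: the codebook range of $[-1, 1]$, the bit order (most significant bit first), and the convention that a zero latent maps to bit 0 are assumptions, and the encoder that would produce the latent is not modeled.

```python
def make_uniform_codebook(lo, hi, step):
    """Uniform grid lo, lo + step, ..., hi (inclusive)."""
    n = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(n)]

def quantize_scalar(value, codebook):
    """Token ID = index of the nearest codebook entry."""
    return min(range(len(codebook)), key=lambda k: abs(value - codebook[k]))

def sign_bits_to_token(latent_row):
    """Sign-quantize a latent row and pack the bits (MSB first) into an integer token."""
    token = 0
    for z in latent_row:
        token = (token << 1) | (1 if z > 0 else 0)
    return token

def token_to_signs(token, num_bits):
    """Inverse map: integer token back to a +/-1 vector of length num_bits."""
    return [1.0 if (token >> (num_bits - 1 - i)) & 1 else -1.0
            for i in range(num_bits)]

codebook = make_uniform_codebook(-1.0, 1.0, 0.01)   # hypothetical range, 201 entries
scalar_token = quantize_scalar(0.234, codebook)     # 123, i.e. the entry near 0.23
bit_token = sign_bits_to_token([0.7, -0.2, 0.1, -1.5])  # 10 (bits 1, 0, 1, 0)
```

The scalar quantizer bounds the round-trip error by half the grid spacing, while the sign quantizer keeps only the sign pattern of the latent, so reconstruction fidelity in that case depends on the decoder rather than on the token itself.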

To ensure the token embeddings respect underlying geometry:

A linear embedding head (one for scalar tokens, one for chunk tokens) is applied, followed by $\ell_2$ normalization.
For scalars, a triplet-margin ranking loss enforces that the embedding distance $\|e_i - e_j\|_2$ correlates with the true scalar difference $|c_i - c_j|$, up to a fixed margin (Xu et al., 5 Dec 2025).
For robot control tokens, entropy and commitment regularizers are included to promote codebook coverage and embedding stability.
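The metric-alignment idea for scalar tokens can be illustrated with a small triplet-margin loss. This is a sketch under stated assumptions rather than the paper's exact loss: embeddings are L2-normalized, distances are Euclidean, the positive is whichever of two candidate tokens has the scalar value closer to the anchor's, and the margin of 0.2 in the usage is arbitrary.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(emb, values, anchor, i, j, margin):
    """Hinge loss: the token whose scalar is closer to the anchor should also be closer in embedding space."""
    closer_is_i = abs(values[i] - values[anchor]) <= abs(values[j] - values[anchor])
    pos, neg = (i, j) if closer_is_i else (j, i)
    e_a, e_p, e_n = (l2_normalize(emb[k]) for k in (anchor, pos, neg))
    return max(0.0, dist(e_a, e_p) - dist(e_a, e_n) + margin)

# Toy embeddings placed on the unit circle at an angle equal to the scalar value,
# so embedding distance grows with scalar difference and the loss vanishes.
values = [0.0, 0.1, 1.0]
aligned = [[math.cos(v), math.sin(v)] for v in values]
loss = triplet_margin_loss(aligned, values, anchor=0, i=1, j=2, margin=0.2)
```

Swapping the embeddings of the second and third tokens breaks the ordering and yields a positive loss, which is the signal that pushes the embedding geometry back toward the scalar geometry.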

3. Decoding and Trajectory Reconstruction

Decoded token sequences are reconstructed to continuous trajectories through flow-based integration:

In the WAM-Flow context, the inference loop samples initial tokens, then applies a parallel, Euler-style CTMC integration to update each coordinate in coarse-to-fine steps, leveraging the learned posterior and geometry-aware transition rates (Xu et al., 5 Dec 2025).
In GenieReasoner, autoregressively decoded token sequences are mapped back to codes, and a learned ODE is integrated from Gaussian noise using the flow decoder $v_\theta$. The result is a reconstructed continuous trajectory $\hat{a}$ (Liu et al., 30 Dec 2025).

The architecture supports parallel decoding, coarse-to-fine trajectory refinement, and flexibility in the number of integration steps, yielding tunable trade-offs between compute and accuracy.
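The compute/accuracy trade-off from the number of integration steps can be seen in a fixed-step Euler integrator. The sketch below is illustrative: `velocity(x, t, code)` is a hypothetical stand-in for a learned flow decoder, and the toy field used in the usage (relaxation toward the code) was chosen only because its exact solution is known, not because it resembles the trained model.

```python
def euler_decode(velocity, x0, code, num_steps):
    """Integrate dx/dt = velocity(x, t, code) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t, code)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_field(x, t, code):
    """Toy velocity: relax toward the code vector. Exact solution at t = 1 is
    code + (x0 - code) * exp(-1)."""
    return [c - xi for c, xi in zip(code, x)]

code = [1.0, -2.0]          # stands in for the conditioning derived from tokens
x0 = [0.0, 0.0]             # stands in for the initial noise sample
coarse = euler_decode(toy_field, x0, code, num_steps=5)
fine = euler_decode(toy_field, x0, code, num_steps=50)
```

With this field, more steps bring the result closer to the exact solution at the cost of proportionally more decoder evaluations, which is the kind of knob the text describes.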

4. Integration into Policy Architectures

FACT supports seamless policy learning via tokenized action spaces:

WAM-Flow: FACT enables non-causal (bidirectional) trajectory refinement, supporting parallel decoding of all action dimensions. It is trained jointly with vision-language context and simulator-guided reinforcement (GRPO), achieving state-of-the-art closed-loop planning on the NAVSIM v1 benchmark, with PDMS scores of 89.1 (1-step) and 90.3 (5-step) (Xu et al., 5 Dec 2025).
GenieReasoner: FACT forms the low-level motor policy head in a unified VLM–action model, based on a Multimodal Diffusion Transformer backbone. Reasoning gradients and action gradients are aligned through a shared autoregressive training objective over both language and action tokens. This yields strong performance on the ERIQ benchmark as well as real-world tabletop robotic tasks (Liu et al., 30 Dec 2025).

The following table summarizes key FACT integration aspects:

System	Discretization	Decoding Mechanism	Policy Training Schema
WAM-Flow	Scalar VQ	Parallel CTMC-based flow	Non-causal, flow-matching
GenieReasoner	Sign-bit VQ	ODE-based flow integration	Multitask cross-entropy

5. Empirical Evaluation and Ablations

FACT demonstrates substantial empirical gains across both domains:

Autonomous Driving: Table 7 in (Xu et al., 5 Dec 2025) shows PDMS rising from 76.2 (text tokenizer) to 81.1 (numeric tokenizer without metric alignment), then to 83.4 with metric-aligned embeddings, and up to 90.3 after full multimodal and reinforcement training. The progression supports the value of geometry-aware tokenization.
Robotics: FACT yields a trajectory reconstruction MSE roughly an order of magnitude lower than FAST+ (Pertsch et al., 2025) for the same token budget. On the ERIQ benchmark, GenieReasoner with FACT raises accuracy from 58.64% (Qwen2.5-VL-3B) to 82.72%, with action understanding at 96.67%. Real-robot experiments show language-following and success rates comparable to continuous policies, significantly outperforming discrete FAST+ (Liu et al., 30 Dec 2025).
Computation: Integrating the flow ODE at inference adds approximately 20–50 ms per action chunk. Ablation studies reveal a code-length "sweet spot" beyond which fidelity gains plateau while predictive difficulty escalates.

6. Limitations, Regularization, and Future Perspectives

FACT introduces numerical integration as a computational bottleneck. This is acceptable in low-rate scenarios but calls for further optimization (e.g., faster implicit solvers, learned integration schedules) for high-frequency control. It applies geometry regularization directly within the flow-matching objective, obviating auxiliary regularizers. Future work identified by the authors includes deeper integration of "Chain-of-Thought ↔ Flow" connections and exploration of intermediate flow-decoder states as forms of physical reasoning.

A plausible implication is that FACT, by integrating geometry-aware tokenization with expressive flow-based decoding, offers a framework unifying tokenized reasoning and fine-grained action generation for embodied AI applications, as illustrated by its use in both closed-loop autonomous driving and general-purpose robotic manipulation (Xu et al., 5 Dec 2025, Liu et al., 30 Dec 2025).

References

Xu et al. WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving (5 Dec 2025)

Liu et al. Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training (30 Dec 2025)
