
Unified Vision-Language-Action Tokenization

Updated 25 February 2026
  • Unified VLA tokenization is a method that quantizes vision, language, and action into discrete tokens for a unified transformer-based multimodal architecture.
  • It enables seamless cross-modal grounding, symbolic reasoning, and sub-task planning by aligning diverse modalities through a shared vocabulary.
  • The approach overcomes traditional pipeline inefficiencies, enhancing real-world and simulated task performance via flexible, hierarchical control strategies.

Unified Vision-Language-Action (VLA) Tokenization is a methodology by which visual percepts, linguistic inputs, and continuous robot actions are mapped into a single, integrated stream of discrete tokens drawn from a shared vocabulary. This token-level unification enables a transformer-based architecture to process multimodal context and sequentially (or in parallel) generate outputs spanning symbolic reasoning, sub-task planning, and low-level motor commands. Unified VLA tokenization enables seamless information flow, cross-modal grounding, and flexible downstream optimization across both simulation and real-world embodied environments.

1. Foundational Principles and Motivations

Unified VLA tokenization arose to resolve failures and inefficiencies in traditional, pipeline-based embodied AI systems. Classic architectures separated perception (vision), instruction following (language), and low-level actuation (action) via hand-engineered interfaces and module-specific representations. This caused brittle module boundaries, compounded error propagation, and stunted generalization—particularly in long-horizon, multi-step tasks (Yang et al., 31 May 2025).

Recent advances in large vision-language models (VLMs) exposed an opportunity: by mapping all modalities into token IDs drawn from a single superset vocabulary, one can train a single generic transformer for joint vision, language, and action reasoning. This shared token space enforces cross-modal representation learning at the foundation level and gives the architecture a natural way to operate over multimodal contexts (Wang et al., 24 Jun 2025, Chen et al., 3 Nov 2025).

Unified VLA tokenization is thus designed to:

  • Provide explicit alignment between perception, language, and control signals.
  • Enable direct policy learning, video prediction, and world modeling from tokenized demonstrations.
  • Allow flexible supervision, multi-objective optimization, and hierarchical control strategies (Zhong et al., 2 Jul 2025).

2. Modality-Specific Tokenization and Codebook Design

Each input/output modality undergoes its own quantization and embedding process before concatenation into the unified token stream:

Vision: Images are partitioned into fixed patches (e.g., 8×8 or 16×16). Each patch is passed through a frozen visual encoder (e.g., SigLIP, VQ-VAE, MOVQ) to produce a local descriptor, which is then quantized against a vision codebook: $v_t^i = \arg\min_j \| \text{Enc}(x_t)^i - c_j \|_2, \quad i = 1, \dots, N_p$, where $c_j$ is a codebook vector and $N_p$ is the number of patches. The resulting discrete indices $v_t^i$ form the visual token sequence (Wang et al., 24 Jun 2025, Chen et al., 3 Nov 2025).
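The nearest-codebook lookup above can be sketched in a few lines of NumPy (a toy illustration with random descriptors and a random codebook, not any particular model's encoder):

```python
import numpy as np

def quantize_patches(patch_features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each patch descriptor to the index of its nearest codebook vector.

    patch_features: (N_p, d) encoder outputs, one row per patch.
    codebook:       (K, d) codebook vectors c_j.
    Returns:        (N_p,) integer token indices v_t^i.
    """
    # Pairwise L2 distances between every patch descriptor and every code.
    dists = np.linalg.norm(patch_features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy example: 4 patches, an 8-entry codebook, 16-dim descriptors.
rng = np.random.default_rng(0)
tokens = quantize_patches(rng.normal(size=(4, 16)), rng.normal(size=(8, 16)))
assert tokens.shape == (4,) and tokens.dtype.kind == "i"
```

Real systems learn the codebook jointly (VQ-VAE-style) rather than drawing it at random; the argmin lookup itself is unchanged.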

Language: Goals, instructions, and sub-tasks are tokenized using standard BPE/Llama tokenizers, assigning each subword unit or special symbol an integer ID $l_j \in \{0, \dots, |V_{\text{lang}}| - 1\}$ (Yang et al., 31 May 2025).

Action: Continuous actions (e.g., $a \in \mathbb{R}^m$) are quantized through several mechanisms:

  • Uniform binning (LoHoVLA): $q_j = \text{round}(a_j \cdot (B-1)) \in \{0, \dots, B-1\}$, with $B$ bins per dimension and each $a_j$ normalized to $[0, 1]$ (Yang et al., 31 May 2025).
  • DCT+FAST coding (UniVLA, UD-VLA): Joint trajectories are transformed with DCT, then quantized and byte-pair encoded to yield compact, temporally-aware discrete tokens (Wang et al., 24 Jun 2025, Chen et al., 3 Nov 2025).
  • Behavioral quantization (OmniJARVIS): An FSQ encoder compresses $T$-step behavior into a product codebook, generating a small set of semantically rich tokens (Wang et al., 2024).
  • Object-agent centric pooling (Oat-VLA): Visual tokens are pooled over detected objects and the gripper region, sharply reducing token count without sacrificing semantics (Bendikas et al., 28 Sep 2025).
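The uniform-binning scheme from the first bullet reduces to a round-trip pair of functions (a minimal sketch assuming actions are pre-normalized to [0, 1]; `num_bins` plays the role of $B$):

```python
import numpy as np

def bin_actions(actions: np.ndarray, num_bins: int = 256) -> np.ndarray:
    """Uniformly quantize normalized actions in [0, 1] to integer bin tokens."""
    a = np.clip(actions, 0.0, 1.0)
    return np.round(a * (num_bins - 1)).astype(np.int64)

def unbin_actions(tokens: np.ndarray, num_bins: int = 256) -> np.ndarray:
    """Recover approximate continuous actions from bin tokens."""
    return tokens.astype(np.float64) / (num_bins - 1)

a = np.array([0.0, 0.5, 1.0])
q = bin_actions(a, num_bins=256)
assert list(q) == [0, 128, 255]
```

The round trip loses at most half a bin width per dimension, which is why finer schemes such as DCT+FAST coding are preferred when trajectories must be reproduced precisely.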

All modalities share a unified embedding table $E \in \mathbb{R}^{|V| \times d}$, where each token index (from any source) is mapped to a $d$-dimensional representation. Special marker tokens (e.g., <BOS>, <EOACT>, <EOTASK>, <BOI>) denote segment boundaries for loss isolation and inference scheduling (Yang et al., 31 May 2025, Chen et al., 3 Nov 2025).

3. Unified Token Sequence Construction and Model Architecture

Unified VLA models arrange tokens into a single sequence that preserves temporal, causal, and modal dependencies. Canonical layouts include:

Stepwise Concatenation:

$[\text{vision}_t ; \text{goal/sub-task}_t ; \text{action}_t]$

for each $t$ (Yang et al., 31 May 2025), or

$\{L_t^1, V_t^1, A_t^1, \dots\}$

where $L$, $V$, $A$ denote language, vision, and action tokens per step (Wang et al., 24 Jun 2025).

Interleaved Multimodal Streams:

Sequences can combine language tokens (goal), vision (current and future), and actions into a single flattened chain for training via next-token (autoregressive) or diffusion objectives: $[\text{Instruction}, \text{Cur. Vision}, \text{Future Vision}, \text{Action}]$, augmented with begin/end tokens for demarcation (Chen et al., 3 Nov 2025).

Object- and Agent-centric Prefixing:

In Oat-VLA, an object-centric pooling of patch tokens is concatenated with gripper-centric tokens, all prepended to language tokens and presented jointly to the transformer (Bendikas et al., 28 Sep 2025).

The embedding vector for the $t$-th token is

$e_t = W_{\text{tok}}\,\text{one\_hot}(z_t) + E_{\text{pos}}(t) + E_{\text{type}}(m_t)$

where $W_{\text{tok}}$ is the shared embedding table, $E_{\text{pos}}$ encodes position, and $E_{\text{type}}$ labels the source modality (Chen et al., 3 Nov 2025).
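The embedding sum above amounts to three table lookups (a NumPy sketch with random tables; the sizes are illustrative, not taken from any cited model):

```python
import numpy as np

V, d, max_len, n_modalities = 1024, 64, 512, 3  # vocab, embed dim, context, modality count
rng = np.random.default_rng(0)
W_tok = rng.normal(size=(V, d))              # shared token embedding table
E_pos = rng.normal(size=(max_len, d))        # positional embeddings E_pos(t)
E_type = rng.normal(size=(n_modalities, d))  # modality-type embeddings E_type(m_t)

def embed(token_ids: np.ndarray, modality_ids: np.ndarray) -> np.ndarray:
    """e_t = W_tok[z_t] + E_pos[t] + E_type[m_t] for each position t."""
    positions = np.arange(len(token_ids))
    return W_tok[token_ids] + E_pos[positions] + E_type[modality_ids]

# A tiny interleaved stream: language (0), vision (1, 1), action (2) tokens.
z = np.array([5, 900, 901, 17])
m = np.array([0, 1, 1, 2])
E = embed(z, m)
assert E.shape == (4, 64)
```

Because `W_tok` is shared, a vision token and an action token with nearby indices land in the same embedding space, which is precisely what enables cross-modal attention downstream.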

Model Structure:

The composed sequence feeds into a single transformer encoder/decoder stack that supports:

  • Fully shared attention across all tokens (enabling cross-modal fusion and grounding).
  • Modality-specific causal masks (to respect reasoning direction and inference flow) (Chen et al., 3 Nov 2025).
  • Hierarchical closed-loop scheduling (at inference, selectively re-planning sub-tasks or re-decoding actions as dictated by reward feedback and failure counts) (Yang et al., 31 May 2025).

4. Decoding Strategies: Autoregressive, Diffusion, and Closed-loop Integration

Unified VLA models use a variety of decoding paradigms adapted to their problem structure:

Autoregressive Decoding:

Tokens are generated strictly left-to-right, with the transformer outputting one token per step given previous context. This is used in LoHoVLA for joint sub-task and action token emission (Yang et al., 31 May 2025), in UniVLA for state prediction and policy learning (Wang et al., 24 Jun 2025), and in OmniJARVIS for long-horizon instruction-following (Wang et al., 2024).
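Greedy autoregressive decoding reduces to a simple loop (a sketch with a stand-in `logits_fn`; real models sample or beam-search over transformer logits rather than taking a bare argmax):

```python
import numpy as np

def autoregressive_decode(logits_fn, prompt, eos_id, max_new=32):
    """Greedy left-to-right decoding: append the argmax token until EOS.

    logits_fn: maps the current token-id list to a (V,) array of next-token logits.
    """
    seq = list(prompt)
    for _ in range(max_new):
        next_id = int(np.argmax(logits_fn(seq)))
        seq.append(next_id)
        if next_id == eos_id:
            break
    return seq

# Toy "model": always prefers token (last + 1); token 3 serves as EOS.
toy = lambda seq: np.eye(10)[min(seq[-1] + 1, 9)]
out = autoregressive_decode(toy, prompt=[0], eos_id=3)
assert out == [0, 1, 2, 3]
```

The strict left-to-right dependency here is exactly the bottleneck that the diffusion decoders below remove.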

Discrete Diffusion Decoding:

Action (or action+future vision) tokens are jointly masked and iteratively denoised in parallel, following a discrete Markov chain: $Q_t e_r = (1 - \beta_t) e_r + \beta_t e_M$, with $e_M$ the mask token. At each step, the transformer predicts distributions over all masked positions, adaptively commits high-confidence tokens, and remasks uncertain slots (Liang et al., 27 Aug 2025, Chen et al., 3 Nov 2025). This removes left-to-right bottlenecks and permits error correction across inference rounds.
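The mask-and-commit procedure can be sketched as follows (a toy illustration; the commit schedule and the predictor are placeholders, not the cited models' networks):

```python
import numpy as np

MASK = -1  # stand-in index for the mask token e_M

def diffusion_decode(predict_fn, length, steps=4):
    """Parallel mask-and-commit decoding: each round, predict every masked
    position at once and commit only the most confident slots; the rest
    stay masked for the next round."""
    seq = np.full(length, MASK)
    for step in range(steps):
        masked = np.flatnonzero(seq == MASK)
        if masked.size == 0:
            break
        probs = predict_fn(seq)            # (length, V) per-slot distributions
        conf = probs[masked].max(axis=1)   # confidence of each masked slot
        ids = probs[masked].argmax(axis=1)
        # Commit an increasing fraction of slots per round, highest confidence first.
        k = max(1, int(np.ceil(masked.size * (step + 1) / steps)))
        keep = np.argsort(-conf)[:k]
        seq[masked[keep]] = ids[keep]
    return seq

# Toy predictor: slot i always prefers token i, with confidence rising in i.
def toy_predict(seq, V=8):
    p = np.full((len(seq), V), 1e-3)
    for i in range(len(seq)):
        p[i, i] = 0.5 + 0.05 * i
    return p / p.sum(axis=1, keepdims=True)

out = diffusion_decode(toy_predict, length=5)
assert list(out) == [0, 1, 2, 3, 4]
```

Note how all masked slots are predicted in one forward pass per round, which is where the reported reduction in function evaluations comes from.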

Hybrid Closed-Loop Control:

In LoHoVLA, for example, token decoding is organized into alternating planning and control loops: a sequence of sub-task tokens is generated and re-used for several control steps unless failures occur, in which case sub-tasks are re-planned. This two-level scheme boosts task robustness and sample efficiency (Yang et al., 31 May 2025).
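The alternating planning/control loop can be written as plain control flow (a hypothetical `env` / `plan_subtasks` / `decode_action` interface for illustration, not LoHoVLA's actual API):

```python
def closed_loop_episode(plan_subtasks, decode_action, env,
                        max_steps=100, max_failures=3):
    """Two-level decoding: keep emitting low-level actions against the
    current sub-task plan, and re-plan only after repeated failures."""
    subtasks = plan_subtasks(env.observe())
    failures = 0
    for _ in range(max_steps):
        action = decode_action(env.observe(), subtasks)
        if env.step(action):          # env reports per-step success
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                subtasks = plan_subtasks(env.observe())  # re-plan sub-tasks
                failures = 0
        if env.done():
            return True
    return False
```

The key property is that the expensive planner runs only on failure, while the cheap action decoder runs every step.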

5. Empirical Outcomes, Efficiency, and Design Trade-offs

Unifying VLA tokenization yields distinct architectural, practical, and empirical benefits:

| Model/Strategy | Efficiency | Sample/Task Performance | Special Insights |
|---|---|---|---|
| LoHoVLA | Single stream; no module switching | Stronger than hierarchical baselines | Hierarchical correction loop |
| UniVLA | Shared codebook; full sequence modeling | 95.5% SR on LIBERO | Policy via video tokens |
| Oat-VLA | 2× faster convergence, halved token count | 77.1% SR with 16 tokens vs. 256 | Object/gripper pooling |
| VLA-Cache | −27% FLOPs, 1.6× faster, no retraining | <1 pp drop in SR (LIBERO, Jaco) | Token-level reuse |
| Discrete Diff. VLA | 4.7× fewer function evals; parallelized | 96.3% SR, improved robustness | Adaptive re-masking |
| UD-VLA | 4× faster inference vs. autoregressive | SOTA on CALVIN/LIBERO/SimplerEnv | Joint future/action diffusion |
| SwiftVLA | 18× faster, 12× less memory (Jetson) | Comparable to models 7× larger | Fusion + mask-and-reconstruct |
| OmniJARVIS | 128 steps/5 tokens; subtask semantics | 59% SR on multitask Minecraft | Chain-of-thought actions |

6. Taxonomy and Comparative Properties of Action Tokenization

A comprehensive survey (Zhong et al., 2 Jul 2025) details eight major classes of action tokens, which can co-exist within unified token streams:

| Token Type | Formalization | Consumed by |
|---|---|---|
| Language Description | Discrete text sequence | Skill/look-up/policy |
| Code | Executable program | Interpreter/planner |
| Affordance | 2D/3D mask/coords | Motion planner |
| Trajectory | Time sequence of states | Controller |
| Goal State | Future image/video | Planner/predictor |
| Latent Representation | VQ/VAE code | Decoder/policy |
| Raw Action | Discretized/real vector | Robot |
| Reasoning | Chain-of-thought text | Downstream modules |

The taxonomy captures the spectrum from symbolic (language/code) to low-level (raw action) representations, each offering distinct trade-offs in interpretability, generalization, controllability, and sample efficiency. For example, language and code tokens permit high-level, human-readable plans, whereas raw-action tokens deliver maximal precision at the cost of opacity and data requirements (Zhong et al., 2 Jul 2025).

7. Open Challenges and Future Directions

The shift toward unified VLA tokenization introduces a variety of open research problems:

  • Scaling codebooks and tokenization algorithms to high-dimensional, multimodal data without degrading semantic alignment or computation tractability (Chen et al., 3 Nov 2025).
  • Learning object-centric and agent-centric tokenization end-to-end, supporting generalization to new scenes, manipulators, or sensor modalities (Bendikas et al., 28 Sep 2025).
  • Incorporating additional modalities including tactile and audio into a joint token space for richer policy learning (Zhong et al., 2 Jul 2025).
  • Developing differentiable interfaces across action, reasoning, and perception tokens to enable tighter end-to-end optimization loops.
  • Advancing efficient, safe, and interpretable tokenization protocols for long-horizon, safety-critical or human-aligned applications.
  • Exploring adaptive test-time computation (longer reasoning for harder tasks), hierarchical scheduling, and compositional token policy structures (Zhong et al., 2 Jul 2025).

A plausible implication is that future embodied agents will employ multi-tier token representations, leveraging language, affordance, and raw-action tokens interchangeably within the same transformer, leading to improved sample efficiency and robustness in unstructured environments.

