
GenieReasoner: Unified Vision-Language-Action Model

Updated 5 January 2026
  • GenieReasoner is a unified vision-language-action model that integrates semantic reasoning and continuous robotic control in open-world environments.
  • The FACT module discretizes continuous robot trajectories into compact tokens, enabling sub-millimeter reconstruction accuracy and efficient decoding.
  • ERIQ benchmark quantitatively assesses reasoning capabilities by decoupling semantic reasoning from low-level actuation noise, linking abstraction to manipulation success.

GenieReasoner is a unified vision-language-action (VLA) model designed for general-purpose robotic manipulation in open-world environments. It jointly optimizes the semantic reasoning capabilities characteristic of large vision-language models (VLMs) and the high-fidelity continuous control required for robotic action. GenieReasoner achieves this by discretizing both reasoning and motor actions into a common autoregressive token space, allowing a single transformer model to operate seamlessly across both domains. Two central contributions underpin this framework: the FACT (Flow-matching Action Tokenizer) module for compact, accurate action discretization, and ERIQ (Embodied Reasoning Intelligence Quotient), a benchmark for quantitatively separating reasoning from low-level execution (Liu et al., 30 Dec 2025).

1. System Architecture

GenieReasoner is an autoregressive transformer that fuses high-level multimodal reasoning with low-level robotic action. The model operates by encoding sensor and language inputs into discrete tokens, and predicting action tokens which are subsequently decoded into continuous robot trajectories.

  • Vision–Language Backbone: Processes egocentric image streams $\{I_t\}$ and natural-language instructions $l$, alongside a context of prior action tokens, to autoregressively predict the next discrete action tokens.
  • FACT Module: FACT discretizes continuous robot control trajectories $a_{0:H} \in \mathbb{R}^{H \times S}$ into compact discrete codes $c \in \{1, \dots, V\}^L$ using a cross-attention-based encoder with $L$ learnable queries and $D$-dimensional embeddings. A flow-matching decoder reconstructs continuous actions from tokens at inference.
  • Pipeline: During training, demonstration trajectories are encoded to code sequences. The backbone is trained to predict the next code token, while FACT's decoder is trained to reconstruct the original continuous segment using a flow-matching loss. At inference, the model predicts an action-token sequence, which FACT decodes into executable robot commands.
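
The minimal sketch below illustrates this predict-then-decode pipeline at inference time. The module names (`Backbone`, `FACTDecoder`), layer shapes, and the Euler integration of the flow decoder are illustrative assumptions; only the overall flow — autoregressive prediction of discrete action tokens followed by flow-based decoding into a continuous action chunk — comes from the description above.

```python
# Illustrative sketch of the GenieReasoner inference pipeline; not the released implementation.
import torch
import torch.nn as nn

V, L = 4096, 20          # codebook size and tokens per action chunk (from the paper)
H, S = 16, 8             # assumed chunk horizon and per-step action dimension

class Backbone(nn.Module):
    """Stand-in for the vision-language backbone: fused context -> next-token logits."""
    def __init__(self, d=256):
        super().__init__()
        self.tok_emb = nn.Embedding(V, d)
        self.ctx_proj = nn.Linear(512, d)      # assumed fused image + language feature size
        self.head = nn.Linear(d, V)

    def forward(self, ctx_feat, prior_tokens):
        h = self.ctx_proj(ctx_feat).unsqueeze(1)             # (B, 1, d)
        if prior_tokens.numel() > 0:
            h = torch.cat([h, self.tok_emb(prior_tokens)], dim=1)
        return self.head(h.mean(dim=1))                       # (B, V) next-token logits

class FACTDecoder(nn.Module):
    """Stand-in flow-matching decoder: predicts a velocity field conditioned on the codes."""
    def __init__(self, d=256):
        super().__init__()
        self.code_emb = nn.Embedding(V, d)
        self.vel = nn.Sequential(nn.Linear(H * S + d + 1, d), nn.GELU(), nn.Linear(d, H * S))

    def forward(self, a_t, codes, t):
        c = self.code_emb(codes).mean(dim=1)                  # (B, d) pooled code conditioning
        x = torch.cat([a_t.flatten(1), c, t], dim=1)
        return self.vel(x).view(-1, H, S)                     # predicted velocity

@torch.no_grad()
def act(backbone, decoder, ctx_feat, n_steps=10):
    """Predict L action tokens autoregressively, then decode them by Euler ODE integration."""
    B = ctx_feat.shape[0]
    tokens = torch.empty(B, 0, dtype=torch.long)
    for _ in range(L):
        logits = backbone(ctx_feat, tokens)
        tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
    a = torch.randn(B, H, S)                                  # start from Gaussian noise z
    for k in range(n_steps):                                  # integrate da/dt = v(a, c, t)
        t = torch.full((B, 1), k / n_steps)
        a = a + decoder(a, tokens, t) / n_steps
    return a                                                  # continuous H x S action chunk

backbone, decoder = Backbone(), FACTDecoder()
actions = act(backbone, decoder, torch.randn(2, 512))
print(actions.shape)  # torch.Size([2, 16, 8])
```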

2. Training Objectives and Mathematical Framework

The GenieReasoner model is trained using a suite of objectives to ensure both semantic reasoning and precise action execution:

  • Autoregressive Token Prediction: For each trajectory $\zeta$ in dataset $\mathcal{D}$, the model maximizes the log-likelihood of token sequences:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{\zeta \sim \mathcal{D}}\left[ \sum_{\tau=1}^{|\zeta|} \log p_\theta\big( c_{\tau:\tau+\ell} \mid I_\tau, l, \text{prior tokens} \big) \right]$$

  • FACT Encoder Quantization: The encoder maps action chunks to bit-quantized embeddings $c = \text{sign}(e)$, where $e \in \mathbb{R}^{L \times D}$.
  • Flow-Matching Decoder Loss: The decoder $D_\theta(a^{(t)}, c, t)$ is trained to match the true velocity field:

$$L_\text{flow} = \mathbb{E}_{a,\, z \sim \mathcal{N},\, t \sim U[0,1]} \left[ \left\| (a - z) - D_\theta(a^{(t)}, c, t) \right\|_2^2 \right]$$

where $a^{(t)} = (1-t)z + ta$ and $z \sim \mathcal{N}(0, I)$.

  • Quantizer Regularization: To enforce code usage and embedding commitment:
    • Entropy loss $L_\text{entropy}$ encourages code diversity.
    • Commitment loss $L_\text{commit}$ promotes embedding fidelity.
  • Joint Training: FACT is pre-trained separately, then joint optimization interleaves generic VQA, embodied VQA, and tokenized action data, with all objectives backpropagated through discrete codes using a straight-through estimator.
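
A minimal sketch of the action-side objectives follows: sign quantization with a straight-through estimator and the flow-matching reconstruction loss defined above. The toy decoder, all layer sizes, and the training-step scaffolding are illustrative assumptions rather than the released code.

```python
# Illustrative sketch: straight-through sign quantization + flow-matching loss.
import torch
import torch.nn as nn

H, S, L, D = 16, 8, 20, 12   # assumed chunk horizon/action dim; L queries x D bits (from the paper)

def sign_ste(e):
    """c = sign(e) in the forward pass, identity gradient in the backward pass."""
    c = torch.sign(e)
    return e + (c - e).detach()

class ToyFlowDecoder(nn.Module):
    """Toy stand-in for D_theta(a_t, c, t): predicts the velocity field from noisy actions, codes, time."""
    def __init__(self, d=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(H * S + L * D + 1, d), nn.GELU(), nn.Linear(d, H * S))

    def forward(self, a_t, c, t):
        x = torch.cat([a_t.flatten(1), c.flatten(1), t], dim=1)
        return self.net(x).view(-1, H, S)

def flow_matching_loss(decoder, a, c):
    """L_flow = E_{z,t} || (a - z) - D_theta(a_t, c, t) ||_2^2 with a_t = (1 - t) z + t a."""
    B = a.shape[0]
    z = torch.randn_like(a)                    # noise endpoint z ~ N(0, I)
    t = torch.rand(B, 1, 1)                    # t ~ U[0, 1]
    a_t = (1 - t) * z + t * a                  # straight-line interpolation between noise and data
    target = a - z                             # true velocity along that path
    pred = decoder(a_t, c, t.view(B, 1))
    return ((target - pred) ** 2).mean()

# One illustrative step: encode -> binarize (STE) -> decode -> flow loss -> backprop through the codes.
encoder_out = torch.randn(4, L, D, requires_grad=True)   # stand-in for the FACT encoder output
codes = sign_ste(encoder_out)                             # +/-1 codes, differentiable via STE
decoder = ToyFlowDecoder()
loss = flow_matching_loss(decoder, torch.randn(4, H, S), codes)
loss.backward()                                           # gradients reach encoder_out through the STE
print(float(loss))
```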

3. FACT: Flow-Matching Action Tokenizer

FACT addresses the challenge of mapping continuous robotic control to a discrete token space with minimal loss.

  • Code Construction: Each action chunk (e.g., 8 dimensions: end-effector pose + gripper) is compressed to $L = 20$ queries of $D = 12$ bits each. Bit-plane encoding groups these bits into $V = 4096$ codewords (see the bit-packing sketch after this list).
  • Discretization Fidelity: FACT achieves order-of-magnitude lower mean squared error (MSE) in reconstructing actions versus prior methods (e.g., FAST+) at the same code length. Sub-millimeter reconstruction accuracy is attained with only 20 tokens.
  • Flow-Based Decoding: The use of a flow-matching ODE decoder enables direct mapping from discrete tokens to continuous trajectories, decoupling token sequence length from trajectory resolution and enabling high-precision control.
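
The bit-plane code construction can be sketched as follows, using only the quoted numbers ($L = 20$, $D = 12$, $V = 2^{12} = 4096$); the function names and tensor layout are assumptions for illustration.

```python
# Illustrative sketch of packing per-query sign bits into 12-bit codewords.
import torch

L, D = 20, 12
V = 2 ** D    # 4096 codewords

def bits_to_codewords(e):
    """Binarize (L, D) query embeddings and pack each 12-bit row into one integer token."""
    bits = (e > 0).long()                                   # sign bits as {0, 1}
    weights = 2 ** torch.arange(D - 1, -1, -1)              # big-endian bit weights
    return (bits * weights).sum(dim=-1)                     # (..., L) tokens in [0, V)

def codewords_to_bits(tokens):
    """Invert the packing: recover the {-1, +1} bit pattern that conditions the decoder."""
    shifts = torch.arange(D - 1, -1, -1)
    bits = (tokens.unsqueeze(-1) >> shifts) & 1             # (..., L, D) in {0, 1}
    return bits.float() * 2 - 1                             # back to {-1, +1}

e = torch.randn(L, D)                # stand-in for the FACT encoder's query embeddings
tokens = bits_to_codewords(e)        # 20 discrete tokens, each in [0, 4096)
recovered = codewords_to_bits(tokens)
assert torch.equal(recovered, (e > 0).float() * 2 - 1)      # packing is lossless w.r.t. the sign pattern
print(tokens.shape, int(tokens.max()) < V)
```

Because every 12-bit group indexes exactly one of $2^{12} = 4096$ codewords, the token sequence length stays at 20 regardless of the trajectory resolution recovered by the flow decoder.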

4. ERIQ: Embodied Reasoning Intelligence Quotient Benchmark

ERIQ is a large-scale benchmark explicitly designed to decouple reasoning evaluation from execution artifacts.

  • Question Structure: 6,052 multiple-choice or binary (Yes/No) QA pairs across four dimensions:

    1. Spatial Perception & Grounding: Includes category identification, scene layout, task/position grounding, dual-view matching.
    2. Task Planning & Monitoring: Encompasses subtask decomposition, action understanding, progress estimation.
    3. Error Detection & Recovery: Assesses mistake identification, classification, and recovery strategy.
    4. Human Intent Understanding: Measures intent inference and joint action prediction.
  • Metric: The ERIQ score is the aggregate accuracy across all items; it uniquely enables analysis of reasoning without conflating it with low-level actuation noise.
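
A small sketch of the scoring rule is given below; the record format and field names are hypothetical, and only the aggregate-accuracy rule itself is stated in the benchmark description.

```python
# Illustrative ERIQ scoring: aggregate accuracy plus a per-dimension breakdown.
from collections import defaultdict

def eriq_score(items):
    """items: list of dicts with 'dimension', 'prediction', 'answer' keys (hypothetical format)."""
    per_dim = defaultdict(lambda: [0, 0])            # dimension -> [correct, total]
    for it in items:
        per_dim[it["dimension"]][0] += int(it["prediction"] == it["answer"])
        per_dim[it["dimension"]][1] += 1
    total_correct = sum(c for c, _ in per_dim.values())
    total = sum(n for _, n in per_dim.values())
    breakdown = {d: c / n for d, (c, n) in per_dim.items()}
    return total_correct / total, breakdown

items = [
    {"dimension": "Spatial Perception & Grounding", "prediction": "B", "answer": "B"},
    {"dimension": "Task Planning & Monitoring", "prediction": "Yes", "answer": "No"},
    {"dimension": "Error Detection & Recovery", "prediction": "A", "answer": "A"},
    {"dimension": "Human Intent Understanding", "prediction": "C", "answer": "C"},
]
score, by_dim = eriq_score(items)
print(f"ERIQ score: {score:.2%}", by_dim)
```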

5. Empirical Results and Ablation Studies

GenieReasoner achieves robust performance across diagnostic reasoning and real-world control metrics.

  • ERIQ Performance: GenieReasoner (3B) achieves 82.7% accuracy on ERIQ (vs. 58.6% for base models). Notable subtasks: 96.7% in Action Understanding, 96.4% in Human Intention Comprehension, +31% in Dual-view Matching, +25% in Relative Position Grounding.
  • Open-Set Robotic Control: On the AgiBot G₀ platform (five generalization settings), GenieReasoner consistently attains high language-following scores (≈0.85) and task success rates (≈0.80), outperforming both purely continuous (π₀, π₀.₅, GR00T) and discrete (π₀–FAST) baselines.
  • Tokenizer Ablation: FACT outperforms FAST+ in MSE for any code length; the optimal configuration ($L=20$, $V=4096$) achieves high precision with extremely short token sequences.
  • Training Protocol Ablation: Embodied VQA is essential for reasoning (ERIQ +24pp when included). ActionTok alone enables control but with zero reasoning. Joint (EmbVQA+ActionTok) pre-training produces strong reasoning and execution; maintaining embodied reasoning data in post-training is key for optimal performance.

6. Analysis, Significance, and Future Directions

GenieReasoner provides a principled resolution to the reasoning-precision bottleneck in VLA models by unifying language and action into a discrete, autoregressive modality. FACT shifts the fine-control burden to a continuous flow-based decoder, facilitating both semantic abstraction and precise execution without increasing sequence length. ERIQ unambiguously demonstrates that upstream reasoning proficiency predicts downstream manipulation success.

Future research avenues articulated include: deeper integration of chain-of-thought reasoning with action sequence generation (e.g., interleaved CoT and action token prediction), hierarchical multi-scale action tokenization (e.g., layered FACT modules), domain-adaptive online code learning, and ERIQ expansion to longer-horizon, deformable-object, and multi-agent tasks (Liu et al., 30 Dec 2025).
